R·ex / Zeng


MUGer, hacker, developer, amateur UI designer, punster, Japanese learner.

A Trip of Golang Memory Leak Troubleshooting

"Our user reported a live issue, could you please take a look?"

As I started writing backend projects in Golang, my services always become abnormal in unexpected situations, and my job is to solve the problem.

This time, the problem seems to be a memory leak in the backend service of the operation platform, so let's start with the memory curve.

A Sudden Surge in The Curve

The curve given by the Grafana platform roughly conforms to the following characteristics: there are 8 mutation points, each mutation will make the memory rise sharply to nearly 500 MB, rise to near the container limit, and then suddenly fall to 100 MB (because the container crashed and restarted).

A Failed Static Analysis Attempt

Combining the time of mutation, I searched the audit log and found that the business uploaded 8 excel files. But since the specific upload log was lost when the container crashed, I had to use social engineering directly message the business team through PM to get the excel files.

After observing, each file has 40,000 records, and after parsing, it just happened to occupy about 500 MB of memory (yes, for scenarios with a large number of duplicate cells, excel has a super high compression rate, so a file with few MB can be parsed into hundreds of MB in memory). But the problem is: why is it a 500 MB increase each time? Is it because the variable used to store the parsing result is hung on the global or closure? However, it seems that there is no such thing.

Using pprof to Compare Snapshots

If static analysis doesn't reveal the problem, I have to use Golang's most famous analysis tool - pprof. But as a newcomer, I studied the heap / allocs pages for a long time and didn't find any issues. Finally, I found this article, which shows that pprof can do a diff between two snapshots! The general approach is as follows:

# First execute
$ curl -s http://127.0.0.1:6060/debug/pprof/heap > base.heap

# After a period of time, when the memory grows, execute again
$ curl -s http://127.0.0.1:6060/debug/pprof/heap > current.heap

# Compare the differences between the two snapshots
$ go tool pprof --base base.heap current.heap

Then, I generated an SVG through pprof. After opening it, there was no problem, and even some functions had negative memory changes... Even if I ran the service on my Alienware and crazily uploaded large files, after a few minutes, the memory also returned to the initial level.

MADV_FREE and MADV_DONTNEED

After some searching, I found that starting from Golang 1.12, MADV_FREE is used by default to return memory to the system on Linux >= 4.5 kernels, while the previous mode was MADV_DONTNEED. After consulting the Linux manual page, I have a rough understanding:

  • MADV_DONTNEED tells the OS kernel when releasing memory: I no longer need this memory. Although the kernel will decide when to release this memory on its own, the process's RSS will definitely drop immediately.
  • Memory marked with MADV_FREE will only be released by the kernel when we are running out of memory. It can still be reused before being released.

In general, the new feature can speed up the release of memory, but it will cause the process's RSS not to drop immediately. This should be some considerations of Golang, but my intuition tells me that this may be the reason.

Combining some basic knowledge of OS I learned before, I can roughly draw a conclusion: in theory, this memory should be reusable, but after the service runs for a period of time, some of it has already been reused. When the user uploads files again, the service tries to allocate a memory block of the same size. Since the available area of this memory no longer meets the requirements, the system can only reallocate another block elsewhere, which causes the RSS to surge continuously.

There are several ways to solve it:

  1. Considering that the QPS requirements for operation platforms are not high, we can set GODEBUG=madvdontneed=1 to allow idle memory to be reclaimed by the system as soon as possible;
  2. Force the maximum number of uploads to 1,000 records, and add a prompt "Upload up to 1,000 records, you can only upload the data that needs to be added or modified" when exceeding the limit;
  3. Use a streaming excel parsing library to avoid allocating too much memory at once.

However, considering that this requirement is about to be offline, and there will be no such large-scale file upload requirements in the short term, the priority is not high. In the end, we did not make any changes.

To avoid losing logs after the container crashes, we made a technical optimisation to save the logs of each request to the file service. Later, regardless of whether the container crashes, we can find the specific development logs and even the uploaded files based on the time.

Borrowing Without Returning Leads to Overflow

"Look at this curve, does your container have a memory leak?" This time it was the SRE colleague who came to me.

Indeed, a casual glance at the graph given by the SRE shows that it is a memory leak, but this time the curve is very different from the last time - it grows at a relatively slow rate. It looks like some small variables are retained for each request.

This time, the service that went wrong was our data reporting service. In the absence of a clear clue, I opened pprof again as usual.

Suspecting Gin

This time, the big part of memory usage is Golang's gzip package. As far as I know, gzip is only used in one place: Gin's request compression, which means that there may be an unclosed Reader / Writer. So I went to the Gin gzip package to find the reason, and found that it used a pool, which first took an instance from it, then set some Gin response headers, and finally connected the instance's writer to the Gin context. At #L57, defer gz.Close() was executed and the instance was returned to the pool, which seemed to have borrowed, returned, and closed, so it wouldn't cause a memory leak.

Since there is no problem with Gin's gzip usage, could it be a problem with Golang's gzip package itself? I clicked into the Close function and found this code:

func (z *Writer) Close() error {
    // ...
    if !z.wroteHeader {
        z.Write(nil)
        if z.err != nil {
            return z.err
        }
    }
    z.err = z.compressor.Close()
    // ...
}

If there is a problem with the writer, it will not be closed, it must be caused by this! I set a breakpoint at the return here, and sure enough, z.err is like this:

http: request method or response status code does not allow body

According to the error message, I found that the request method is OPTIONS (the reporting service must have a cross-domain problem), and indeed there should be no body.

In order to support cross-domain, we used the gin-contrib/cors middleware. I recalled that its principle is to directly return the result after finding the OPTIONS request and passing through the whitelist filter, instead of continuing c.Next(). So why doesn't it work?

After looking at the code that registers the middleware, I found that we put the gzip middleware first, followed by the CORS middleware... Changing the order solved the problem.

Can Errors Not Be Returned?

So why doesn't Golang automatically close the compressor when an error occurs? After looking at the Write method of the Golang gzip package, I came up with a reason: when the Write method is called multiple times, as long as the first call is successful, there will be an open compressor. If an error occurs later, Golang doesn't know how the developer wants to handle it, so it doesn't make a decision on its own. If the developer:

  • wants to ignore the error and continue writing in, they can call the gz.Reset(newWriter) method, which will keep the compressor and level, and reset all other fields;
  • wants to stop here, they can directly panic according to the writing method in the example, or Reset first and then Close.

It is worth mentioning that Golang recommends developers to reuse the gzip instance as much as possible using Reset, which may be the reason why the Gin gzip middleware uses a pool.

The Lonely Timer

As the request volume of the data reporting service gradually increased, the SRE came to me again, this time it was still a memory leak problem, and the memory curve was steadily rising. But what's different this time is that only the live environment has this problem, while the test environment is completely fine.

Goroutine Leak Without Traces

After experiencing several memory leaks, I have summarized one point: to save the world, please use pprof heap diff. However, this time, the silver bullet failed, and I didn't find any problems from the heap diff... The data reporting service should be a very simple service, and there shouldn't be too many problems.

But it seems that Golang doesn't have other tools to analyze memory problems, so I have to re-examine the functions provided by pprof. On the home page, I found that there are tens of thousands of goroutines running. This is a point that I have never considered before - goroutine leak.

Returning to the normal heap profile page, I found a huge amount of stacktraces of goroutines pointing to the same function, and the conclusion seemed obvious. I clicked into this function and found out why the diff had no effect: this function itself did not define any new variables or generate any new literals, so of course there would be no traces in the heap.

Timers Can't Be Stopped Casually

The function is for batch submission: the data received each time will be pushed into a queue first, and then this function will set a timer (if there is already a timer, it will return directly), and ten seconds later, the data in the queue will be sent to a batch endpoint. But to ensure that the content in the queue is not too much, as long as there are more than 5,000 pieces of data, a new queue will be started to replace the current queue, and then the data in the replaced queue will be submitted in batches directly. In JavaScript, this function is very simple:

let q: ReportData[] = [];
let timerId = 0;

// Receive a single piece of data, if there is too much data, submit directly, otherwise submit at most 10s later
const pushBack = (data: ReportData) => {
    q.push(data);
    if (q.length > 5000) {
        flush();
    } else {
        scheduleFlush();
    }
    return data;
};

// Set the timer
const scheduleFlush = () => {
    if (timerId) return;
    timerId = setTimeout(() => flush(), 10 * 1000);
};

// The actual submission action which will clear the timer and create a new queue
const flush = () => {
    setTimeout(() => commit(q));
    clearTimeout(timerId);
    q = [];
};

But in Golang, timers need to use time.Ticker, so my code looks like this:

func (q *MemQueue) PushBack(ele interface{}) interface{} {
    q.mu.Lock()
    defer q.mu.Unlock()

    q.queue[q.index].PushBack(ele)
    q.len++
    if q.len >= 5000 {
        q.flush()
    } else {
        q.scheduleFlush()
    }

    return ele
}

func (q *MemQueue) scheduleFlush() {
    if q.ticker != nil {
        return
    }
    d := time.Duration(10 * int64(time.Second))
    q.ticker = time.NewTicker(d)
    go func() {
        for {
            if q.ticker == nil {
                return
            }
            <-q.ticker.C
            q.flush()
        }
    }()
}

func (q *MemQueue) flush() {
    go commit(q.queue[q.index])
    q.index = 1 - q.index
    q.len = 0
    q.ticker.Stop()
    q.ticker = nil
    q.queue[q.index] = list.New()
}

The position that was pointed to by the stacktrace before was the goroutine in scheduleFlush, which was stuck at <-q.ticker.C, meaning that the timer had stopped. Indeed, when the data exceeds 5,000 records, the flush function will be called directly. After stopping the timer, although the ticker will be set to nil, since <-q.ticker.C will no longer receive data, it will not continue to the next loop. When the next data is pushed over, a new goroutine will be started in scheduleFlush.

Let the Channel Save the Day

Knowing the reason, I searched a bit and found this answer, and the solution was simple: use an additional channel to keep the loop going.

go func() {
    for {
        if q.ticker == nil {
            return
        }
        select {
        case <-q.forceStop:
            return
        case <-q.ticker.C:
            q.flush()
            return
        }
    }
}()

// Where you want to stop the timer
q.forceStop <- true
q.ticker.Stop()
q.ticker = nil

After deployment, the problem was solved. So why did only the live environment have problems, while the test environment didn't? Because the flush function will be called in two places - scheduleFlush and PushBack. The problem occurred when flush was called directly from PushBack, and this condition was that the queue needed to accumulate 5,000 records within 10s. The QPS of the test environment was not high enough to reach this level, so the problem did not occur.

SRE Alerts Overflowing the Screen

One day after that, I was directly added to the SRE alert group. The latest message in the group was: [error] 11:52 AM ssc k8s pod mem alert reached 92.60%, and before that, there were five messages reporting the same problem.

Me (after looking at the K8S panel): "It is a known memory leak problem. Can a grandee help me to restart it temporarily? We will solve it after the next release."

It seems that I accidentally set a flag for myself, because the next release is only two days away. This project is the backend service of the operation platform mentioned at the beginning of this article.

Problems That GC Can't Solve

The memory curve this time looks very similar to the beginning, with sudden spikes. I originally thought that: the department has developed for a year, the amount of request has increased, and the business scenarios have also increased, so this time I really need to force the memory to be released regularly. So I added a timer:

func freeMemory(r *registry.Registry) {
    debug.FreeOSMemory()
    r.BasicLogger.Info("free-memory", "Free memory successful")
}

c := cron.New()
c.AddJob("@every 10m", &freeMemoryJob{
    f: freeMemory,
})
c.Start()

And then I found that the curve looked like this:

K8S monitoring curve during memory leak

I can only express my feelings at this time in one sentence: crying.jpg. At that time, my boss, who was originally a backend developer, happened to pass by my seat. I hope he didn't see this picture.

The reason why there is a peak every hour in the picture is that we have a scheduled task that runs hourly, which may occupy some memory; the reason why it dropped a lot later is because I redeployed it...

Be Careful with Global Maps

Fine, there should still be some small variables that I forgot to release. I used the same trick again and opened pprof.

Call graph drawn by pprof

With my understanding of my own code, I know what this part of the code is for: it is a technical optimization proposed when solving the initial problem - saving development logs to the file service. How to collect development logs that run through the entire service for a request? One relatively easy way is to set a global variable map[string][]string, which is a mapping from trace ID to the log list of the request. You can add logs to it at any time during the request, save the content of the log list to the file service through the middleware after the request ends, and then delete the data corresponding to this trace ID.

Everything looks seamless - no matter how the process and result of each request are, the content in the map will definitely be deleted in the end, but the curve looks like it hasn't been deleted. The reason why this problem did not occur before was that the amount of logs was relatively small, and the memory growth was not obvious. Now that the amount of logs has increased, this problem has occurred.

Under the pressure of the SRE grandees, I searched by combining the words golang map delete memory leak and also looked at the implementation of map, and referred to this article. I found a problem: when deleting a key in Golang's map, it only marks the key as empty instead of deleting it completely, so that some time can be saved when the same key is inserted next time. Considering that map uses a hash table, frequent resizing does affect performance, so this design looks reasonable.

But there is a serious problem with my use case: my key is the trace ID, which means that there is almost no possibility of repetition, and almost all values will be retained. The previous QPS was very low, and the amount of logs was also small, so this problem was not so obvious; now some scheduled tasks have thousands of logs, and this problem has been magnified.

How to solve it? In fact, it is very simple, just remove the global map, use middleware to generate a new logger in each Gin context, and finally only upload the content of this logger. In addition, during this period, the company's infra team developed a self-built log platform, and I also unhesitatingly accepted it, so the logs no longer need to be saved to the file service.

After running for a while and looking at the monitor, I found that the memory leak had been solved; I used pprof diff and found that there were no more functions marked in red, and even some functions were marked in green, meaning that the memory usage of these functions had decreased.

It can be seen that the memory leak has been solved

The reason why there was still a sign of memory increase at the beginning was that the service cached the user data for efficiency. The amount of data was at most tens of thousands, which did not put much pressure on the memory.

Summary

After this period of struggle with memory leaks being completely defeated, I found that I seemed to have encountered all common memory leak problems once, which can be considered as accumulating some experience. I am not sure if there are still potential memory leak risks in the current project, but I have enough confidence to solve both unknown and known problems.

Disqus is loading... If it fails to load, please add disqus.com and disquscdn.com to your whitelist.

We've been together for

3874 days