peterkarp for BuzzFeed Tech

Finding and Fixing Memory Leaks in Python

One of the major benefits of dynamic interpreted languages such as Python is that they make managing memory easy. Objects such as arrays and strings grow dynamically as needed, and their memory is reclaimed once they are no longer referenced. Because memory management is handled by the language, memory leaks are a less common problem than in languages like C and C++, where the programmer must request and free memory explicitly.

The BuzzFeed technology stack includes a microservice architecture that supports over a hundred services, many of which are built with Python. We monitor the services for common system properties such as memory and load. In the case of memory, a well-behaved service uses memory and frees it again, producing a chart like this one of memory usage over a three-month period.

a graph showing expected memory usage over three months

A microservice that leaks memory will instead exhibit saw-tooth behavior: memory grows until it reaches some limit (for example, the maximum memory available), at which point the service is shut down, freeing all of its memory, and then restarted.

a graph showing irregular spikes and dips in memory usage

Sometimes a code review will identify places where underlying operating system resources, such as file handles, are allocated but never freed. These resources are limited, each one holds a small amount of memory while in use, and they need to be released after use so that others can use them.
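To make that concrete (this example is mine, not from the original incident), the classic pattern a review might flag looks like this, with a context manager as the usual fix:

def read_config_leaky(path):
    # The file handle is never released, especially if an exception is
    # raised before a close() could run.
    f = open(path)
    return f.read()

def read_config(path):
    # The idiomatic fix: a context manager guarantees the handle is closed.
    with open(path) as f:
        return f.read()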

This post first describes the tools used to identify the source of a memory leak. It then presents a real example of a Python application that leaked memory and shows how these tools were used to track the leak down.

Tools

If a code review does not turn up any viable suspects, then it is time to turn to tools for tracking down memory leaks. The first tool should provide a way to chart memory usage over time. At BuzzFeed, we use DataDog to monitor microservice performance. Leaks may accumulate slowly, a few bytes at a time, so it is necessary to chart the memory growth over time to see the trend.
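The charts in this post come from DataDog, but any metrics pipeline will do as long as the process reports its memory periodically. As a rough sketch (not the code we actually run), the standard library's resource module can supply the number to chart on Unix systems:

import logging
import resource

def report_memory_usage():
    # ru_maxrss is the peak resident set size, in kilobytes on Linux
    # (bytes on macOS); ship this value to your metrics system on a timer.
    peak_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    logging.info("peak_rss_kb=%s", peak_rss_kb)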

The other tool, tracemalloc, is part of the Python standard library. Essentially, tracemalloc takes snapshots of Python's memory allocations. To begin, call tracemalloc.start() to initialize it, then take a snapshot with:

snapshot = tracemalloc.take_snapshot()

tracemalloc can show a sorted list of the top allocations in a snapshot via the statistics() method. In this snippet the top five allocations, grouped by source filename, are logged (the snippets in this post use a structured logger, so values are passed as keyword arguments).

for i, stat in enumerate(snapshot.statistics('filename')[:5], 1):
    logger.info("top_current", i=i, stat=str(stat))

The output will look similar to this:

1 /usr/local/lib/python3.6/ssl.py:0: size=2569 KiB, count=41, average=62.7 KiB
2 /usr/local/lib/python3.6/tracemalloc.py:0: size=944 KiB, count=15152, average=64 B
3 /usr/local/lib/python3.6/socket.py:0: size=575 KiB, count=4461, average=132 B
4 /usr/local/lib/python3.6/site-packages/tornado/gen.py:0: size=142 KiB, count=500, average=290 B
5 /usr/local/lib/python3.6/mimetypes.py:0: size=130 KiB, count=1686, average=79 B

This shows, per module, the total size of the allocations, the number of allocated memory blocks, and their average size.
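Putting these pieces together, a minimal self-contained version of the workflow (using print() in place of the structured logger) looks roughly like this:

import tracemalloc

tracemalloc.start()

# ... exercise the code suspected of leaking; this throwaway list is a stand-in ...
workload = [list(range(1000)) for _ in range(100)]

snapshot = tracemalloc.take_snapshot()
for i, stat in enumerate(snapshot.statistics('filename')[:5], 1):
    print(i, stat)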

We take a snapshot at the start of our program and implement a callback that runs every few minutes to take another snapshot of the memory. Comparing two snapshots shows how memory allocation has changed; we compare each new snapshot to the one taken at the start. By watching for allocations that keep increasing over time, we can spot the objects that are leaking memory. The compare_to() method is called on a snapshot to compare it with another snapshot, and the 'filename' parameter groups all allocations by module. This helps narrow the search to a module that is leaking memory.

current = tracemalloc.take_snapshot()
stats = current.compare_to(start, 'filename')
for i, stat in enumerate(stats[:5], 1):
    logging.info("since_start", i=i, stat=str(stat))

The output will look similar to this:

1 /usr/local/lib/python3.6/ssl.py:0: size=2569 KiB (+2569 KiB), count=43 (+43), average=59.8 KiB
2 /usr/local/lib/python3.6/socket.py:0: size=1845 KiB (+1845 KiB), count=13761 (+13761), average=137 B
3 /usr/local/lib/python3.6/tracemalloc.py:0: size=671 KiB (+671 KiB), count=10862 (+10862), average=63 B
4 /usr/local/lib/python3.6/linecache.py:0: size=371 KiB (+371 KiB), count=3639 (+3639), average=104 B
5 /usr/local/lib/python3.6/mimetypes.py:0: size=126 KiB (+126 KiB), count=1679 (+1679), average=77 B

Once a suspect module is identified, it may be possible to find the exact line of code responsible for a memory allocation. tracemalloc provides a way to view a stack trace for any memory allocation. As with a Python exception traceback, it shows the line and module where an allocation occurred and all the calls that came before.

traces = current.statistics('traceback')
for stat in traces[:1]:
    logger.info("traceback", memory_blocks=stat.count, size_kB=stat.size / 1024)
    for line in stat.traceback.format():
        logger.info(line)

memory_blocks=2725 size_kB=346.0341796875
  File "/usr/local/lib/python3.6/socket.py", line 657
    self._sock = None
  File "/usr/local/lib/python3.6/http/client.py", line 403
    fp.close()
  File "/usr/local/lib/python3.6/http/client.py", line 410
    self._close_conn()
  File "/usr/local/lib/python3.6/site-packages/ddtrace/writer.py", line 166
    result_traces = None
  File "/usr/local/lib/python3.6/threading.py", line 864
  File "/usr/local/lib/python3.6/threading.py", line 916
    self.run()
  File "/usr/local/lib/python3.6/threading.py", line 884
    self._bootstrap_inner()

Reading bottom to top, this shows a trace to a line in the socket module where a memory allocation took place. With this information, it may be possible to finally isolate the cause of the memory leak.

In this first section, we saw that tracemalloc takes snapshots of memory and provides statistics about the memory allocation. The next section describes the search for an actual memory leak in one BuzzFeed microservice.

The Search for Our Memory Leak

Over several months we observed the classic saw-tooth pattern of an application with a memory leak.

a graph showing the classic saw-tooth line of a memory leak, with highlights indicating snapshots taken

We instrumented the microservice with a call to trace_leak() to log the statistics found in the tracemalloc snapshots. The coroutine loops forever, sleeping for a configurable delay on each iteration.

async def trace_leak(delay=60, top=20, trace=1):
    """
    Use spawn_callback to invoke:
        tornado.ioloop.IOLoop.current().spawn_callback(trace_leak, delay=300, top=10, trace=3)
    :param delay: in seconds (int)
    :param top: number of top allocations to list (int)
    :param trace: number of top allocations to trace (int)
    :return: None
    """
    logger.info('start_trace', delay=delay, top=top, trace=trace)
    tracemalloc.start(25)
    start = tracemalloc.take_snapshot()
    prev = start
    while True:
        await tornado.gen.sleep(delay)
        current = tracemalloc.take_snapshot()

        # compare current snapshot to starting snapshot
        stats = current.compare_to(start, 'filename')
        # compare current snapshot to previous snapshot
        prev_stats = current.compare_to(prev, 'lineno')

        logger.info('Top Diffs since Start')
        # Print top diffs: current snapshot - start snapshot
        for i, stat in enumerate(stats[:top], 1):
            logger.info('top_diffs', i=i, stat=str(stat))

        logger.info('Top Incremental')
        # Print top incremental stats: current snapshot - previous snapshot
        for i, stat in enumerate(prev_stats[:top], 1):
            logger.info('top_incremental', i=i, stat=str(stat))

        logger.info('Top Current')
        # Print top current stats
        for i, stat in enumerate(current.statistics('filename')[:top], 1):
            logger.info('top_current', i=i, stat=str(stat))

        # get tracebacks (stack trace) for the current snapshot
        traces = current.statistics('traceback')
        for stat in traces[:trace]:
            logger.info('traceback', memory_blocks=stat.count, size_kB=stat.size / 1024)
            for line in stat.traceback.format():
                logger.info(line)

        # set previous snapshot to current snapshot
        prev = current

The microservice is built on Tornado, so we launch trace_leak() with spawn_callback(), passing the parameters delay, top, and trace:

tornado.ioloop.IOLoop.current().spawn_callback(trace_leak, delay=300, top=5, trace=1)

The logs for a single iteration showed allocations occurring in several modules:

1 /usr/local/lib/python3.6/ssl.py:0: size=2569 KiB (+2569 KiB), count=43 (+43), average=59.8 KiB
2 /usr/local/lib/python3.6/socket.py:0: size=1845 KiB (+1845 KiB), count=13761 (+13761), average=137 B
3 /usr/local/lib/python3.6/tracemalloc.py:0: size=671 KiB (+671 KiB), count=10862 (+10862), average=63 B
4 /usr/local/lib/python3.6/linecache.py:0: size=371 KiB (+371 KiB), count=3639 (+3639), average=104 B
5 /usr/local/lib/python3.6/mimetypes.py:0: size=126 KiB (+126 KiB), count=1679 (+1679), average=77 B

tracemalloc is not the source of the memory leak! It does, however, use some memory of its own, so it shows up here. After running the service for several hours, we use DataDog to filter the logs by module and start to see a pattern with socket.py:

/usr/local/lib/python3.6/socket.py:0: size=1840 KiB (+1840 KiB)
/usr/local/lib/python3.6/socket.py:0: size=1840 KiB (+1840 KiB)
/usr/local/lib/python3.6/socket.py:0: size=1841 KiB (+1841 KiB)
#                               Increase here ^
/usr/local/lib/python3.6/socket.py:0: size=1841 KiB (+1841 KiB)
/usr/local/lib/python3.6/socket.py:0: size=1841 KiB (+1841 KiB)
/usr/local/lib/python3.6/socket.py:0: size=1841 KiB (+1841 KiB)
/usr/local/lib/python3.6/socket.py:0: size=1842 KiB (+1842 KiB)
#                               Increase here ^
/usr/local/lib/python3.6/socket.py:0: size=1843 KiB (+1843 KiB)
#                               Increase here ^
/usr/local/lib/python3.6/socket.py:0: size=1843 KiB (+1843 KiB)
/usr/local/lib/python3.6/socket.py:0: size=1844 KiB (+1844 KiB)
#                               Increase here ^
/usr/local/lib/python3.6/socket.py:0: size=1844 KiB (+1844 KiB)
/usr/local/lib/python3.6/socket.py:0: size=1845 KiB (+1845 KiB)
#                               Increase here ^
/usr/local/lib/python3.6/socket.py:0: size=1845 KiB (+1845 KiB)

The size of the allocations attributed to socket.py is creeping up from 1840 KiB to 1845 KiB. None of the other modules exhibited such a clear trend. We next look at the traceback for socket.py.
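A related trick not shown in the logs above: tracemalloc's Filter can narrow a snapshot to a single suspect file before pulling tracebacks, which keeps the output focused. A sketch, reusing the current snapshot and logger from trace_leak():

# Keep only allocations whose most recent frame is in socket.py, then log
# the traceback of the largest one.
socket_only = current.filter_traces([
    tracemalloc.Filter(inclusive=True, filename_pattern="*/socket.py"),
])
for stat in socket_only.statistics('traceback')[:1]:
    logger.info('traceback', memory_blocks=stat.count, size_kB=stat.size / 1024)
    for line in stat.traceback.format():
        logger.info(line)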

We identify a possible cause

We get a stack trace from tracemalloc for the socket module.

  File "/usr/local/lib/python3.6/socket.py", line 657
    self._sock = None
  File "/usr/local/lib/python3.6/http/client.py", line 403
    fp.close()
  File "/usr/local/lib/python3.6/http/client.py", line 410
    self._close_conn()
  File "/usr/local/lib/python3.6/site-packages/ddtrace/writer.py", line 166
    result_traces = None
  File "/usr/local/lib/python3.6/threading.py", line 864
  File "/usr/local/lib/python3.6/threading.py", line 916
    self.run()
  File "/usr/local/lib/python3.6/threading.py", line 884
    self._bootstrap_inner()

Initially, I want to assume that Python and its standard library are solid and not leaking memory. Everything in this trace is part of the Python 3.6 standard library except for a package from DataDog, ddtrace/writer.py. Given my assumption about the integrity of Python, a third-party package seems like a good place to start investigating further.

It's still leaking

We find when ddtrace was added to our service, do a quick rollback of the requirements, and then start monitoring the memory again.

a graph showing a not-as-frequent saw-tooth memory leak graph

Another look at the logs

Over the course of several days, the memory continues to rise. Removing the module did not stop the leak; we had not found the culprit. So it's back to the logs to find another suspect.

1: /usr/local/lib/python3.6/ssl.py:0: size=2568 KiB (+2568 KiB), count=28 (+28), average=91.7 KiB
2: /usr/local/lib/python3.6/tracemalloc.py:0: size=816 KiB (+816 KiB), count=13126 (+13126), average=64 B
3: /usr/local/lib/python3.6/linecache.py:0: size=521 KiB (+521 KiB), count=5150 (+5150), average=104 B
4: /usr/local/lib/python3.6/mimetypes.py:0: size=130 KiB (+130 KiB), count=1699 (+1699), average=78 B
5: /usr/local/lib/python3.6/site-packages/tornado/gen.py:0: size=120 KiB (+120 KiB), count=368 (+368),

There is nothing in these logs that looks suspicious on its own. However, ssl.py is allocating the largest chunk by far, 2.5 MB of memory. Over time the logs show that this remains constant, neither increasing nor decreasing. Without much else to go on, we start checking the tracebacks for ssl.py.

  File "/usr/local/lib/python3.6/ssl.py", line 645
    return self._sslobj.peer_certificate(binary_form)
  File "/usr/local/lib/python3.6/ssl.py", line 688
  File "/usr/local/lib/python3.6/ssl.py", line 1061
    self._sslobj.do_handshake()
  File "/usr/local/lib/python3.6/site-packages/tornado/iostream.py", line 1310
    self.socket.do_handshake()
  File "/usr/local/lib/python3.6/site-packages/tornado/iostream.py", line 1390
    self._do_ssl_handshake()
  File "/usr/local/lib/python3.6/site-packages/tornado/iostream.py", line 519
    self._handle_read()
  File "/usr/local/lib/python3.6/site-packages/tornado/stack_context.py", line 277
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/tornado/ioloop.py", line 888
    handler_func(fd_obj, events)
  File "/app/socialmc.py", line 76
    tornado.ioloop.IOLoop.instance().start()

A solid lead

The top of the stack shows a call on line 645 of ssl.py to peer_certificate(). A long-shot Google search for “python memory leak SSL peer_certificate” turns up a link to a Python bug report. Fortunately, the bug had already been resolved. Now it was simply a matter of updating our container image from Python 3.6.1 to Python 3.6.4 to pick up the fixed version and see whether it resolved our memory leak.
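As a follow-up the post does not cover, a startup guard can keep the service from ever being deployed again on an interpreter that predates the fix; a hypothetical sketch:

import sys

# The ssl peer_certificate() leak we hit is fixed in later 3.6.x releases,
# so refuse to start on anything older than the 3.6.4 image we deployed.
if sys.version_info < (3, 6, 4):
    raise RuntimeError("Python 3.6.4+ required: older 3.6.x leaks memory in ssl")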

Looks good

After updating the image we monitor the memory again with DataDog. Following a fresh deploy around Sept. 9th, the memory now runs flat.

a graph showing the declining spikes and memory usage flattening out

Summary

Having the right tools for the job can make the difference between solving a problem and not. The search for our memory leak took place over two months. tracemalloc provides good insight into the memory allocations happening in a Python program; however, it cannot see allocations made by packages whose C/C++ code bypasses Python's memory allocator. In the end, tracking down memory leaks requires patience, persistence, and a bit of detective work.

References

https://docs.python.org/3/library/tracemalloc.html
https://www.fugue.co/blog/2017-03-06-diagnosing-and-fixing-memory-leaks-in-python.html


This post was originally published on Jan. 17th, 2019, on BF Tech's Medium.