Investigating memory leaks in Python
Posted on Nov. 7, 2011 by Ben Dickson.
This is an old post I wrote, which was originally posted on neverfear.org on Mon, 7th Nov 2011 1:32:11. Archived here for posterity.
Memory leaks are difficult to track down. A nice description of finding one is "A server memory leak" (archive.org link), a story about tracing a memory leak in a Django application.
I'm trying to work out why a server process is slowly using more and more memory. The server parses a stream of text, and shoves stuff into a database. It's written in Python, and about the only external code it uses is SQLAlchemy.
I found a combination of Meliae (archive.org link) and objgraph (archive.org link) useful. Pympler (archive.org link) also seems useful (particularly the tracker class (archive.org link)). Dowser (archive.org link) also looks useful, and might have made things a bit easier.
So, first we need to get some memory-analysing code into the server. You could use pyrasite (archive.org link), which injects code into an existing process using gdb (it even has helpers for Meliae).
However, it's usually easier and more reliable to modify the code a bit.
The simplest and tidiest way is to use a signal handler which starts pdb. To steal an idea from Bazaar (archive.org link), SIGQUIT is perfect for this.
In the code's main() function, add something like this:
def start_pdb(signal, trace):
    import pdb
    pdb.set_trace()

import signal
signal.signal(signal.SIGQUIT, start_pdb)
Then, when the code is running in a terminal, pressing ctrl+\ will send SIGQUIT and start pdb. You can then run any Python code you feel like. Afterwards, you can either let the program continue running with "c" (or "continue"), or kill it with "q" ("quit", or ctrl+d):
$ python sigquit_test.py
^\--Return--
> sigquit_test.py(10)start_pdb()->None
-> pdb.set_trace()
(Pdb) c
^\--Return--
> sigquit_test.py(10)start_pdb()->None
-> pdb.set_trace()
(Pdb) q
Traceback (most recent call last):
[...]
bdb.BdbQuit
$
Now, we can run the server, and press ctrl+\ at any time to start up a Python shell in that environment.
Firstly, we can use objgraph to see what the most common types of objects in memory are:
(Pdb) import objgraph
(Pdb) objgraph.show_most_common_types()
dict 150951
tuple 82757
InstanceState 65303
[...]
Perhaps unsurprisingly, there are a lot of dict and tuple objects. More interestingly, there are a lot of InstanceState objects.
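(For reference, objgraph's count is roughly doing something like this stdlib-only sketch: walk every object the garbage collector currently tracks, and tally by type name. The helper name is mine, not objgraph's.)

```python
import gc
from collections import Counter

def most_common_types(limit=5):
    # Tally every object the garbage collector currently tracks,
    # grouped by type name -- roughly what objgraph reports.
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return counts.most_common(limit)

for name, count in most_common_types():
    print(name, count)
```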
Next, we can use Meliae to show how much memory each data type is using:
(Pdb) import meliae.scanner
(Pdb) meliae.scanner.dump_all_objects("/tmp/memorydump.json")
(Pdb)
Then, in another Python shell we can analyse this dump:
>>> import meliae.loader
>>> md = meliae.loader.load("/tmp/memorydump.json")
loaded line 775946, 775947 objs, 118.7 / 118.7 MiB read in 21.3s
checked 775946 / 775947 collapsed 74972
set parents 700974 / 700975
collapsed in 11.1s
>>> md.summarize()
Total 700975 objects, 874 types, Total size = 276.9MiB (290327578 bytes)
Index Count % Size % Cum Max Kind
0 183823 26 176767720 60 60 786712 dict
1 65303 9 72616936 25 85 1112 InstanceState
2 151981 21 10535744 3 89 520 tuple
So that InstanceState is definitely not helping the memory usage. Does it relate to those dictionaries?
>>> md.get_all('InstanceState')[0].p
[WeakInstanceDict(4333095200 6291752B 3856refs 1par), dict(4760925744 3352B 2refs 1par), instancemethod(4763880816 80B 3refs 1par), dict(4759467200 3352B 90refs 1par)]
The get_all method returns a list of all the InstanceState objects, ordered by size, so the first index is the largest one. It contains references to two dicts, a weak-ref and an instance method.
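Those WeakInstanceDict references are weak, so by themselves they shouldn't keep objects alive. As a rough stdlib illustration of that property (the Record class and key are made up, this is not SQLAlchemy's code):

```python
import gc
import weakref

class Record(object):
    """Hypothetical object tracked by a weak-reference container."""
    pass

registry = weakref.WeakValueDictionary()
obj = Record()
registry["pk-1"] = obj

print(len(registry))  # → 1: the object is alive, so the entry exists
del obj               # drop the only strong reference
gc.collect()
print(len(registry))  # → 0: the weak entry vanished with the object
```

So weak references alone don't explain the growth; something must be holding strong references to the instances as well.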
For a better explanation, I recommend reading "Memory debugging with Meliae" (archive.org link) and "Step-by-step Meliae" (archive.org link), written by the author of Meliae.
This info is useful, but maybe a little unintuitive. Luckily, objgraph exists. So, back to the pdb prompt in our "live" server:
(Pdb) import objgraph
(Pdb) objgraph.count("InstanceState")
65303
(Pdb) objgraph.show_backrefs(objgraph.by_type("InstanceState")[0:50])
Graph written to objects.dot (316 nodes)
Image generated as objects.png
objgraph (archive.org link) is nice in that it is pure Python, and produces pretty reference graphs (by writing .dot files, which are rendered using Graphviz):

I'd recommend reading "Hunting memory leaks in Python" (archive.org link) and "Python object graphs" (archive.org link), written by the author of objgraph.
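objgraph builds those graphs from the interpreter's own reference information. If you just want to know what is holding a particular object, without generating a picture, the stdlib's gc.get_referrers gives you the raw data. A minimal sketch (CacheEntry and the key are made-up stand-ins for whatever you suspect is leaking):

```python
import gc

class CacheEntry(object):
    """Hypothetical stand-in for the type you suspect is leaking."""
    pass

cache = {}
entry = CacheEntry()
cache["some-key"] = entry

# gc.get_referrers returns every container the collector knows is
# holding a reference to `entry` -- the raw data objgraph graphs.
referrers = gc.get_referrers(entry)
holders = [r for r in referrers if isinstance(r, dict) and "some-key" in r]
print(len(holders))  # → 1: the cache dict is found as a referrer
```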
Now, this description so far makes the problem seem a bit easy and linear: instrument up the code, look at what is using the most memory, and... there's the problem! Not quite.
After a bit of prodding around, the InstanceState objects seemed to be mostly referenced by a Least Recently Used (LRU) (archive.org link) cache in the code. I wasn't sure if this memory usage was legitimate caching, or if the tree pictured above was causing leaky reference cycles.
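The server's LRU implementation isn't shown here, but the distinction matters: a correctly bounded LRU cache evicts old entries, so its memory use plateaus. As a sketch of what "legitimate caching" looks like (this is my own minimal version, assuming nothing about the real code):

```python
from collections import OrderedDict

class LRUCache(object):
    """Minimal bounded LRU cache -- a sketch, not the server's code."""
    def __init__(self, max_size):
        self.max_size = max_size
        self._data = OrderedDict()

    def put(self, key, value):
        if key in self._data:
            del self._data[key]
        elif len(self._data) >= self.max_size:
            self._data.popitem(last=False)  # evict the least recently used
        self._data[key] = value

    def get(self, key):
        value = self._data.pop(key)
        self._data[key] = value             # mark as most recently used
        return value

cache = LRUCache(max_size=2)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)           # bounded: "a" is evicted, memory stays flat
print("a" in cache._data)   # → False
```

If the real cache behaves like this, the InstanceState count should level off rather than grow forever; that's exactly what the logging below is meant to check.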
The code commits to the database approximately every 30 seconds, so I added some memory usage logging code into this.
Firstly, I created a function based on objgraph's show_most_common_types, which writes the most common objects to a log file:
def debug_log_common_objects():
    print "Logging common objects"
    import gc
    print "garbage collector got", gc.collect()
    import objgraph
    import operator
    import time
    # Reimplement show_most_common_types, but write to a file instead
    stats = sorted(objgraph.typestats().items(), key=operator.itemgetter(1),
                   reverse=True)
    limit = 30
    if limit:
        stats = stats[:limit]
    width = max(len(name) for name, count in stats)
    out = ["#START %s" % time.time()] + \
          ["%s %s" % (name.ljust(width), count) for name, count in stats]
    f = open("most_common_object.txt", "a")
    f.write("\n".join(out) + "\n")
    f.close()
    print "Logging common objects - done"
Then I left the process running for a while. This function does exactly what we first did at the pdb prompt (calling "objgraph.show_most_common_types()"), but with a timestamp:
#START 1320397336.95
tuple 30340
dict 16511
function 7662
list 4792
set 4174
[...]
#START 1320397377.14
tuple 32176
dict 20643
function 7662
list 4901
set 4185
[...]
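Before graphing anything, a quick numeric check over these snapshots already flags growth: diff the first and last samples, and keep the types whose counts rose. A small sketch (the helper name and threshold are mine, not from the server code), using the numbers above:

```python
def growing_types(snapshots, min_increase=100):
    """Return type names whose count grew by at least `min_increase`
    between the first and last snapshot, biggest growth first."""
    first, last = snapshots[0], snapshots[-1]
    growth = dict((name, last[name] - first.get(name, 0)) for name in last)
    return sorted((name for name in growth if growth[name] >= min_increase),
                  key=lambda name: -growth[name])

snapshots = [
    {"tuple": 30340, "dict": 16511, "function": 7662},
    {"tuple": 32176, "dict": 20643, "function": 7662},
]
print(growing_types(snapshots))  # → ['dict', 'tuple']
```

A graph over time tells you more, though: a type can grow between two samples but still be fluctuating around a stable level.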
Then, I wrote a simple, horribly crappy script to parse and graph these, so I could see which objects were increasing over time:
#!/usr/bin/env python
import datetime
import matplotlib.pyplot as plt

f = open("most_common_object.txt")

alltimings = {}
things = {}
start = None
for line in f.readlines():
    if line.startswith("#START"):
        if start is not None:
            alltimings[start] = things
        _, _, thetime = line.strip().partition(" ")
        start = float(thetime)
        things = {}
    else:
        key, _, value = line.partition(" ")
        things[key] = int(value.strip())
# Don't forget the final block
if start is not None:
    alltimings[start] = things

times = {}
values = {}
for when, things in sorted(alltimings.items()):
    for t, howmany in things.items():
        times.setdefault(t, []).append(datetime.datetime.fromtimestamp(when))
        values.setdefault(t, []).append(howmany)

fig = plt.figure(figsize=(15, 200))
fig.subplots_adjust(hspace=.2)
for i, key in enumerate(times):
    ax = fig.add_subplot(len(times), 1, i + 1)
    ax.set_title(key)
    ax.set_ylabel("Number of objects")
    ax.plot(times[key], values[key])
plt.savefig("mem.pdf")
This produces a PDF full of graphs, some of which look like these:

So while there are a lot of dict objects, the number is fluctuating reasonably. Same with InstanceState, so that wasn't the leak (if there even is one).
I later confirmed this suspicion on StackOverflow (archive.org link) - the graph presented above will not leak - as of Python 2.2, the garbage collector gained cycle detection. There is one important caveat: if any object in the cycle has a __del__ method, it will NOT be garbage collected.
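The cycle-detection behaviour is easy to demonstrate with a toy cycle (Node is a made-up class for the demo):

```python
import gc

class Node(object):
    pass

a = Node()
b = Node()
a.partner = b      # a -> b
b.partner = a      # b -> a: a reference cycle

del a
del b              # no names reach the pair any more
collected = gc.collect()    # the cycle detector reclaims the cycle
print(collected >= 2)       # → True (both Nodes, plus their __dict__s)
```

Give Node a __del__ method, though, and under Python 2 the whole cycle is parked in gc.garbage instead of being freed.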
However, the number of CacheEntities was growing suspiciously, as were a few other types:

...so I think that's what I'll investigate next.
As I said, memory leaks are difficult to track down - I still haven't got to the bottom of the problem, but hopefully these techniques will help others.