Plugging Leaks With Multiprocessing

At one time or another, we all have to deal with Python modules that are simply wrappers around code written in other languages such as C or Fortran. Particularly when it comes to scientific data, there are too many maintained, pre-existing code libraries to ignore. Being good glue for such libraries is one of Python's great strengths.

However, code libraries written in static languages without garbage collection often have memory leaks. Sometimes the memory leaks are even well known. Python prevents its own objects from leaking memory through its garbage collection. But extension libraries can leave behind unfreed memory that Python's garbage collector cannot reach. If the leaky code is invoked repeatedly within the same program, these leaks build up and can eventually crash your Python process.

If you can isolate the offending leaky extension objects into a function, then you are in luck. Python will allow you to call that function in its own process. Then when the process ends, your operating system will reclaim all that leaked memory for you.

The multiprocessing module is your friend. Use it to run leaky extensions in their own processes:

import multiprocessing

def sir_leaks_a_lot(datum):
    # Put your leaky extension code here.
    # For instance, matplotlib functions which
    # crash with "std::bad_alloc" errors when
    # called repeatedly.
    pass

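if __name__ == "__main__":
    # Placeholder input; replace with your own work items.
    data = range(3)
    for datum in data:
        # A fresh process per datum; its memory is freed when it exits.
        p = multiprocessing.Process(target=sir_leaks_a_lot, args=(datum,))
        p.start()
        p.join()
        # exitcode is 0 on success and nonzero if the child failed.
        if p.exitcode != 0:
            raise RuntimeError(
                "Exitcode %s from processing %s" % (p.exitcode, datum))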

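The start() method launches a new process that runs sir_leaks_a_lot with its parameters bound to the elements of the args tuple passed to Process(). The join() method waits for that process to finish. Once it has finished, exitcode is 0 if the function returned normally and nonzero if it raised an exception or was killed.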
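A function run this way cannot simply return a value to the caller, because it executes in a different process. If you need a result back, one option is to send it through a pipe. The following is a minimal sketch, not the only way to do it: leaky_square is a hypothetical stand-in for your leaky code, and run_isolated and _send_result are helper names made up for this illustration.

```python
import multiprocessing


def leaky_square(datum):
    # Hypothetical stand-in for a call into a leaky extension.
    return datum * datum


def _send_result(func, datum, conn):
    # Runs in the child: compute the result and send it to the parent.
    conn.send(func(datum))
    conn.close()


def run_isolated(func, datum):
    """Run func(datum) in its own process and return the result."""
    receiver, sender = multiprocessing.Pipe(duplex=False)
    p = multiprocessing.Process(target=_send_result,
                                args=(func, datum, sender))
    p.start()
    sender.close()  # the parent keeps only the receiving end
    try:
        result = receiver.recv()
    except EOFError:
        # The child exited without sending anything.
        result = None
    p.join()
    if p.exitcode != 0:
        raise RuntimeError(
            "Exitcode %s from processing %r" % (p.exitcode, datum))
    return result


if __name__ == "__main__":
    print([run_isolated(leaky_square, n) for n in range(5)])
```

The result has to be picklable, since it is serialized on its way through the pipe. Closing the parent's copy of the sending end matters: without it, recv() would block forever if the child crashed before sending.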

You could run multiple sir_leaks_a_lot processes at once instead of letting each one finish before starting the next. But then the leaks of all the running processes would add up and could crash your program again. Calling join() right after start() ensures that each leaky process is cleaned up before the next one runs.
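If you do want some parallelism, one way to keep the leaks bounded is a small worker pool whose workers are retired after a single task. This is a sketch of an alternative, not the approach used above: it relies on the maxtasksperchild parameter of multiprocessing.Pool, and process_all and the workers default are assumptions chosen for illustration. The stand-in function returns its process ID only so you can see that every datum was handled by a different process.

```python
import multiprocessing
import os


def sir_leaks_a_lot(datum):
    # Hypothetical stand-in for the leaky extension call. It returns the
    # worker's process ID so the caller can see which process handled it.
    return datum, os.getpid()


def process_all(data, workers=2):
    # maxtasksperchild=1 retires each worker after one task, so whatever it
    # leaked goes back to the operating system when the worker exits.
    # chunksize=1 makes every datum its own task; map() would otherwise
    # batch several items into a single task.
    with multiprocessing.Pool(processes=workers, maxtasksperchild=1) as pool:
        return pool.map(sir_leaks_a_lot, data, chunksize=1)


if __name__ == "__main__":
    for datum, pid in process_all(range(6)):
        print("datum %s handled by process %s" % (datum, pid))
```

At most workers leaky processes are alive at any moment, and each one holds a single task's worth of leaked memory, so keep workers small if a single call leaks heavily.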

Now you are ready to generate tens of thousands of large matplotlib plots in a single cron job!
