Scanning for memory issues in your data pipelines

Aleksandar Gakovic
Published in Analytics Vidhya
4 min read · Oct 17, 2020


$ pip install filprofiler
%load_ext filprofiler  # use the Python Fil kernel
%%filprofile  # run this in the cell you wish to evaluate


I took some time to learn how Python handles memory management using its reference-counting and garbage-collection system. Data science work can be bottlenecked by a system's CPU and memory, among other things. A CPU struggling to keep up with the load typically shows up as the system ‘slowing down’. Memory issues such as leaks and sudden usage spikes, however, can have disastrous outcomes, often causing shutdowns, blue screens, and even loss of data.
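The reference-counting and cycle-collecting behaviour mentioned above can be observed directly with the standard library. This is a minimal sketch; the `Node` class and variable names are just illustrations, not anything from Fil:

```python
import gc
import sys

# CPython frees most objects via reference counting: each object tracks
# how many references point to it and is reclaimed the moment the
# count hits zero.
data = [1, 2, 3]
alias = data
# getrefcount's own argument adds one temporary reference,
# so this typically prints 3: data, alias, and the argument.
print(sys.getrefcount(data))

# Reference counting alone cannot free cycles, so a garbage collector
# periodically hunts for unreachable reference cycles.
class Node:
    def __init__(self):
        self.ref = None

a, b = Node(), Node()
a.ref, b.ref = b, a   # a cycle: the counts never reach zero on their own
del a, b
print(gc.collect())   # the cycle detector finds and reclaims the two nodes
```

Fil sits on top of this machinery: it records where allocations happen rather than how they are freed.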

I recently had an old pair of memory sticks, just two 2 GB sticks, turn faulty, and the resulting sudden system failures, blue screens, and downtime were increasingly unacceptable in this day and age.

To put this in context, the tool I am going to introduce in this article is useful in many situations:

  • You find it difficult to load large amounts of data
  • You want to see which part of your code uses the most memory
  • You want to operate within a smaller RAM budget

Then the Python Fil profiler is for you. There are other memory profilers for Python, but in the words of Fil’s creator:

“None of them are designed for batch processing applications that read in data, process it, and write out the result.” — Itamar Turner-Trauring

Unlike servers, where memory leaks are more often the cause of memory problems, data pipelines are shorter-lived processes, so memory spikes are the more common problem. “This is Fil’s primary goal: diagnosing spikes in memory usage.” — Itamar Turner-Trauring
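Fil itself can’t run inside an article, but the standard library’s tracemalloc can illustrate what such a spike looks like: peak usage far above what is still allocated at the end. A minimal sketch; the `pipeline` function is made up:

```python
import tracemalloc

def pipeline():
    # Temporary allocation: a spike that is gone by the time we return.
    big = [0] * 1_000_000     # roughly 8 MB of pointers on CPython
    total = sum(big)
    return total              # `big` is freed here

tracemalloc.start()
pipeline()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Peak is far higher than what is still allocated afterwards --
# exactly the kind of transient spike Fil is designed to pinpoint.
print(f"current={current} bytes, peak={peak} bytes")
```

tracemalloc gives you the numbers; Fil’s value is attributing the peak to the exact lines of code responsible.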

Give it a go today and see where your memory is being used the most (Linux and macOS only at the moment):

Installation and Usage

Install the package with:

$ pip install --upgrade pip
$ pip install filprofiler

Or Conda:

$ conda install -c conda-forge filprofiler

Using with a Script File

If you have a script you run with your code inside:

$ python yourscript.py --load-file=yourfile

Just change python to fil-profile run, like so:

$ fil-profile run yourscript.py --load-file=yourfile
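For example, fil-profile could be pointed at a script like the one below. This is a hypothetical yourscript.py matching the placeholder command above; the `--load-file` flag and all names are illustrative:

```python
# A hypothetical yourscript.py; reading the whole file into memory at
# once is the kind of allocation fil-profile would attribute to a line.
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--load-file", required=True)
    args = parser.parse_args()

    # The memory hotspot: the entire file is held in RAM at once.
    with open(args.load_file, "rb") as f:
        data = f.read()
    print(f"loaded {len(data)} bytes")
    return len(data)

if __name__ == "__main__":
    main()
```

Running it under fil-profile instead of python changes nothing about the script itself; Fil instruments the allocations as it runs.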

Using with a Jupyter Notebook

After installation, make sure you open the notebook with the alternative kernel, “Python 3 with Fil”. You can change kernels from inside the notebook:

click the kernel name in the top right corner
select the alternative kernel from the drop-down list

Once the kernel is selected, you can load the extension anywhere in the notebook with:

%load_ext filprofiler

and you can profile any specific cell by adding the ‘%%filprofile’ magic to the top of that cell:

%%filprofile
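For instance, the body of a profiled cell might look like the sketch below. In the actual notebook the first line of the cell would be the %%filprofile magic; the allocation itself is a made-up example:

```python
# In the notebook, this code would sit below a first line reading:
#   %%filprofile
# Building a large nested list is the sort of allocation Fil reports.
rows = [[float(i)] * 100 for i in range(10_000)]
totals = [sum(r) for r in rows]
print(len(totals))  # 10000
```

When the cell finishes, Fil renders its report inline, attributing the peak to the list-building lines.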

Here is an example where I use it on a recommender engine that predicts an item for a user:

The profile displays a heatmap of where memory usage is happening. Each cell can be clicked to inspect the traceback where the allocation originates. Above, you can see that most memory usage occurs while reading the dataset, constructing and building the dataset for the surprise library, and fitting the algorithm.

The creator intends to keep Fil lean and easy to use; he wants it to “just work”. Windows support is on the way.

Thank you for reading! I hope this article helped you understand more about Python’s memory management and how you can optimise your code to minimise memory spikes in your data pipelines. You can read more about the Fil profiler in this blog post, and here is a guide to reducing memory usage for data scientists.

References

  1. Memory Management in Python — Official documentation
  2. Clinging to memory: how Python function calls can increase your memory usage — Blog post, Itamar Turner-Trauring
  3. Filprofiler — GitHub
  4. Fil: a new Python memory profiler for data scientists and — Blog post, Itamar Turner-Trauring
  5. Guide to reducing memory usage for data scientists — Blog post, Itamar Turner-Trauring
