Python for Production: Accelerating Number-crunching without recoding

So, you’ve probably been using Python to prototype data science and number-crunching jobs, but when it comes to putting code into production, you reach for some variant of C. Why? Well, the standard Python interpreter (CPython) effectively runs Python code one thread at a time. The global interpreter lock (GIL) is the main culprit: only the thread holding the lock can execute Python bytecode, so CPU-bound threads take turns instead of running in parallel on multiple cores. The end effect is an awesome interpreter that can be slower than molasses when it comes to big jobs.
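To make the GIL's effect concrete, here is a minimal sketch that times the same CPU-bound busy loop run twice in sequence and then on two threads. The function names and the workload size `N` are hypothetical, chosen only for illustration. On a standard GIL-enabled CPython build the two timings usually come out close to each other, though exact numbers depend on the machine and interpreter build.

```python
import threading
import time


def count_down(n):
    """CPU-bound busy loop: pure Python bytecode, so it needs the GIL to run."""
    while n > 0:
        n -= 1
    return n


def run_sequential(n, jobs):
    """Run the loop `jobs` times, one after another, and return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(jobs):
        count_down(n)
    return time.perf_counter() - start


def run_threaded(n, jobs):
    """Run the loop on `jobs` threads at once and return elapsed seconds."""
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(jobs)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


N = 2_000_000  # hypothetical workload size, small enough to finish quickly
sequential = run_sequential(N, jobs=2)
threaded = run_threaded(N, jobs=2)
print(f"sequential: {sequential:.3f}s  threaded: {threaded:.3f}s")
```

If threads gave true parallelism for this kind of work, the threaded run would take roughly half as long. With the GIL in place it generally does not.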

What most Pythonistas don’t know is that much of this performance bottleneck can be removed, often without changing the Python code at all! You can get close to native execution speeds for numerical work, sidestep the GIL, and let Python take advantage of multiple threads and cores in many projects by using a group of libraries and tools created by none other than Intel.

First, the Intel® Distribution for Python* (IDP) bundles a wide array of packages that help accelerate Python code. They move the heavy computation out of the Python layer entirely and into compiled C functions, which in turn call on processor vectorization and parallelization to drive performance up and execution time down, often by orders of magnitude.
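As a rough illustration of moving work out of the Python layer, the sketch below computes the same dot product twice: once with an interpreted Python loop and once with a single NumPy call that runs in compiled code. The array size and the helper name `dot_pure_python` are assumptions made for this example. Whether `np.dot` is backed by MKL depends on how the installed NumPy was built, so treat the timings as indicative only.

```python
import time

import numpy as np


def dot_pure_python(xs, ys):
    """Dot product computed element by element in the interpreter."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


rng = np.random.default_rng(seed=0)
a = rng.random(200_000)  # hypothetical problem size
b = rng.random(200_000)

# Interpreted loop over plain Python lists.
start = time.perf_counter()
slow = dot_pure_python(a.tolist(), b.tolist())
loop_time = time.perf_counter() - start

# One call into compiled code; the Python source barely changes.
start = time.perf_counter()
fast = float(np.dot(a, b))
numpy_time = time.perf_counter() - start

print(f"loop:  {loop_time:.4f}s")
print(f"numpy: {numpy_time:.4f}s")
```

Both paths produce the same value up to floating-point rounding. The difference is where the loop runs.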

Intel® Threading Building Blocks (TBB), an alternative to OpenMP, provides a compiled runtime that splits up work, keeps caches hot, and balances loads when users call for parallelism. TBB identifies what it can break up and distributes it across two or more cores, and because its code sits under everything else, nested parallel calls are possible. Then there is the Intel® Data Analytics Acceleration Library (DAAL), which speeds up classical machine learning and data science workflows, supports both streaming and distributed algorithms, and is fully integrated with the rest of the Intel library packages. Also optimized for Python on Intel CPUs are packages such as NumPy, which is accelerated with the Intel® Math Kernel Library (MKL); SciPy, the standard scientific toolset, which also uses MKL; and numba, a just-in-time compiler that uses the latest SIMD features and multi-core execution to get the very most out of today’s CPUs. These are just a few of the scores of packages you’ll find in the Intel® Distribution for Python*.
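To give a feel for the just-in-time approach, here is a minimal sketch using numba's `njit` decorator on an ordinary Python loop. The function `sum_of_roots` and its input size are hypothetical. The `try`/`except` fallback is an addition for this example, so the snippet still runs, uncompiled, where numba is not installed.

```python
import math

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without numba, fall back to the plain
    # interpreted function so the example still runs.
    def njit(func):
        return func


@njit
def sum_of_roots(n):
    """Sum sqrt(i) for i in [0, n); compiled to machine code when numba is present."""
    total = 0.0
    for i in range(n):
        total += math.sqrt(i)
    return total


result = sum_of_roots(1_000_000)  # the first call includes compilation time
print(result)
```

The body of the function is unchanged Python. Only the decorator is new, which is the sense in which the code does not need to be rewritten.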

Once you’ve found the right acceleration tools, there’s the issue of tuning the code to get the most performance. It could be that nested loops are holding up the code and preventing multiple threads from executing, or that some other coding issue is the bottleneck. If only you knew exactly WHICH line of code is the culprit. Well, Intel has the tool for that as well, in the form of Intel® VTune Amplifier™, a GUI-based profiling tool that can auto-detect mixed Python/C code, identify hot spots, and provide a full call stack listing that shows performance over time. It adds only about 15% overhead and doesn’t require any code modification. It shows when multiple threads are running and how many cores are active at a given time, and it pinpoints problems from either a bottom-up or top-down view, all from an intuitive, easy-to-use interface.
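VTune itself is a GUI tool, so it cannot be shown in a few lines of code. As a stand-in, the sketch below uses Python's built-in `cProfile` module to illustrate the general idea of hotspot hunting: run a workload, then rank functions by time spent. This is not VTune and reports per function rather than per line. The functions `slow_part`, `fast_part`, and `workload` are hypothetical.

```python
import cProfile
import io
import pstats


def slow_part(n):
    """Deliberately heavy nested loop: the intended hot spot."""
    total = 0
    for i in range(n):
        for j in range(50):
            total += i * j
    return total


def fast_part(n):
    """Cheap by comparison: the loop runs inside the built-in sum()."""
    return sum(range(n))


def workload():
    return slow_part(20_000) + fast_part(20_000)


# Profile a single run of the workload.
profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Collect the five most expensive entries, ranked by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

In the printed report, `slow_part` should account for most of the time, which is the kind of signal a profiler gives you before you start tuning.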

Even better, you can easily download it for free today.

After all, who better than Intel would know how to get the most out of their processors?

Education Ecosystem Blog

About author

Dr. Michael J. Garbade