In this post I explain how to optimize Pandas to process huge amounts of data. I cover the three optimizations that let me analyze more than 100 million rows and 59 columns on a regular computer:

1. Looping correctly, by using Pandas’ built-ins, NumPy, and SIMD vectorization instead of Python-level loops.
2. Tweaking dtypes to shrink memory usage.
3. Parallelizing across all CPU cores, and unlocking bigger-than-RAM datasets, with Dask.
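To make these ideas concrete before diving in, here is a minimal sketch of the first two optimizations (vectorized looping and dtype tweaks) on a hypothetical DataFrame; the column names and sizes are illustrative, not from the dataset analyzed in this post:

```python
import numpy as np
import pandas as pd

# Hypothetical example data (names and sizes are illustrative only).
n = 1_000_000
df = pd.DataFrame({
    "price": np.random.rand(n) * 100,          # float64 by default
    "qty": np.random.randint(1, 10, size=n),   # int64 by default
})

# 1. Loop correctly: one vectorized expression instead of df.apply or
#    iterrows — Pandas delegates to NumPy, which uses SIMD under the hood.
df["total"] = df["price"] * df["qty"]

# 2. Tweak dtypes: downcast the 64-bit defaults to smaller types that
#    still fit the data, cutting memory use.
df["price"] = df["price"].astype("float32")
df["qty"] = df["qty"].astype("int8")
```

For the third optimization, much of the same code carries over to Dask (`import dask.dataframe as dd`), whose DataFrame API mirrors Pandas while partitioning the work across CPU cores and processing data in chunks that need not fit in RAM.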