In this post I explain how to optimize Pandas to process huge amounts of data. I cover the three optimizations that allowed me to analyze more than 100 million rows and 59 columns on a regular computer: 1. looping correctly by using Pandas’ built-ins, NumPy, and SIMD vectorization; 2. tweaking dtypes; and 3. parallelizing across all CPU cores and unlocking bigger-than-RAM datasets with Dask.
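As a rough illustration of the first two optimizations, here is a minimal sketch on synthetic data. The column names (`city`, `quantity`, `price`) and the sizes are made up for this example and are not taken from the post; the Dask step is left out to keep the snippet small.

```python
import numpy as np
import pandas as pd

# Synthetic data; column names and sizes are assumptions for illustration.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "city": rng.choice(["Paris", "Lima", "Oslo"], size=n),  # repeated strings
    "quantity": rng.integers(1, 100, size=n),               # values in 1..99
    "price": rng.random(n) * 50,                            # float64
})

# 1. Vectorize: operate on whole columns instead of looping over rows.
df["total"] = df["quantity"] * df["price"]

# 2. Tweak dtypes: use the smallest type that still holds the data.
before = df.memory_usage(deep=True).sum()
slim = df.astype({
    "city": "category",   # few distinct values, so store integer codes
    "quantity": "int8",   # 1..99 fits in a signed 8-bit integer
    "price": "float32",   # halves the size, at the cost of precision
})
after = slim.memory_usage(deep=True).sum()

print(f"memory before: {before:,} bytes, after: {after:,} bytes")
```

Whether `float32` is acceptable depends on how much precision the analysis needs, so treat that downcast as a choice to verify rather than a default.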
Tag: python
Optimize Pandas & Dask for big datasets: the example
Big Sheets – Domain Driven Design with a hexagonal architecture
Big Sheets is my attempt at building software with concepts from clean architecture, hexagonal architecture, Domain Driven Design (DDD), and a bit of event-driven programming, following the amazing book Architecture Patterns with Python.
This post introduces the architecture of the software, so you can take a peek at the code afterwards, and comments on a few challenges and lessons learned from the experience. If you are a practitioner, Big Sheets can serve as an elaborate example.
Debug segmentation faults in Apache from mod_wsgi
In this guide we show how to get information from Apache segmentation faults that come from Python’s mod_wsgi.
Create a sphinx extension to customize your docs
With Sphinx it is easy to generate documentation for your Python project, as long as you don’t require custom code. This is a tutorial on how to create a quick & dirty Sphinx extension to personalize the docs of your project.