bustawin

Tag: dask

Optimize Pandas & Dask for big datasets: the example

In this post I explain how to optimize Pandas to process huge datasets. I cover the three optimizations that allowed me to analyze more than 100 million rows and 59 columns on a regular computer: 1. looping correctly by using Pandas' built-ins, NumPy, and SIMD vectorization; 2. tweaking dtypes; and 3. parallelizing across all CPU cores and handling bigger-than-RAM datasets with Dask.

2021-07-18

A collection of publications, notes, and tricks.
