Production-grade, end-to-end benchmarking framework for Polars at scale.
- Provide reproducible, realistic benchmarks for Polars across 1e7–1e9 rows.
- Focus on OLAP aggregations, time-series joins, and windowed analytics with Parquet-backed IO and lazy execution.
- Install dependencies (Poetry recommended):

```shell
poetry install
```

- Generate a synthetic dataset (edit `config/dataset.toml` first):

```shell
poetry run python scripts/generate_data.py
```

- Run benchmarks (edit `config/benchmark.toml` to tune):

```shell
poetry run python scripts/run_benchmarks.py
```

- Run tests and type checks:

```shell
poetry run pytest
poetry run mypy src/
```

Project layout:

```
config/          # dataset and benchmark configs (TOML)
data/            # generated Parquet files
src/             # core library: dataset, benchmarks, engine, metrics
scripts/         # small CLI helpers to generate and run benchmarks
results/         # raw numeric outputs and reports
tests/           # unit tests
pyproject.toml
Makefile
README.md
LICENSE
```
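For orientation, a dataset config might look like the sketch below. Every key here is hypothetical, meant only to illustrate the TOML shape; consult `config/dataset.toml` in the repository for the actual schema:

```toml
# Hypothetical config/dataset.toml sketch -- keys are illustrative, not the
# framework's actual schema.
[dataset]
rows = 100_000_000        # target row count (1e7-1e9 supported)
seed = 42                 # RNG seed for reproducibility

[output]
path = "data/"            # where generated Parquet files land
compression = "zstd"      # "snappy" or "zstd"
```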
- Data is written to Parquet (Snappy/ZSTD) and read with Polars lazy scans to measure realistic IO + CPU behavior.
- Benchmarks measure wall time and peak memory; each benchmark repeats multiple times and writes raw results to `results/raw/`.
- The codebase is modular, so you can add new workloads under `src/benchmarks/` and new data generators under `src/dataset/`.
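The repeat-and-record loop can be sketched as follows. The `bench` helper and its result keys are hypothetical, not the framework's actual API; note that `tracemalloc` only tracks Python-heap allocations, so a production harness would also sample process RSS (e.g. via `psutil`) to capture Polars' native buffers:

```python
import time
import tracemalloc
from statistics import median

def bench(fn, repeats=5):
    """Run fn repeatedly, recording wall time and peak Python-heap memory.

    Hypothetical helper for illustration: tracemalloc misses native
    allocations, so real peak-memory numbers need an RSS-based sampler.
    """
    times, peaks = [], []
    for _ in range(repeats):
        tracemalloc.start()
        t0 = time.perf_counter()
        fn()  # the workload under test
        times.append(time.perf_counter() - t0)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peaks.append(peak)
    # Median wall time is robust to warm-up noise; peak memory takes the max.
    return {"median_s": median(times), "peak_bytes": max(peaks)}

stats = bench(lambda: sum(i * i for i in range(100_000)), repeats=3)
print(sorted(stats))  # ['median_s', 'peak_bytes']
```

Raw per-repeat numbers (not just the summary) are what get written under `results/raw/`, so reports can be regenerated later.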
- Target: Linux/WSL2, 8+ cores, 32–128GB RAM, NVMe SSD. Benchmarks will scale across machines without code changes.
- Open an issue or PR on GitHub. Add tests for new benchmarks and run `mypy` before submitting.
This project is released under the MIT License — see LICENSE.
© 2025 Seyyed Ali Mohammadiyeh