Processes Census American Community Survey (ACS) 5-year summary files and TIGER geographic boundary data into queryable DuckDB databases with percentile rankings and spatial query support.
- Devbox for development environment management
- ~600GB disk space for temporary processing files
- 20GB+ RAM recommended for percentile processing
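As a quick preflight check before starting a long run, the disk requirement above can be verified with the standard library. This is a minimal sketch, not part of the project: the function names are hypothetical, and the 600 GB threshold is taken from the requirements list.

```python
import shutil

# Threshold taken from the requirements list (~600GB of temporary files).
REQUIRED_FREE_GB = 600


def free_disk_gb(path: str = ".") -> float:
    """Return free disk space at `path` in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_disk(path: str = ".", required_gb: float = REQUIRED_FREE_GB) -> bool:
    """True if the filesystem holding `path` has at least `required_gb` free."""
    return free_disk_gb(path) >= required_gb


if __name__ == "__main__":
    free = free_disk_gb(".")
    status = "ok" if has_enough_disk(".") else "insufficient"
    print(f"Free disk: {free:.0f} GB ({status}; ~{REQUIRED_FREE_GB} GB suggested)")
```

Run it from the directory where temporary files will be written, since free space is reported per filesystem.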
# Clone repository and enter devbox environment
devbox shell

Complete pipeline to create production databases:
# Step 1: Download ACS data (latest year detected automatically)
python -m src.main download-acs
# Step 2: Download TIGER boundary files
python -m src.main download-tiger
# Step 3: Parse ACS DAT files into unified CSV
python -m src.main parse-acs --year 2023
# Step 4: Load data into staging database
python -m src.main load --year 2023
# Step 5: Calculate percentiles
python -m src.main percentiles --year 2023
# Step 6: Generate JSON summaries
python -m src.main json-summaries --year 2023
# Step 7: Create production databases
python -m src.main create-production --year 2023

Final production databases are created in output/:
- census_acs.production.percentiles.db - Percentile rankings for all ACS estimates with TIGER geometries
- census_acs.production.json_summaries.db - JSON-formatted summaries by geography with spatial query support
Staging databases are created during processing in output/census_acs.staging.*.db and can be removed after production databases are created.
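Cleaning up staging files can be scripted. The sketch below assumes only the file pattern stated above (output/census_acs.staging.*.db). The helper names are hypothetical, and it defaults to a dry run so nothing is deleted by accident.

```python
from pathlib import Path


def find_staging_dbs(output_dir: str = "output") -> list[Path]:
    """List staging databases matching output/census_acs.staging.*.db."""
    return sorted(Path(output_dir).glob("census_acs.staging.*.db"))


def remove_staging_dbs(output_dir: str = "output", dry_run: bool = True) -> list[Path]:
    """Delete staging databases; with dry_run=True only report what would go."""
    targets = find_staging_dbs(output_dir)
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets


if __name__ == "__main__":
    for path in remove_staging_dbs("output", dry_run=True):
        print(f"would remove: {path}")
```

Pass `dry_run=False` only after confirming that both production databases exist.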
- download-acs - Download ACS 5-year summary data ZIP file
- download-tiger - Download TIGER geographic boundary shapefiles
- parse-acs - Parse ACS DAT files into unified gzipped CSV format
- load - Load parsed CSV and TIGER data into staging database
- percentiles - Calculate national/state/county percentile rankings
- json-summaries - Create JSON-formatted summaries by GEO_ID
- create-production - Generate final production-ready databases
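The commands above can be chained into a single run. The following is a sketch of one way to do that with `subprocess`, stopping at the first failing step. The wrapper functions are hypothetical; the subcommands and the `--year` flag are the ones listed in this README, with `--year` passed only to the steps shown taking it.

```python
import subprocess
import sys


def build_commands(year: int) -> list[list[str]]:
    """Return the seven pipeline steps, in order, as argv lists."""
    base = [sys.executable, "-m", "src.main"]
    year_arg = ["--year", str(year)]
    return [
        base + ["download-acs"],      # latest year detected automatically
        base + ["download-tiger"],
        base + ["parse-acs"] + year_arg,
        base + ["load"] + year_arg,
        base + ["percentiles"] + year_arg,
        base + ["json-summaries"] + year_arg,
        base + ["create-production"] + year_arg,
    ]


def run_pipeline(year: int, dry_run: bool = False) -> None:
    """Run each step in order; check=True raises on the first failure."""
    for cmd in build_commands(year):
        print("+", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(2023, dry_run=True)  # print the commands without running them
```

Using `sys.executable` keeps every step in the same interpreter as the wrapper, which matters inside the devbox shell.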
- Project Summary - High-level overview
- Workflow Stages - Detailed pipeline explanation
- Module Implementation - Technical implementation guide
- Design Decisions - Architecture choices and rationale
- ACS and TIGER Background - Census data format reference
Key settings in config/settings.py:
- production_partitions - Number of hash partitions for percentile processing (default: 96)
- memory_limit - DuckDB memory limit (default: 20GB)
- threads - Number of threads for parallel operations (default: 16)
- temp_directory_size - Maximum temporary disk space (default: 300GiB)
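To illustrate how settings like these might reach DuckDB, the sketch below models them as a dataclass and renders the corresponding `SET` statements. The `PipelineSettings` class and the mapping of `temp_directory_size` to DuckDB's `max_temp_directory_size` option are assumptions for illustration. The actual layout of config/settings.py is not shown here and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineSettings:
    """Hypothetical container mirroring the documented defaults."""
    production_partitions: int = 96
    memory_limit: str = "20GB"
    threads: int = 16
    temp_directory_size: str = "300GiB"


def duckdb_set_statements(settings: PipelineSettings) -> list[str]:
    """Render DuckDB SET statements for the resource-related settings.

    production_partitions is a pipeline-level value, not a DuckDB option,
    so it is intentionally left out.
    """
    return [
        f"SET memory_limit = '{settings.memory_limit}'",
        f"SET threads = {settings.threads}",
        # Assumed mapping: temp_directory_size -> max_temp_directory_size
        f"SET max_temp_directory_size = '{settings.temp_directory_size}'",
    ]


if __name__ == "__main__":
    for statement in duckdb_set_statements(PipelineSettings()):
        print(statement)
```

On a machine with less memory, constructing `PipelineSettings(memory_limit="8GB", threads=4)` yields the adjusted statements, though percentile processing may then spill more to disk.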
Active TIGER summary levels are configured in config/tiger_levels.toml.
MIT License