bfs/census_acs_duckdb_importer


Census ACS DuckDB Importer

Processes Census American Community Survey (ACS) 5-year summary files and TIGER geographic boundary data into queryable DuckDB databases with percentile rankings and spatial query support.

Prerequisites

  • Devbox for development environment management
  • ~600GB disk space for temporary processing files
  • 20GB+ RAM recommended for percentile processing

Installation

# From the root of the cloned repository, enter the devbox environment
devbox shell

Quick Start

Complete pipeline to create production databases:

# Step 1: Download ACS data (latest year detected automatically)
python -m src.main download-acs

# Step 2: Download TIGER boundary files
python -m src.main download-tiger

# Step 3: Parse ACS DAT files into unified CSV
python -m src.main parse-acs --year 2023

# Step 4: Load data into staging database
python -m src.main load --year 2023

# Step 5: Calculate percentiles
python -m src.main percentiles --year 2023

# Step 6: Generate JSON summaries
python -m src.main json-summaries --year 2023

# Step 7: Create production databases
python -m src.main create-production --year 2023
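
The seven steps above can also be driven from one script. The following is a minimal sketch, not part of the repository: `build_commands` and `run_pipeline` are hypothetical helper names, and it assumes only the subcommands and the `--year` flag shown in this Quick Start. It stops at the first failing step.

```python
import subprocess
import sys

# Subcommands in the order given in the Quick Start.
STEPS = [
    "download-acs",
    "download-tiger",
    "parse-acs",
    "load",
    "percentiles",
    "json-summaries",
    "create-production",
]
# The Quick Start passes --year only to steps 3 through 7.
YEAR_STEPS = set(STEPS[2:])


def build_commands(year):
    """Return one argv list per pipeline step (hypothetical helper)."""
    commands = []
    for step in STEPS:
        cmd = [sys.executable, "-m", "src.main", step]
        if step in YEAR_STEPS:
            cmd += ["--year", str(year)]
        commands.append(cmd)
    return commands


def run_pipeline(year, dry_run=False):
    """Run each step in order; check=True raises on the first failure."""
    for cmd in build_commands(year):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run only prints the commands; set dry_run=False to execute them.
    run_pipeline(2023, dry_run=True)
```

Run it from the repository root inside `devbox shell` so that `src.main` is importable.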

Output

Final production databases are created in output/:

  • census_acs.production.percentiles.db - Percentile rankings for all ACS estimates with TIGER geometries
  • census_acs.production.json_summaries.db - JSON-formatted summaries by geography with spatial query support

Staging databases are written to output/census_acs.staging.*.db during processing and can be removed once the production databases have been created.
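
Staging cleanup can be scripted so that nothing is deleted before both production files exist. This is a sketch under the file names listed above; `remove_staging` is a hypothetical helper, not a command provided by the project, and it defaults to a dry run.

```python
from pathlib import Path

# Production file names as listed in the Output section.
PRODUCTION_FILES = (
    "census_acs.production.percentiles.db",
    "census_acs.production.json_summaries.db",
)


def remove_staging(output_dir="output", dry_run=True):
    """Delete staging databases only if both production databases exist.

    Returns the staging paths that were (or, in a dry run, would be) removed.
    """
    out = Path(output_dir)
    missing = [name for name in PRODUCTION_FILES if not (out / name).is_file()]
    if missing:
        raise FileNotFoundError(f"production databases missing: {missing}")
    staging = sorted(out.glob("census_acs.staging.*.db"))
    for path in staging:
        if not dry_run:
            path.unlink()
    return staging
```

Call `remove_staging()` first to review the list, then `remove_staging(dry_run=False)` to delete the files.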

Available Commands

  • download-acs - Download ACS 5-year summary data ZIP file
  • download-tiger - Download TIGER geographic boundary shapefiles
  • parse-acs - Parse ACS DAT files into unified gzipped CSV format
  • load - Load parsed CSV and TIGER data into staging database
  • percentiles - Calculate national/state/county percentile rankings
  • json-summaries - Create JSON-formatted summaries by GEO_ID
  • create-production - Generate final production-ready databases
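
To illustrate what a percentile ranking means for the `percentiles` step, here is a small pure-Python sketch. The README does not state the exact formula the pipeline uses, so this assumes the convention of SQL's `PERCENT_RANK` window function: (number of values strictly smaller) / (n - 1). The function name `percent_rank` is illustrative only.

```python
from bisect import bisect_left


def percent_rank(values):
    """Return PERCENT_RANK-style ranks in [0, 1], in input order.

    Assumed definition: count of strictly smaller values divided by (n - 1).
    Tied values share the same rank.
    """
    n = len(values)
    if n <= 1:
        return [0.0] * n
    ordered = sorted(values)
    # bisect_left gives the number of values strictly less than v.
    return [bisect_left(ordered, v) / (n - 1) for v in values]


print(percent_rank([10, 20, 30]))  # [0.0, 0.5, 1.0]
```

National, state, and county rankings would differ only in which set of geographies is passed in as `values`: all of them, or those sharing a state or county.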

Documentation

Configuration

Key settings in config/settings.py:

  • production_partitions - Number of hash partitions for percentile processing (default: 96)
  • memory_limit - DuckDB memory limit (default: 20GB)
  • threads - Number of threads for parallel operations (default: 16)
  • temp_directory_size - Maximum temporary disk space (default: 300GiB)
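
For orientation, three of these settings correspond to standard DuckDB configuration options (`memory_limit`, `threads`, and `max_temp_directory_size`, the last available in recent DuckDB releases). How config/settings.py actually applies them is not shown here, so the mapping below is an assumption, and `duckdb_set_statements` is a hypothetical helper that only builds the SQL strings.

```python
def duckdb_set_statements(memory_limit="20GB", threads=16,
                          temp_directory_size="300GiB"):
    """Build DuckDB SET statements from the README's default values.

    Assumption: temp_directory_size maps to DuckDB's max_temp_directory_size.
    """
    return [
        f"SET memory_limit = '{memory_limit}'",
        f"SET threads = {int(threads)}",
        f"SET max_temp_directory_size = '{temp_directory_size}'",
    ]


for statement in duckdb_set_statements():
    print(statement)
```

Each string can be passed to a DuckDB connection's `execute` method, for example when experimenting with lower limits on a smaller machine.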

Active TIGER summary levels are configured in config/tiger_levels.toml.

License

MIT License
