Census ACS DuckDB Importer

Processes Census American Community Survey (ACS) 5-year summary files and TIGER geographic boundary data into queryable DuckDB databases with percentile rankings and spatial query support.

Prerequisites

Devbox for development environment management
~600GB disk space for temporary processing files
20GB+ RAM recommended for percentile processing

Installation

# After cloning the repository, enter the devbox environment from its root
devbox shell

Quick Start

Run the complete pipeline to create the production databases (the examples use --year 2023; substitute the year you downloaded):

# Step 1: Download ACS data (latest year detected automatically)
python -m src.main download-acs

# Step 2: Download TIGER boundary files
python -m src.main download-tiger

# Step 3: Parse ACS DAT files into unified CSV
python -m src.main parse-acs --year 2023

# Step 4: Load data into staging database
python -m src.main load --year 2023

# Step 5: Calculate percentiles
python -m src.main percentiles --year 2023

# Step 6: Generate JSON summaries
python -m src.main json-summaries --year 2023

# Step 7: Create production databases
python -m src.main create-production --year 2023
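
The seven steps above can also be chained from a small driver script. The following is a minimal sketch, not part of the project: the `build_commands` and `run_pipeline` helpers are hypothetical names, and it assumes the subcommands accept exactly the flags shown in the Quick Start.

```python
import subprocess
import sys

# Subcommands in Quick Start order. The flag marks whether the step
# is shown with --year above (the two download steps are not).
STEPS = [
    ("download-acs", False),
    ("download-tiger", False),
    ("parse-acs", True),
    ("load", True),
    ("percentiles", True),
    ("json-summaries", True),
    ("create-production", True),
]


def build_commands(year):
    """Return one argv list per pipeline step, in execution order."""
    commands = []
    for name, takes_year in STEPS:
        argv = [sys.executable, "-m", "src.main", name]
        if takes_year:
            argv += ["--year", str(year)]
        commands.append(argv)
    return commands


def run_pipeline(year):
    """Run each step in turn; check=True stops at the first failing step."""
    for argv in build_commands(year):
        print("running:", " ".join(argv[1:]))
        subprocess.run(argv, check=True)
```

Calling `run_pipeline(2023)` from the repository root (inside the devbox shell) would run the same commands as the manual sequence and abort on the first non-zero exit code.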

Output

Final production databases are created in output/:

census_acs.production.percentiles.db - Percentile rankings for all ACS estimates with TIGER geometries
census_acs.production.json_summaries.db - JSON-formatted summaries by geography with spatial query support

Staging databases are written to output/census_acs.staging.*.db during processing and can be deleted once the production databases have been created.
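
Cleaning up staging files can be scripted. The sketch below is illustrative only: `remove_staging` is a hypothetical helper, and it assumes the file names listed above. It refuses to delete anything unless both production databases are present, and defaults to a dry run.

```python
from pathlib import Path


def remove_staging(output_dir="output", dry_run=True):
    """Return the staging databases found; delete them when dry_run is False.

    Raises FileNotFoundError if either production database is missing,
    so staging data is never removed before the pipeline has finished.
    """
    out = Path(output_dir)
    production = [
        out / "census_acs.production.percentiles.db",
        out / "census_acs.production.json_summaries.db",
    ]
    missing = [p.name for p in production if not p.exists()]
    if missing:
        raise FileNotFoundError(f"production databases not found: {missing}")

    staging = sorted(out.glob("census_acs.staging.*.db"))
    if not dry_run:
        for path in staging:
            path.unlink()
    return staging
```

`remove_staging()` only reports what would be removed; `remove_staging(dry_run=False)` performs the deletion.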

Available Commands

download-acs - Download ACS 5-year summary data ZIP file
download-tiger - Download TIGER geographic boundary shapefiles
parse-acs - Parse ACS DAT files into unified gzipped CSV format
load - Load parsed CSV and TIGER data into staging database
percentiles - Calculate national/state/county percentile rankings
json-summaries - Create JSON-formatted summaries by GEO_ID
create-production - Generate final production-ready databases
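
To illustrate what a national versus state percentile ranking means, here is a small pure-Python sketch. It is not the project's implementation (which runs in DuckDB), and it assumes one common definition: the percentage of geographies in the comparison group whose estimate is less than or equal to the given one. The row layout and the `percentile_ranks` helper are hypothetical, and the numbers are made up.

```python
from bisect import bisect_right
from collections import defaultdict


def percentile_ranks(rows, group_key):
    """Map each geo_id to the percent of its group with an estimate <= its own."""
    groups = defaultdict(list)
    for row in rows:
        groups[group_key(row)].append(row["estimate"])
    for values in groups.values():
        values.sort()

    ranks = {}
    for row in rows:
        values = groups[group_key(row)]
        # bisect_right counts how many sorted values are <= this estimate.
        count = bisect_right(values, row["estimate"])
        ranks[row["geo_id"]] = 100.0 * count / len(values)
    return ranks


# Made-up rows for illustration only; not real ACS estimates.
rows = [
    {"geo_id": "A", "state": "01", "estimate": 10},
    {"geo_id": "B", "state": "01", "estimate": 30},
    {"geo_id": "C", "state": "02", "estimate": 20},
    {"geo_id": "D", "state": "02", "estimate": 40},
]

national = percentile_ranks(rows, lambda r: "US")        # one group for all rows
by_state = percentile_ranks(rows, lambda r: r["state"])  # one group per state
```

With these rows, geography B ranks at 75.0 nationally but 100.0 within its state, which is why the same estimate can carry several percentile values depending on the comparison group.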

Documentation

Project Summary - High-level overview
Workflow Stages - Detailed pipeline explanation
Module Implementation - Technical implementation guide
Design Decisions - Architecture choices and rationale
ACS and TIGER Background - Census data format reference

Configuration

Key settings in config/settings.py:

production_partitions - Number of hash partitions for percentile processing (default: 96)
memory_limit - DuckDB memory limit (default: 20GB)
threads - Number of threads for parallel operations (default: 16)
temp_directory_size - Maximum temporary disk space (default: 300GiB)
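
As a rough illustration of how these settings relate to DuckDB, the sketch below renders three of them as DuckDB `SET` statements. The `Settings` class and `duckdb_set_statements` helper are hypothetical, and the mapping of `temp_directory_size` to DuckDB's `max_temp_directory_size` option is an assumption; the actual layout of config/settings.py may differ. `production_partitions` is a project-level setting with no DuckDB equivalent, so it is not rendered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Hypothetical container; names and defaults are those listed above.
    production_partitions: int = 96
    memory_limit: str = "20GB"
    threads: int = 16
    temp_directory_size: str = "300GiB"


def duckdb_set_statements(settings):
    """Render the DuckDB-level settings as SET statements (assumed mapping)."""
    return [
        f"SET memory_limit = '{settings.memory_limit}'",
        f"SET threads = {settings.threads}",
        f"SET max_temp_directory_size = '{settings.temp_directory_size}'",
    ]
```

On a machine with less memory, something like `duckdb_set_statements(Settings(memory_limit="8GB", threads=4))` shows the statements a smaller configuration would produce.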

Active TIGER summary levels are configured in config/tiger_levels.toml.

License

MIT License

