quadrivium

Open-source harmonization of U.S. higher-education survey data into reproducible analytical panels.

The current scope is NSF HERD (Higher Education Research and Development survey), FY 1972–2024. The roadmap covers IPEDS, NSF GSS, and other NCSES surveys.

What makes this different

Higher-education survey data has methodological discontinuities — era boundaries in survey instruments, encoding shifts, taxonomy redesigns, infrastructure changes. Most published analyses treat the data as if those discontinuities don't exist, or skip the eras where they do.

Quadrivium applies Reconstructive Harmonization:

(a) reconstruct what each era can support on its own terms (rules, crosswalks, validated reconstructions);

(b) decompose what crossing a discontinuity actually involves into named, quantified components (real growth, definitional change, population expansion, residual unmeasurables);

(c) publish both the reconstruction and the decomposition with sufficient documentation that a cold reader can use either without misreading the discontinuity.

This is not a bridge across discontinuities. It is the discipline of making operational data legible across them by being precise about what is reconstructible, what is decomposable, and what remains unmeasurable. See docs/methods_notes/reconstructive_harmonization.md for the methodological account applied to the 2010 HERD era boundary.
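The decomposition in (b) can be read as bookkeeping across the boundary. The following is a minimal sketch, assuming the named components combine additively. The additive form, the class and field names, and every number are illustrative assumptions, not the project's method or HERD figures; the methods note is the authority on how components are actually estimated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundaryDecomposition:
    """Hypothetical additive split of the change observed across an era boundary."""

    pre_value: float             # reported value in the last pre-boundary year
    post_value: float            # reported value in the first post-boundary year
    real_growth: float           # part of the change attributed to real growth
    definitional_change: float   # part attributed to a changed definition
    population_expansion: float  # part attributed to a wider survey population

    @property
    def observed_change(self) -> float:
        return self.post_value - self.pre_value

    @property
    def residual(self) -> float:
        # Whatever the named components do not explain stays visible as a residual.
        named = self.real_growth + self.definitional_change + self.population_expansion
        return self.observed_change - named

    def shares(self) -> dict[str, float]:
        """Express each component as a fraction of the observed change."""
        total = self.observed_change
        if total == 0:
            raise ValueError("no observed change to decompose")
        return {
            "real_growth": self.real_growth / total,
            "definitional_change": self.definitional_change / total,
            "population_expansion": self.population_expansion / total,
            "residual": self.residual / total,
        }


# Made-up numbers, chosen only so the arithmetic is easy to follow.
example = BoundaryDecomposition(
    pre_value=100.0,
    post_value=112.0,
    real_growth=5.0,
    definitional_change=4.0,
    population_expansion=2.0,
)
print(example.residual)  # 1.0
for name, share in example.shares().items():
    print(f"{name}: {share:.1%}")
```

Keeping the residual as an explicit field, rather than folding it into one of the named components, mirrors the text's point that what remains unmeasurable should be stated rather than hidden.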

Canonical artifacts

data/harmonized/herd_panel.parquet: 50-year field-level R&D expenditure panel (FY 1975–2024), with two parallel reconstructed series across the 2010 era boundary.
data/harmonized/herd_panel_attributes.parquet: institution-year sibling of the financial panel carrying the Q4/Q5 attributes (medical-school and clinical-trials share and value columns).
data/harmonized/herd_personnel.parquet: Q15 headcount and Q16 FTE personnel panel for FY 2022–2024, the microdata-bearing years. NCSES Data Table 26 publishes institution totals for FY 2020–2024, but FY 2020–2021 are aggregate-only, with no per-institution microdata. Unlike the financial panel, this panel carries no quality_flag column; the asymmetry in imputation provenance is documented in docs/methods_notes/herd_panel_etl_scoping.md §12.

Companion validation reports in validation/reports/ carry the reconciliation against published NSF / NCSES ground truth.

Quick start

git clone https://github.com/QuinnyXu/quadrivium.git quadrivium
cd quadrivium
uv sync
uv run python etl/build_herd_panel.py        # rebuild financial + attribute parquets
uv run python etl/build_herd_personnel.py    # rebuild personnel parquet

Requirements: Python 3.12 and a local install of uv. The repo pins uv 0.11.8 in the lockfile and pins the runtime dependencies to duckdb==1.5.2 and pypdf==6.10.2.

Raw NSF HERD zips are not redistributed via git. SHA-256 manifests in data/raw/MANIFEST.md document the exact files that reproduce the harmonized outputs; download from NSF's HERD survey archive (URLs listed in the MANIFEST).

Reproducibility contract

A cold reader with the lockfile, the raw zips named in data/raw/MANIFEST.md, and the NCSES reference PDFs in data/reference/ reproduces the same harmonized parquet output bit-for-bit, provided the parquet writer is deterministic for a fixed pair of inputs and code version.

What ships in the deposit vs. what you fetch from NSF

Ships in the deposit (tracked in git, CC-BY-4.0): the three harmonized parquets in data/harmonized/ — SHA-256s pinned in data/harmonized/MANIFEST.md — plus the crosswalks, the methods notes, the validation reports, the NCSES reference PDFs (data/reference/), the lockfile, and the build scripts. You can use the harmonized panels directly, or rebuild them.

Fetched from NSF (not redistributed): the 53 raw HERD year zips and 13 short-form zips. Their SHA-256s and download URLs are in data/raw/MANIFEST.md; they are U.S. government work, staged by checksum rather than redistributed (the provenance-clean choice — the zip is the bit-identical artifact NSF shipped). A consumer rebuilding from raw obtains them from NSF's HERD archive.

The integrity round-trip: raw-zip SHAs (NSF-fetched, data/raw/MANIFEST.md) → uv sync + build → harmonized-parquet SHAs (deposit-shipped, data/harmonized/MANIFEST.md). A consumer who fetches the raw zips, verifies them against data/raw/MANIFEST.md, and runs uv sync and the build scripts reproduces the harmonized SHAs in data/harmonized/MANIFEST.md. This round-trip is verified end-to-end from a clean checkout: the harmonized panel rebuilds to the exact pinned SHA, and the FY 2024 verification grid re-asserts 58/58 at +0.000%.
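The checksum steps of the round-trip can be sketched with the standard library. This is a minimal sketch, not the repository's tooling: the function names are hypothetical, and because the layout of the MANIFEST.md files is not described here, the expected digests are passed in as a plain mapping that you would fill in from the manifest yourself.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large zips are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(expected: dict[str, str], root: Path) -> dict[str, str]:
    """Classify each expected file under root as 'ok', 'mismatch', or 'missing'.

    `expected` maps a file name relative to `root` to its hex SHA-256 digest.
    """
    results: dict[str, str] = {}
    for name, want in expected.items():
        path = root / name
        if not path.is_file():
            results[name] = "missing"
        elif sha256_of(path) == want.strip().lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

The same function serves both ends of the round-trip: run it against data/raw/ before building, and against data/harmonized/ after the build scripts finish.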

Rebuilding the methods-note figures is not part of the deposit runtime and uses the separate charts dependency group. To rebuild figures:

uv sync --group charts
uv run --group charts python etl/spikes/era_2010_decomposition_chart.py
uv run --group charts python etl/spikes/herd_question_count_cliff_chart.py

Methods note

The HERD methods note lives at docs/methods_notes/reconstructive_harmonization.md. The deposit's personnel sibling README is at docs/methods_notes/herd_personnel_README.md. The HERD per-year profile is at docs/methods_notes/herd_profile.md.

The full HD 2.1 / HD 2.4 implementation contract — schema, era handling, codeset policy, validation gates — is in docs/methods_notes/herd_panel_etl_scoping.md and docs/hd_2_1_scoping.md.

Repository layout

quadrivium/
├── CLAUDE.md                    project doctrine, locked decisions
├── README.md                    you are here
├── LICENSE                      MIT (code)
├── LICENSE-DATA.md              CC-BY-4.0 (data)
├── crosswalks/                  discipline + question-mapping CSVs (decision_rationale tracked)
├── data/
│   ├── raw/                     raw NSF zips (gitignored payload); MANIFEST.md is the SHA-256 anchor
│   ├── harmonized/              canonical parquets
│   └── reference/               NCSES reference PDFs; MANIFEST.md is the staging anchor
├── docs/                        methods notes, scoping, source documents
├── etl/                         loaders, builders, spikes
└── validation/                  reconciliation reports, per-year profiling

Roadmap

Quadrivium is at Stage 1 of a three-stage trajectory:

Stage 1 (current) — open datasets. HERD harmonization (current). Future migrations: IPEDS, NSF GSS, other NCSES surveys. Each migration applies the Reconstructive Harmonization methodology to that survey's discontinuities; the schema and validation patterns adapt to the survey's structure, the methodology does not.
Stage 2 (planned) — platform. Interactive query and comparative-panel surface on top of the harmonized data.
Stage 3 (planned) — commercial analytics. Analytics built on the platform.

Stages 2 and 3 are not yet built; they frame where the project is headed. Stage-1 work does not assume Stage-2 readiness.

License

Code: MIT — see LICENSE.
Data: CC-BY-4.0 — see LICENSE-DATA.md.

Citation

If you use quadrivium's harmonized panels in research, please cite the deposit and the methods note. Machine-readable citation metadata is in CITATION.cff — the single source of truth for the DOI. The DOI below is the concept DOI (all versions), minted on Zenodo.

Plain text:

Quadrivium contributors (2026). Quadrivium: Reconstructive Harmonization of U.S. Higher-Education Survey Data. Version 2.0.0. Zenodo. DOI: 10.5281/zenodo.20404785 (concept DOI, all versions; see CITATION.cff). License: CC-BY-4.0. Version 2.0 contains two datasets — HERD (R&D expenditure-OUT panels) and Federal S&E Support (federal funding-IN) — joined via the cross-survey institution-identity spine.

BibTeX:

@dataset{quadrivium_2026,
  author    = {{Quadrivium contributors}},
  title     = {{Quadrivium: Reconstructive Harmonization of U.S. Higher-Education Survey Data}},
  year      = {2026},
  version   = {2.0.0},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20404785},
  note      = {Concept DOI (all versions); v2.0.0 version DOI 10.5281/zenodo.20469884; v1.0.0 (HERD-only) version DOI 10.5281/zenodo.20404786. Data CC-BY-4.0; code MIT.}
}

Contributing

External contribution flow is currently issue-based. To propose a crosswalk amendment or methodology extension, open a GitHub issue with: the proposed change, the empirical anchor (which raw HERD year and file, or which published NSF document), and the decision_rationale you would add to the crosswalk row. See CONTRIBUTING.md for full proposal guidance; pull-request mechanics arrive at the platform stage.
