Open-source harmonization of U.S. higher-education survey data into reproducible analytical panels.
The current scope is NSF HERD (Higher Education Research and Development survey), FY 1972–2024. The roadmap covers IPEDS, NSF GSS, and other NCSES surveys.
Higher-education survey data has methodological discontinuities — era boundaries in survey instruments, encoding shifts, taxonomy redesigns, infrastructure changes. Most published analyses treat the data as if those discontinuities don't exist, or skip the eras where they do.
Quadrivium applies Reconstructive Harmonization:
(a) reconstruct what each era can support on its own terms (rules, crosswalks, validated reconstructions);
(b) decompose what crossing a discontinuity actually involves into named, quantified components (real growth, definitional change, population expansion, residual unmeasurables);
(c) publish both the reconstruction and the decomposition with sufficient documentation that a cold reader can use either without misreading the discontinuity.
This is not a bridge across discontinuities. It is the discipline of making operational data legible across them by being precise about what is reconstructible, what is decomposable, and what remains unmeasurable. See docs/methods_notes/reconstructive_harmonization.md for the methodological account applied to the 2010 HERD era boundary.
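The decomposition in step (b) is an accounting identity: the observed change across a discontinuity is split into the named components, and whatever they do not explain is reported as the residual instead of being absorbed silently. Below is a minimal sketch of that arithmetic. The class name, field names, and all numbers are hypothetical illustrations, not the repository's schema or its results for the 2010 boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundaryDecomposition:
    """Observed change across an era boundary, split into named components.

    All values share one unit (for example, millions of dollars).
    Field names are illustrative assumptions, not the repository's schema.
    """

    observed_change: float       # post-boundary value minus pre-boundary value
    real_growth: float           # change attributable to actual growth
    definitional_change: float   # change attributable to redefined survey items
    population_expansion: float  # change attributable to newly surveyed institutions

    @property
    def residual(self) -> float:
        """Part of the observed change that the named components do not explain."""
        named = (
            self.real_growth
            + self.definitional_change
            + self.population_expansion
        )
        return self.observed_change - named

    def shares(self) -> dict[str, float]:
        """Each component, residual included, as a fraction of the observed change."""
        if self.observed_change == 0:
            raise ValueError("observed_change is zero; shares are undefined")
        parts = {
            "real_growth": self.real_growth,
            "definitional_change": self.definitional_change,
            "population_expansion": self.population_expansion,
            "residual": self.residual,
        }
        return {name: value / self.observed_change for name, value in parts.items()}


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
example = BoundaryDecomposition(
    observed_change=100.0,
    real_growth=40.0,
    definitional_change=35.0,
    population_expansion=15.0,
)
print(example.residual)  # 10.0
print(example.shares())
```

Keeping the residual as an explicit, derived quantity means the four shares always sum to one, so a reader can see how much of a cross-boundary jump remains unmeasured.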
- data/harmonized/herd_panel.parquet: 50-year field-level R&D expenditure panel (FY 1975–2024), two parallel reconstructed series across the 2010 era boundary.
- data/harmonized/herd_panel_attributes.parquet: institution-year Q4/Q5 attribute sibling with medical-school and clinical-trials share and value columns.
- data/harmonized/herd_personnel.parquet: Q15 headcount + Q16 FTE personnel panel for FY 2022–2024 (the microdata-bearing years; NCSES Data Table 26 publishes institution totals for FY 2020–2024, but FY 2020–2021 are aggregate-only, with no per-institution microdata). Carries no quality_flag column, a documented imputation-provenance asymmetry with the financial panel (see docs/methods_notes/herd_panel_etl_scoping.md §12).
Companion validation reports in validation/reports/ carry the reconciliation against published NSF / NCSES ground truth.
git clone https://github.com/QuinnyXu/quadrivium.git quadrivium
cd quadrivium
uv sync
uv run python etl/build_herd_panel.py # rebuild financial + attribute parquets
uv run python etl/build_herd_personnel.py # rebuild personnel parquet
Requirements: Python 3.12 and uv (installed locally; this repo pins uv 0.11.8 in the lockfile and runtime deps to duckdb==1.5.2 + pypdf==6.10.2).
Raw NSF HERD zips are not redistributed via git. SHA-256 manifests in data/raw/MANIFEST.md document the exact files that reproduce the harmonized outputs; download from NSF's HERD survey archive (URLs listed in the MANIFEST).
A cold reader with the lockfile, the raw zips named in data/raw/MANIFEST.md, and the NCSES reference PDFs in data/reference/ can rebuild a bit-identical harmonized parquet (modulo parquet writer determinism on a fixed input-and-code-version pair).
Ships in the deposit (tracked in git, CC-BY-4.0): the three harmonized parquets in data/harmonized/ — SHA-256s pinned in data/harmonized/MANIFEST.md — plus the crosswalks, the methods notes, the validation reports, the NCSES reference PDFs (data/reference/), the lockfile, and the build scripts. You can use the harmonized panels directly, or rebuild them.
Fetched from NSF (not redistributed): the 53 raw HERD year zips and 13 short-form zips. Their SHA-256s and download URLs are in data/raw/MANIFEST.md; they are U.S. government work, staged by checksum rather than redistributed (the provenance-clean choice — the zip is the bit-identical artifact NSF shipped). A consumer rebuilding from raw obtains them from NSF's HERD archive.
The integrity round-trip: raw-zip SHAs (NSF-fetched, data/raw/MANIFEST.md) → uv sync + build → harmonized-parquet SHAs (deposit-shipped, data/harmonized/MANIFEST.md). A consumer who fetches the raw zips, verifies them against data/raw/MANIFEST.md, and then runs uv sync and the build scripts reproduces the harmonized SHAs pinned in data/harmonized/MANIFEST.md. This round-trip has been verified end-to-end from a clean checkout: the harmonized panel rebuilds to the exact pinned SHA, and the FY 2024 verification grid re-asserts 58/58 at +0.000%.
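Both ends of the round-trip reduce to comparing SHA-256 digests of files on disk with pinned values. Below is a minimal sketch using only the Python standard library. It does not parse either MANIFEST.md, because the layout of those files is not described here; the `expected` mapping of filename to hex digest is an assumption that the reader fills in by hand from the manifest, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_against_manifest(directory: Path, expected: dict[str, str]) -> dict[str, str]:
    """Compare files in `directory` with expected digests.

    `expected` maps filename -> hex SHA-256, transcribed from a manifest
    (assumed format). Returns filename -> "ok", "mismatch", or "missing".
    """
    results: dict[str, str] = {}
    for name, want in expected.items():
        target = directory / name
        if not target.is_file():
            results[name] = "missing"
        elif sha256_of_file(target) == want.strip().lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

The same helper can be run against data/raw/ before the build and against data/harmonized/ after it; any status other than "ok" means the round-trip did not close for that file.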
Methods-note figures are not deposit runtime. To rebuild figures:
uv sync --group charts
uv run --group charts python etl/spikes/era_2010_decomposition_chart.py
uv run --group charts python etl/spikes/herd_question_count_cliff_chart.py
The HERD methods note lives at docs/methods_notes/reconstructive_harmonization.md. The deposit's personnel sibling README is at docs/methods_notes/herd_personnel_README.md. The HERD per-year profile is at docs/methods_notes/herd_profile.md.
The full HD 2.1 / HD 2.4 implementation contract — schema, era handling, codeset policy, validation gates — is in docs/methods_notes/herd_panel_etl_scoping.md and docs/hd_2_1_scoping.md.
quadrivium/
├── CLAUDE.md project doctrine, locked decisions
├── README.md you are here
├── LICENSE MIT (code)
├── LICENSE-DATA.md CC-BY-4.0 (data)
├── crosswalks/ discipline + question-mapping CSVs (decision_rationale tracked)
├── data/
│ ├── raw/ raw NSF zips (gitignored payload); MANIFEST.md is the SHA-256 anchor
│ ├── harmonized/ canonical parquets
│ └── reference/ NCSES reference PDFs; MANIFEST.md is the staging anchor
├── docs/ methods notes, scoping, source documents
├── etl/ loaders, builders, spikes
└── validation/ reconciliation reports, per-year profiling
Quadrivium is at Stage 1 of a three-stage trajectory:
- Stage 1 (current) — open datasets. HERD harmonization (current). Future migrations: IPEDS, NSF GSS, other NCSES surveys. Each migration applies the Reconstructive Harmonization methodology to that survey's discontinuities; the schema and validation patterns adapt to the survey's structure, the methodology does not.
- Stage 2 (planned) — platform. Interactive query and comparative-panel surface on top of the harmonized data.
- Stage 3 (planned) — commercial analytics. Analytics built on the platform.
Stages 2 and 3 are not built now; they are the durable framing of where the project goes. Stage-1 work does not assume Stage-2 readiness.
- Code: MIT; see LICENSE.
- Data: CC-BY-4.0; see LICENSE-DATA.md.
If you use quadrivium's harmonized panels in research, please cite the deposit and the methods note. Machine-readable citation metadata is in CITATION.cff — the single source of truth for the DOI. The DOI below is the concept DOI (all versions), minted on Zenodo.
Plain text:
Quadrivium contributors (2026). Quadrivium: Reconstructive Harmonization of U.S. Higher-Education Survey Data. Version 2.0.0. Zenodo. DOI: 10.5281/zenodo.20404785 (concept DOI, all versions; see CITATION.cff). License: CC-BY-4.0. Version 2.0 contains two datasets, HERD (R&D expenditure-OUT panels) and Federal S&E Support (federal funding-IN), joined via the cross-survey institution-identity spine.
BibTeX:
@dataset{quadrivium_2026,
author = {{Quadrivium contributors}},
title = {{Quadrivium: Reconstructive Harmonization of U.S. Higher-Education Survey Data}},
year = {2026},
version = {2.0.0},
publisher = {Zenodo},
doi = {10.5281/zenodo.20404785},
note = {Concept DOI (all versions); v2.0.0 version DOI 10.5281/zenodo.20469884; v1.0.0 (HERD-only) version DOI 10.5281/zenodo.20404786. Data CC-BY-4.0; code MIT.}
}
External contribution flow is currently issue-based. To propose a crosswalk amendment or methodology extension, open a GitHub issue with: the proposed change, the empirical anchor (which raw HERD year and file, or which published NSF document), and the decision_rationale you would add to the crosswalk row. See CONTRIBUTING.md for full proposal guidance; pull-request mechanics arrive at the platform stage.