The first Analysis-Ready Dataset (ARD) for astronomy: a three-layer enriched dataset built from DESI DR1, with environmental quenching research as the proving ground.
This repository builds an enriched, analysis-ready dataset from DESI Data Release 1, combining galaxy catalogs, void classifications, QSO properties, and spectral data into a unified resource. The environmental quenching study, comparing galaxy properties in cosmic voids versus walls, serves as the first consumer application and validation of the ARD architecture.
Two downstream projects depend on this ARD:
- desi-qso-anomaly-detection - ML anomaly detection on QSO spectra
- desi-quasar-outflows - AGN feedback and outflow energetics
This section provides brief context for those less familiar with the domain. If you already know DESI and void science, skip to Quick Start.
DESI (Dark Energy Spectroscopic Instrument) is a ground-based survey collecting spectra for tens of millions of galaxies and quasars to map the universe's large-scale structure. Data Release 1 includes spectra for over 18 million unique targets with derived physical properties like stellar mass and star formation rate.
Cosmic voids are vast underdense regions in the universe's large-scale structure, essentially the "bubbles" between the filaments and walls of the cosmic web. Galaxies in voids experience far fewer environmental interactions (mergers and ram-pressure stripping are rare), making them useful laboratories for studying intrinsic galaxy evolution.
Environmental quenching refers to the suppression of star formation in galaxies due to their surroundings. By comparing void galaxies (isolated) to wall galaxies (in denser environments), we can disentangle "nature" (mass-driven evolution) from "nurture" (environment-driven effects). Separating the two is one of the fundamental open questions in galaxy evolution. Recent work using DESI DR1 BGS data has shown that void galaxies tend to be fainter, less massive, and more star-forming than wall galaxies.
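To separate mass-driven from environment-driven effects, the void and wall samples are typically compared at fixed stellar mass. The following is a minimal sketch of that comparison, not this project's analysis code: the quenching threshold of log sSFR < -11 per year is a common convention assumed here, the function names are hypothetical, and the rows are made up for illustration.

```python
import math
from collections import defaultdict

# Assumed convention (not taken from this repository): a galaxy counts as
# quenched when log10(sSFR / yr^-1) falls below -11.
LOG_SSFR_QUENCHED = -11.0


def log_ssfr(log_mass, sfr):
    """log10 specific SFR from log10(M*/Msun) and SFR in Msun/yr (sfr > 0)."""
    return math.log10(sfr) - log_mass


def quenched_fraction_by_mass(galaxies, bin_width=0.5):
    """Quenched fraction per (environment, stellar-mass bin).

    `galaxies` is an iterable of (environment, log_mass, sfr) tuples.
    Returns {(environment, bin_lower_edge): fraction}.
    """
    counts = defaultdict(lambda: [0, 0])  # [n_quenched, n_total]
    for env, log_mass, sfr in galaxies:
        bin_edge = math.floor(log_mass / bin_width) * bin_width
        tally = counts[(env, bin_edge)]
        tally[0] += log_ssfr(log_mass, sfr) < LOG_SSFR_QUENCHED
        tally[1] += 1
    return {key: q / n for key, (q, n) in counts.items()}


# Toy, made-up rows purely for illustration: (environment, log M*, SFR).
toy = [
    ("void", 10.2, 1.5),
    ("void", 10.3, 0.05),
    ("wall", 10.1, 0.02),
    ("wall", 10.4, 0.08),
    ("wall", 10.2, 2.0),
    ("void", 9.1, 0.3),
]
fractions = quenched_fraction_by_mass(toy)
```

Binning by mass first means a difference between the void and wall fractions within one bin cannot be attributed to the samples simply having different mass distributions.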
Navigate to what you need based on your interest.
- Data Dictionary - Schema reference for catalog columns
- ARD Schema - Table structure and relationships
- Phase 04: ARD Foundations - Current architecture work
- Validation Results - Quality checks and sample characteristics
- Work Logs - Complete development history by milestone
- Spectral Pipeline - FITS to Parquet conversion (2.3 TB to 32 GB)
- Catalog ETL - PostgreSQL ingestion pipeline
- ARD Output - Where materialized products will live
- ARD Specification - Domain-agnostic methodology
Current data assets built from DESI DR1 public releases.
| Asset | Count | Source | Status |
|---|---|---|---|
| Galaxy catalog | 6,445,927 rows | FastSpecFit VAC | Ingested |
| Post-QA sample | 6,342,556 rows | Quality cuts applied | Validated |
| Void catalog | 10,752 voids | DESIVAST (4 algorithms) | Ingested |
| Environmental classifications | derived | Void/wall assignment | In Progress |
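The void/wall assignment in the last row can be pictured with a deliberately simplified rule: treat each void as a sphere given by a center and an effective radius, and label a galaxy "void" if it falls inside at least one sphere. This is an assumption made for illustration only. The DESIVAST algorithms define void shapes differently from one another, the rule this project actually applies is not stated here, and the function name, coordinate convention (comoving Cartesian, same length units throughout), and data below are hypothetical.

```python
import math


def classify_environment(galaxy_xyz, voids):
    """Return "void" if the galaxy lies inside any void sphere, else "wall".

    galaxy_xyz: (x, y, z) comoving position.
    voids: iterable of (void_id, (x, y, z), effective_radius), same units.
    """
    for _void_id, center, radius in voids:
        if math.dist(galaxy_xyz, center) <= radius:
            return "void"
    return "wall"


# Made-up voids and galaxies for illustration only.
toy_voids = [
    (1, (0.0, 0.0, 0.0), 15.0),
    (2, (100.0, 0.0, 0.0), 20.0),
]
toy_galaxies = {
    "a": (5.0, 5.0, 5.0),      # about 8.7 from void 1 centre: inside
    "b": (50.0, 0.0, 0.0),     # 50 from both centres: outside both
    "c": (110.0, 10.0, 0.0),   # about 14.1 from void 2 centre: inside
}
labels = {name: classify_environment(pos, toy_voids)
          for name, pos in toy_galaxies.items()}
```

The loop is brute force (every galaxy against every void), which is fine for a demonstration; at the catalog sizes listed above a spatial index or a database-side join would be the more realistic choice.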
| Metric | Value | Notes |
|---|---|---|
| HEALPix tiles processed | 10,800+ | 88.5% of 12,207 total |
| Original FITS size | ~2.3 TB | From DESI S3 archive |
| Parquet output | ~32 GB | 98.6% size reduction |
| Production runtime | 61 hours | Zero manual intervention |
| Property | Range | Notes |
|---|---|---|
| Redshift | z ∈ [0.001, 1.0] | BGS-dominated sample |
| Stellar mass | log(M★/M☉) ∈ [6, 13] | Full mass range |
| Void algorithms | 4 | VIDE, ZOBOV, REVOLVER, VoidFinder |
| Sky coverage | RA [0°, 360°], DEC [-35°, 90°] | DESI footprint |
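The percentages quoted in these tables follow directly from the raw counts. The short check below reproduces them; it assumes decimal units for the archive size (2.3 TB taken as 2,300 GB) and uses 10,800 as a lower bound for the "10,800+" tile count. The helper name is hypothetical.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Counts copied from the tables above.
galaxy_rows = 6_445_927
post_qa_rows = 6_342_556
tiles_done = 10_800        # lower bound; the table says "10,800+"
tiles_total = 12_207
fits_gb = 2_300            # ~2.3 TB, assuming 1 TB = 1000 GB
parquet_gb = 32

qa_retention = percent(post_qa_rows, galaxy_rows)        # share of rows kept by QA
tile_coverage = percent(tiles_done, tiles_total)         # share of tiles processed
size_reduction = percent(fits_gb - parquet_gb, fits_gb)  # FITS to Parquet saving
```

Under those assumptions the three values round to the 98.4% retention, 88.5% tile coverage, and 98.6% size reduction reported in this README.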
This ARD integrates 9 DESI DR1 Value-Added Catalogs across galaxy and QSO domains.
| VAC | Purpose | Key Columns | Link |
|---|---|---|---|
| FastSpecFit | Stellar continuum + emission lines | stellar_mass, sfr, dn4000 | docs |
| PROVABGS | Bayesian SED fitting with posteriors | stellar_mass_p50, sfr_p50, age | docs |
| DESIVAST | Cosmic void classifications | void_id, algorithm, effective_radius | docs |
| Gfinder | Halo-based group catalog | group_id, halo_mass, richness | docs |
| VAC | Purpose | Key Columns | Link |
|---|---|---|---|
| AGN/QSO | Spectral + IR classification | agn_type, wise_agn_flag | docs |
| CIV Absorber | Intervening CIV systems | ew_civ, z_abs, vdisp | docs |
| MgII Absorber | Intervening MgII systems | ew_mgii, z_abs | docs |
| QMassIron | Black hole masses | mbh_mgii, lbol, eddington_ratio | docs |
| Stellar Mass/EmLine | CIGALE masses + emission lines | mass_cg, oii_flux, oiii_flux | docs |
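Integrating the VACs amounts to joining their rows on a shared target identifier. The sketch below illustrates the idea with Python's built-in `sqlite3` as a stand-in for the PostgreSQL instance. The table names and the `targetid` join key are assumptions for this example rather than the project's schema; the other column names come from the tables above, and the rows are made up.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; table names and the
# `targetid` join key are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fastspecfit (targetid INTEGER PRIMARY KEY,
                              stellar_mass REAL, sfr REAL);
    CREATE TABLE gfinder (targetid INTEGER PRIMARY KEY,
                          group_id INTEGER, halo_mass REAL);
""")
# Made-up rows for illustration only.
conn.executemany("INSERT INTO fastspecfit VALUES (?, ?, ?)",
                 [(1, 10.2, 1.5), (2, 9.8, 0.4), (3, 11.0, 0.01)])
conn.executemany("INSERT INTO gfinder VALUES (?, ?, ?)",
                 [(1, 501, 12.4), (3, 502, 13.7)])

# LEFT JOIN keeps galaxies with no group match; their group columns are NULL.
rows = conn.execute("""
    SELECT f.targetid, f.stellar_mass, g.group_id, g.halo_mass
    FROM fastspecfit AS f
    LEFT JOIN gfinder AS g USING (targetid)
    ORDER BY f.targetid
""").fetchall()
conn.close()
```

A left join is used so that galaxies missing from a secondary catalog stay in the unified table with empty columns instead of being dropped silently.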
The Analysis-Ready Dataset employs a three-layer enrichment model, with PostgreSQL as the materialization engine and Parquet as the distribution format.
graph TD
subgraph "Source Data"
A1[DESI DR1<br/>Public Archive] --> A2[Galaxy VACs<br/>FastSpecFit, PROVABGS]
A1 --> A3[Environment VACs<br/>DESIVAST, Gfinder]
A1 --> A4[QSO VACs<br/>AGN, CIV, MgII, BHMass]
A1 --> A5[Spectral Tiles<br/>12K HEALPix]
end
subgraph "Foundation Layer"
A2 --> B1[PostgreSQL<br/>9 Tables]
A3 --> B1
A4 --> B1
A5 --> B2[Parquet<br/>Spectral Store]
B1 --> B3[Cross-Match<br/>Linkage Index]
B2 --> B3
end
subgraph "Physics Layer"
B3 --> C1[Lick Indices<br/>Dn4000, HδA]
B3 --> C2[Kinematics<br/>pPXF Fits]
B3 --> C3[Environment<br/>Metrics]
end
subgraph "AI Layer"
C1 --> D1[Spectral<br/>Embeddings]
C2 --> D1
C3 --> D1
end
subgraph "Consumers"
D1 --> E1[Environmental<br/>Quenching Paper]
D1 --> E2[QSO Anomaly<br/>Detection]
D1 --> E3[Quasar<br/>Outflows]
end
style A1 fill:#e3f2fd
style B1 fill:#336791,color:#fff
style B2 fill:#4ecdc4
style D1 fill:#fff3e0
style E1 fill:#c8e6c9
style E2 fill:#f3e5f5
style E3 fill:#f3e5f5
| Layer | Content | Status |
|---|---|---|
| Foundation | Unified catalog + spectral linkage + environmental classifications | In Progress |
| Physics | Derived quantities: Lick indices, pPXF kinematics, SED posteriors | Planned |
| AI/Embeddings | Neural spectral embeddings, similarity metrics | Planned |
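The layering implies a build order: nothing in the Physics layer can be materialized before the Foundation products it reads, and the embeddings come last. One way to make that ordering explicit is a topological sort over the diagram's arrows, sketched here with the standard-library `graphlib` module (Python 3.9+). The node names are shorthand for the diagram's boxes, not identifiers from this repository.

```python
from graphlib import TopologicalSorter

# Each key depends on the products listed in its value, mirroring the
# arrows in the diagram (node names are shorthand, not repository paths).
dependencies = {
    "postgres_catalog": {"galaxy_vacs", "environment_vacs", "qso_vacs"},
    "parquet_spectra": {"spectral_tiles"},
    "crossmatch_index": {"postgres_catalog", "parquet_spectra"},
    "lick_indices": {"crossmatch_index"},
    "kinematics": {"crossmatch_index"},
    "environment_metrics": {"crossmatch_index"},
    "spectral_embeddings": {"lick_indices", "kinematics", "environment_metrics"},
}

# static_order() yields every node after all of its dependencies.
build_order = list(TopologicalSorter(dependencies).static_order())
```

Several valid orders exist because independent branches can be built in any sequence; the only guarantees are that each product appears after everything it depends on and that `spectral_embeddings`, the single end product here, comes last.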
Development is organized into phases, following a milestone-based structure.
| Phase | Name | Status | Date | Key Outcome |
|---|---|---|---|---|
| 01 | Catalog Acquisition | Complete | Jul 2025 | 6.4M galaxies + 10.7K voids in PostgreSQL |
| 02 | Catalog Validation | Complete | Aug 2025 | 3-stage QA, 98.4% retention |
| 03 | Spectral Pipeline | Complete | Sep 2025 | 10.8K tiles, 98.6% compression |
| 04 | ARD Foundations | Complete | Dec 2025 | Schema v2.0, 9 VACs, Data Dictionary |
| 05 | VAC ETL Sprint | Next | 2026 | Ingest remaining 7 VACs |
Tasks are categorized by complexity for planning purposes.
| Tier | Characteristics | Examples |
|---|---|---|
| Easy | SQL-based, no spectral dependency | Lick indices, sSFR derivation, k-NN density |
| Medium | Established tools, compute time | pPXF kinematics, spectral QA, cross-match index |
| Hard | Significant engineering | Bagpipes SED fitting, embedding generation |
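As an illustration of the "Easy" tier, the k-NN density mentioned above can be defined from the distance to the k-th nearest neighbor. The sketch below uses the common three-dimensional estimator n = k / (4/3 π d_k³). It is written in plain Python to show the definition only: the tier table describes these tasks as SQL-based, and the choice of estimator, the function name, and the points are assumptions made for this example.

```python
import math


def knn_density(positions, k=3):
    """Local number density from the distance to the k-th nearest neighbour.

    Uses n = k / (4/3 * pi * d_k^3), with d_k in the same units as
    `positions`. Brute force, O(N^2): fine for a demo, not for millions of rows.
    """
    densities = []
    for i, p in enumerate(positions):
        dists = sorted(math.dist(p, q)
                       for j, q in enumerate(positions) if j != i)
        d_k = dists[k - 1]
        densities.append(k / (4.0 / 3.0 * math.pi * d_k ** 3))
    return densities


# Made-up points: a tight clump of four plus one isolated point.
toy_positions = [
    (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
    (30.0, 30.0, 30.0),
]
toy_density = knn_density(toy_positions, k=3)
```

The point at the origin has its third neighbor at unit distance, while the isolated point must reach all the way back to the clump, so its estimated density is several orders of magnitude lower.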
desi-cosmic-void-galaxies/
├── assets/                          # Images, diagrams, banners
├── config/                          # Configuration files
├── data/                            # Large files (LFS tracked)
├── desi-cosmic-voids-ard/           # ARD output directory
├── docs/
│   ├── documentation-standards/     # Templates, tagging strategy
│   ├── ARD-SCHEMA-v2.md             # Table structure reference
│   └── ARD-DATA-DICTIONARY-v2.md    # Column definitions
├── internal-files/                  # Working documents
├── output/                          # Processing outputs
├── shared/                          # Cross-cutting assets (SQL schemas, etc.)
├── spec/                            # Specifications
├── staging/                         # Staged work
├── work-logs/                       # Milestone-based development history
│   ├── 01-catalog-acquisition/
│   ├── 02-catalog-validation/
│   ├── 03-spectral-tile-pipeline/
│   └── 04-ard-foundations/
├── AGENTS.md                        # Agent instructions
├── CLAUDE.md                        # Pointer to AGENTS.md
├── ROADMAP.md                       # Full phase plan
├── LICENSE
├── LICENSE-DATA
└── README.md                        # This file
This project runs on the radioastronomy.io research cluster.
| Resource | Host | Purpose |
|---|---|---|
| PostgreSQL 16 | radio-pgsql01 (10.25.20.8) | Catalog storage, materialization engine |
| Spectral tiles | radio-fs02 (10.25.20.15) | Parquet storage |
| GPU compute | ML01 (A4000, 16GB) | ML training, embedding generation |
| Project | Focus | Relationship |
|---|---|---|
| This repo | ARD factory + environmental quenching | Central data provider |
| desi-qso-anomaly-detection | ML outlier detection in QSO spectra | Downstream consumer |
| desi-quasar-outflows | AGN feedback energetics | Downstream consumer |
| Resource | Description |
|---|---|
| analysis-ready-dataset | Domain-agnostic ARD methodology |
| astronomy-rag-corpus | Literature corpus supporting DESI research |
| proxmox-astronomy-lab | Infrastructure documentation |
| DESI DR1 Portal | Official data documentation |
| DESIVAST VAC | Void catalog source |
| FastSpecFit VAC | Galaxy properties source |
This project is licensed under the MIT License. See LICENSE for details.
- DESI Collaboration for Data Release 1 public data
- DESIVAST Team for void catalog across 4 algorithms
- FastSpecFit Team for galaxy stellar population measurements
- radioastronomy.io research cluster for infrastructure
Last Updated: 2026-03-29 | Current Phase: ARD Foundations Complete, VAC ETL Sprint Next

