OpenGenome is a local-first, open-source toolkit for ingesting consumer DNA raw files and generating quality-control, ancestry-positioning, and runs-of-homozygosity (ROH) artifacts.
Current release: v0.1.1.
- Local-first CLI contracts are stable for ingest, qc, pca, roh, and viewer.
- Ancestry modeling core and interactive viewer workspace are available.
- Reference automation is available via opengenome ancestry reference ....
OpenGenome currently exposes these stable command groups.
- opengenome ingest <raw_file> --vendor ancestry --out data/user.parquet
  Input: vendor raw DNA text file. Output: normalized data/user.parquet.
- opengenome qc data/user.parquet --out results/qc/
  Input: normalized user parquet. Outputs: results/qc/qc_summary.json and QC PNG plots.
- opengenome pca data/user.parquet --reference data/reference/tiny_panel.parquet --out results/pca/
  Input: normalized user parquet plus bundled tiny reference panel. Outputs: results/pca/pca_coords.csv, results/pca/pca_plot.png, results/pca/closest_pops.json.
- opengenome roh data/user.parquet --out results/roh/
  Input: normalized user parquet. Outputs: results/roh/roh_summary.json, results/roh/roh_segments_by_chrom.png, results/roh/roh_length_distribution.png.
- opengenome viewer --results results/ [--host 127.0.0.1] [--port 8501] [--headless] [--dry-run]
  Input: root results directory. Behavior: launches a local Streamlit viewer for QC/PCA/ROH artifact inspection plus interactive ancestry modeling. --dry-run prints the resolved Streamlit launch command and exits without executing.
- opengenome ancestry reference fetch --source-url <manifest_url> --out data/reference/manifest.json
  Input: manifest URL. Output: validated manifest JSON on disk.
- opengenome ancestry reference validate --manifest ... --raw-root ...
  Input: manifest and artifact root. Behavior: validates manifest schema and required artifacts.
- opengenome ancestry reference build --manifest ... --raw-root ... --out data/reference/panel.parquet
  Input: validated manifest and build parameters. Output: built panel parquet plus required provenance sidecar at <out>.provenance.json.
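The pca command writes closest_pops.json, which this README describes as a distance-based ranking rather than an ancestry percentage estimate. As a rough illustration of what such a ranking can look like, here is a minimal sketch that orders populations by Euclidean distance between a sample's PCA coordinates and per-population centroids. The use of centroids, the number of components, the function name rank_populations, and the POP_A/POP_B labels are all assumptions for illustration; the actual method is documented in docs/methods.md.

```python
import math


def rank_populations(user_point, centroids):
    """Rank population labels by Euclidean distance to one sample's PCA point.

    user_point: sequence of principal-component coordinates (hypothetical layout).
    centroids: dict mapping population label -> sequence of the same length.
    Returns a list of (label, distance) pairs, nearest first.
    """
    ranked = [
        (label, math.dist(user_point, centroid))
        for label, centroid in centroids.items()
    ]
    # Smallest distance first; ties are broken by label so the order is deterministic.
    ranked.sort(key=lambda item: (item[1], item[0]))
    return ranked
```

A ranking like this only says which reference groups sit nearest in the projected space. It carries no proportions, which is why the output should not be read as ancestry percentages.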
Stable output tree:
data/
  user.parquet
  reference/
    manifest.json
    panel.parquet
    panel.provenance.json
    tiny_panel.parquet
results/
  qc/
    qc_summary.json
    snps_per_chrom.png
  pca/
    pca_coords.csv
    pca_plot.png
    closest_pops.json
  roh/
    roh_summary.json
    roh_segments_by_chrom.png
    roh_length_distribution.png
Output schema details are documented in docs/output_contracts.md.
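Because the output tree is stable, a script can check which artifacts a run has produced so far. The sketch below is one way to do that, not part of OpenGenome itself: the file names are copied from the tree above, while the helper name missing_artifacts and the grouping by command are assumptions.

```python
from pathlib import Path

# Relative paths copied from the stable output tree in this README.
EXPECTED_ARTIFACTS = {
    "qc": ["qc/qc_summary.json", "qc/snps_per_chrom.png"],
    "pca": ["pca/pca_coords.csv", "pca/pca_plot.png", "pca/closest_pops.json"],
    "roh": [
        "roh/roh_summary.json",
        "roh/roh_segments_by_chrom.png",
        "roh/roh_length_distribution.png",
    ],
}


def missing_artifacts(results_root):
    """Return {command: [missing relative paths]} for a results directory."""
    root = Path(results_root)
    missing = {}
    for command, rel_paths in EXPECTED_ARTIFACTS.items():
        absent = [rel for rel in rel_paths if not (root / rel).is_file()]
        if absent:
            missing[command] = absent
    return missing
```

Calling missing_artifacts("results") returns an empty dict when every listed file exists. Otherwise the keys name the commands worth re-running.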
python -m pip install -e ".[dev]"
opengenome ingest path/to/AncestryDNA.txt --vendor ancestry --out data/user.parquet
opengenome qc data/user.parquet --out results/qc/
opengenome pca data/user.parquet --reference data/reference/phase3_subset.parquet --out results/pca/
opengenome roh data/user.parquet --out results/roh/
python -m pip install -e ".[viewer]"
opengenome viewer --results results/
opengenome viewer --results results/ --host 127.0.0.1 --port 8501 --headless
opengenome viewer --results results/ --dry-run

This quickstart assumes you already built or provided a reference parquet at data/reference/phase3_subset.parquet.
See docs/reference_panels.md for manual panel setup from 1000 Genomes Phase 3 files.
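If you prefer to drive the quickstart from a script, the four analysis steps can be chained with the standard library. This is a sketch under stated assumptions: it only shells out to the commands shown above, and the helper names quickstart_commands and run_quickstart are hypothetical, not an OpenGenome API.

```python
import shlex
import subprocess


def quickstart_commands(raw_file, reference, user="data/user.parquet", results="results"):
    """Build the quickstart steps as argv lists, in the order shown above."""
    return [
        ["opengenome", "ingest", raw_file, "--vendor", "ancestry", "--out", user],
        ["opengenome", "qc", user, "--out", f"{results}/qc/"],
        ["opengenome", "pca", user, "--reference", reference, "--out", f"{results}/pca/"],
        ["opengenome", "roh", user, "--out", f"{results}/roh/"],
    ]


def run_quickstart(raw_file, reference, execute=False):
    """Print each step; run it only when execute=True, stopping at the first failure."""
    for argv in quickstart_commands(raw_file, reference):
        print(shlex.join(argv))
        if execute:
            # check=True raises CalledProcessError if a step exits non-zero.
            subprocess.run(argv, check=True)
```

With execute=False the function only prints the commands, which is a cheap way to review paths before anything touches your data.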
- Ancestry parser supports both raw export formats:
  - combined genotype column (allele)
  - split genotype columns (allele1, allele2)
- Parser normalizes 0 -> - and accepts I/D alleles used in some Ancestry exports.
- QC chromosome labels are normalized for display and summaries:
  - numeric aliases 23/24/25/26 map to X/Y/XY/MT
  - deterministic chromosome ordering is natural (1..22, X, Y, XY, MT)
- Reference workflows are available via:
  - opengenome ancestry reference fetch
  - opengenome ancestry reference validate
  - opengenome ancestry reference build
- Real Phase 3 raw reference files are still expected under data/reference/raw/ for phase3_vcf_bundle builds.
- Large/sensitive local artifacts (personal DNA data, bulk reference raws, generated results) are intentionally ignored via .gitignore.
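The parser and QC normalization rules in the notes above can be illustrated with a few small functions. This is a sketch of the described behavior, not OpenGenome's actual code: the function names, the handling of unknown chromosome labels, and the assumption that a combined genotype is exactly two single-character alleles are all hypothetical.

```python
CHROM_ALIASES = {"23": "X", "24": "Y", "25": "XY", "26": "MT"}
CHROM_ORDER = [str(n) for n in range(1, 23)] + ["X", "Y", "XY", "MT"]


def normalize_chrom(label):
    """Map numeric aliases 23/24/25/26 to X/Y/XY/MT; leave other labels unchanged."""
    text = str(label).strip().upper()
    return CHROM_ALIASES.get(text, text)


def chrom_sort_key(label):
    """Sort key giving 1..22, X, Y, XY, MT; unknown labels sort last, alphabetically."""
    chrom = normalize_chrom(label)
    if chrom in CHROM_ORDER:
        return (0, CHROM_ORDER.index(chrom), "")
    return (1, 0, chrom)


def normalize_genotype(allele1, allele2=None):
    """Return an (allele1, allele2) pair from combined or split columns.

    The no-call code "0" becomes "-"; other alleles, including I and D, pass through.
    """
    if allele2 is None:
        # Combined column: assumed to hold exactly two single-character alleles.
        if len(allele1) != 2:
            raise ValueError(f"unexpected combined genotype: {allele1!r}")
        allele1, allele2 = allele1[0], allele1[1]
    return tuple("-" if allele == "0" else allele for allele in (allele1, allele2))
```

Sorting with key=chrom_sort_key puts "10" after "2", which plain string sorting would not.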
make lint
make typecheck
make test

- User-friendly results guide: docs/understanding_results.md
- CLI reference: docs/usage.md
- Methods and caveats: docs/methods.md
- Reference panel integration: docs/reference_panels.md
- Output contracts: docs/output_contracts.md
- Streamlit missing:
  - Symptom: viewer failed: Streamlit is not installed
  - Fix: python -m pip install -e ".[viewer]"
- Interactive Plotly charts missing:
  - Symptom: QC/PCA interactive charts do not render
  - Fix: python -m pip install plotly==5.24.1
- Verify launch command without execution:
  - opengenome viewer --results results/ --dry-run
- Partial or missing artifacts in UI:
  - Run producing commands:
    - opengenome qc data/user.parquet --out results/qc/
    - opengenome pca data/user.parquet --reference data/reference/phase3_subset.parquet --out results/pca/
    - opengenome roh data/user.parquet --out results/roh/
  - Viewer status panel reports missing file paths and discovered summary schema versions.
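The first two troubleshooting items come down to whether Streamlit and Plotly are importable in the active environment. A quick check with the standard library, sketched here with a hypothetical helper name, can confirm that before launching the viewer:

```python
import importlib.util


def missing_modules(names=("streamlit", "plotly")):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

If the result includes "streamlit", install the viewer extra as described above. If it includes "plotly", install the pinned Plotly version.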
- Supports ancestry vendor input only.
- PCA output is sensitive to reference composition and SNP overlap quality.
- No dedicated opengenome ancestry fit|compare|export CLI commands yet (interactive ancestry flow is in viewer).
- No ancestry percentage estimation; closest_pops.json is distance-based ranking only.
- ROH segment calls are threshold-dependent and sensitive to SNP density and missingness.
- ROH output is exploratory population-genetics context, not medical or diagnostic evidence.
- No medical, diagnostic, or health interpretation.
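To see why ROH calls depend on thresholds, consider a deliberately simple run scanner. It is a sketch for intuition only and is not OpenGenome's algorithm: the function name find_roh, the default thresholds, and the rule that heterozygous calls end a run while missing calls are skipped are all assumptions. Production ROH callers typically use sliding windows and tolerate a limited number of heterozygous or missing calls.

```python
def find_roh(calls, min_snps=50, min_length_bp=1_000_000):
    """Return (start, end, n_hom) runs of homozygous calls on one chromosome.

    calls: iterable of (position, state) sorted by position, where state is
    "hom", "het", or "missing". Het calls end a run; missing calls are skipped.
    """
    segments = []
    start = end = None
    n_hom = 0

    def flush():
        # Keep the current run only if it passes both thresholds.
        if start is not None and n_hom >= min_snps and end - start >= min_length_bp:
            segments.append((start, end, n_hom))

    for position, state in calls:
        if state == "hom":
            if start is None:
                start = position
            end = position
            n_hom += 1
        elif state == "het":
            flush()
            start = end = None
            n_hom = 0
        # "missing" neither extends nor breaks the run in this sketch.
    flush()
    return segments
```

Lowering min_snps or min_length_bp admits shorter runs, and a sparse or gappy SNP set changes both counts, so the same genome can yield different segment lists under different settings.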