mwt2212/opengenome

OpenGenome

OpenGenome is a local-first, open-source toolkit for ingesting consumer DNA raw files and generating quality-control, ancestry positioning, and homozygosity (ROH) artifacts.

Status

Current release: v0.1.1.

  • Local-first CLI contracts are stable for ingest, qc, pca, roh, and viewer.
  • Ancestry modeling core and interactive viewer workspace are available.
  • Reference automation is available via opengenome ancestry reference ....

Current Command Contract

OpenGenome currently exposes these stable command groups.

  1. opengenome ingest <raw_file> --vendor ancestry --out data/user.parquet
     Input: vendor raw DNA text file.
     Output: normalized data/user.parquet.

  2. opengenome qc data/user.parquet --out results/qc/
     Input: normalized user parquet.
     Outputs: results/qc/qc_summary.json and QC PNG plots.

  3. opengenome pca data/user.parquet --reference data/reference/tiny_panel.parquet --out results/pca/
     Input: normalized user parquet plus bundled tiny reference panel.
     Outputs: results/pca/pca_coords.csv, results/pca/pca_plot.png, results/pca/closest_pops.json.

  4. opengenome roh data/user.parquet --out results/roh/
     Input: normalized user parquet.
     Outputs: results/roh/roh_summary.json, results/roh/roh_segments_by_chrom.png, results/roh/roh_length_distribution.png.

  5. opengenome viewer --results results/ [--host 127.0.0.1] [--port 8501] [--headless] [--dry-run]
     Input: root results directory.
     Behavior: launches a local Streamlit viewer for QC/PCA/ROH artifact inspection plus interactive ancestry modeling. --dry-run prints the resolved Streamlit launch command and exits without executing it.

  6. opengenome ancestry reference fetch --source-url <manifest_url> --out data/reference/manifest.json
     Input: manifest URL.
     Output: validated manifest JSON on disk.

  7. opengenome ancestry reference validate --manifest ... --raw-root ...
     Input: manifest and artifact root.
     Behavior: validates manifest schema and required artifacts.

  8. opengenome ancestry reference build --manifest ... --raw-root ... --out data/reference/panel.parquet
     Input: validated manifest and build parameters.
     Output: built panel parquet plus required provenance sidecar at <out>.provenance.json.

Stable output tree:

data/
  user.parquet
  reference/
    manifest.json
    panel.parquet
    panel.provenance.json
    tiny_panel.parquet
results/
  qc/
    qc_summary.json
    snps_per_chrom.png
  pca/
    pca_coords.csv
    pca_plot.png
    closest_pops.json
  roh/
    roh_summary.json
    roh_segments_by_chrom.png
    roh_length_distribution.png

Output schema details are documented in docs/output_contracts.md.
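
As a quick sanity check after running the commands, the stable output tree can be compared against what is actually on disk. The sketch below is not part of OpenGenome; the helper name missing_artifacts is hypothetical, and the file list is copied from the results/ portion of the tree above.

```python
from pathlib import Path

# Relative paths copied from the results/ part of the stable output tree.
EXPECTED_ARTIFACTS = [
    "qc/qc_summary.json",
    "qc/snps_per_chrom.png",
    "pca/pca_coords.csv",
    "pca/pca_plot.png",
    "pca/closest_pops.json",
    "roh/roh_summary.json",
    "roh/roh_segments_by_chrom.png",
    "roh/roh_length_distribution.png",
]


def missing_artifacts(results_root):
    """Return the expected artifact paths that are absent under results_root."""
    root = Path(results_root)
    return [rel for rel in EXPECTED_ARTIFACTS if not (root / rel).is_file()]


if __name__ == "__main__":
    # Hypothetical usage: report anything missing under ./results.
    for rel in missing_artifacts("results"):
        print(f"missing: results/{rel}")
```

An empty list means every file named in the tree is present; it says nothing about whether the file contents match the schemas in docs/output_contracts.md.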

Quickstart

python -m pip install -e ".[dev]"
opengenome ingest path/to/AncestryDNA.txt --vendor ancestry --out data/user.parquet
opengenome qc data/user.parquet --out results/qc/
opengenome pca data/user.parquet --reference data/reference/phase3_subset.parquet --out results/pca/
opengenome roh data/user.parquet --out results/roh/
python -m pip install -e ".[viewer]"
opengenome viewer --results results/
opengenome viewer --results results/ --host 127.0.0.1 --port 8501 --headless
opengenome viewer --results results/ --dry-run

This quickstart assumes you have already built or provided a reference parquet at data/reference/phase3_subset.parquet. See docs/reference_panels.md for manual panel setup from 1000 Genomes Phase 3 files.
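
For scripted runs, the analysis steps of the quickstart can be assembled as argument lists and printed before anything is executed. This is a minimal sketch, not an OpenGenome API: pipeline_commands and run_pipeline are hypothetical helper names, and only the commands and flags shown in the quickstart are used.

```python
import shlex
import subprocess


def pipeline_commands(raw_file, reference, user="data/user.parquet", results="results"):
    """Build the quickstart's ingest -> qc -> pca -> roh commands as argv lists."""
    return [
        ["opengenome", "ingest", raw_file, "--vendor", "ancestry", "--out", user],
        ["opengenome", "qc", user, "--out", f"{results}/qc/"],
        ["opengenome", "pca", user, "--reference", reference, "--out", f"{results}/pca/"],
        ["opengenome", "roh", user, "--out", f"{results}/roh/"],
    ]


def run_pipeline(commands, execute=False):
    """Print each command; run it only when execute=True, stopping on the first failure."""
    for argv in commands:
        print(shlex.join(argv))
        if execute:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    cmds = pipeline_commands(
        "path/to/AncestryDNA.txt",
        "data/reference/phase3_subset.parquet",
    )
    run_pipeline(cmds)  # prints only; pass execute=True to actually run
```

Keeping execution opt-in makes it possible to review the exact commands before they touch personal DNA data.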

Operational Notes

  • Ancestry parser supports both raw export formats:
    • combined genotype column (allele)
    • split genotype columns (allele1, allele2)
  • Parser normalizes 0 -> - and accepts I/D alleles used in some Ancestry exports.
  • QC chromosome labels are normalized for display and summaries:
    • numeric aliases 23/24/25/26 map to X/Y/XY/MT
    • deterministic chromosome ordering is natural (1..22, X, Y, XY, MT)
  • Reference workflows are available via:
    • opengenome ancestry reference fetch
    • opengenome ancestry reference validate
    • opengenome ancestry reference build
  • Real Phase 3 raw reference files are still expected under data/reference/raw/ for phase3_vcf_bundle builds.
  • Large/sensitive local artifacts (personal DNA data, bulk reference raws, generated results) are intentionally ignored via .gitignore.
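
The allele and chromosome normalization rules above can be illustrated with a short sketch. It is an assumption-laden illustration rather than OpenGenome's actual parser code: the function names are hypothetical, and only the mappings stated in the notes (0 -> -, 23/24/25/26 -> X/Y/XY/MT, ordering 1..22, X, Y, XY, MT) are taken from this document.

```python
# Numeric aliases and display order as described in the operational notes.
CHROM_ALIASES = {"23": "X", "24": "Y", "25": "XY", "26": "MT"}
CHROM_ORDER = [str(n) for n in range(1, 23)] + ["X", "Y", "XY", "MT"]
_RANK = {label: i for i, label in enumerate(CHROM_ORDER)}


def normalize_chrom(label):
    """Map numeric aliases 23/24/25/26 to X/Y/XY/MT; leave other labels unchanged."""
    text = str(label).strip().upper()
    return CHROM_ALIASES.get(text, text)


def sort_chroms(labels):
    """Return unique normalized labels in natural order (1..22, X, Y, XY, MT).

    Labels outside the known set sort last, alphabetically (an assumption).
    """
    unique = {normalize_chrom(label) for label in labels}
    return sorted(unique, key=lambda c: (_RANK.get(c, len(_RANK)), c))


def normalize_allele(allele):
    """Normalize the no-call marker 0 to -; pass other alleles (including I/D) through."""
    return "-" if allele == "0" else allele
```

For example, sort_chroms(["10", "2", "23", "MT", "1"]) yields ["1", "2", "10", "X", "MT"], whereas a plain string sort would place "10" before "2".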

Development

make lint
make typecheck
make test

Documentation

  • User-friendly results guide: docs/understanding_results.md
  • CLI reference: docs/usage.md
  • Methods and caveats: docs/methods.md
  • Reference panel integration: docs/reference_panels.md
  • Output contracts: docs/output_contracts.md

Viewer Troubleshooting

  • Streamlit missing:
    • Symptom: viewer failed: Streamlit is not installed
    • Fix: python -m pip install -e ".[viewer]"
  • Interactive Plotly charts missing:
    • Symptom: QC/PCA interactive charts do not render
    • Fix: python -m pip install plotly==5.24.1
  • Verify launch command without execution:
    • opengenome viewer --results results/ --dry-run
  • Partial or missing artifacts in UI:
    • Run producing commands:
      • opengenome qc data/user.parquet --out results/qc/
      • opengenome pca data/user.parquet --reference data/reference/phase3_subset.parquet --out results/pca/
      • opengenome roh data/user.parquet --out results/roh/
    • Viewer status panel reports missing file paths and discovered summary schema versions.
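
Before launching the viewer, the two dependency symptoms above can be checked from Python without importing the packages. This is a small sketch using only the standard library; missing_viewer_packages is a hypothetical helper, not an OpenGenome command.

```python
import importlib.util


def missing_viewer_packages(packages=("streamlit", "plotly")):
    """Return the names of packages that cannot be found in the current environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_viewer_packages()
    if missing:
        print("not installed:", ", ".join(missing))
    else:
        print("streamlit and plotly are importable")
```

If streamlit is reported, the fix listed above is python -m pip install -e ".[viewer]"; if plotly is reported, it is python -m pip install plotly==5.24.1.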

Known Limitations

  • Only the ancestry vendor format (--vendor ancestry) is supported as input.
  • PCA output is sensitive to reference composition and SNP overlap quality.
  • There are no dedicated opengenome ancestry fit|compare|export CLI commands yet; the interactive ancestry flow is available only in the viewer.
  • No ancestry percentage estimation; closest_pops.json is distance-based ranking only.
  • ROH segment calls are threshold-dependent and sensitive to SNP density and missingness.
  • ROH output is exploratory population-genetics context, not medical or diagnostic evidence.
  • No medical, diagnostic, or health interpretation.
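
To make the distinction between a distance-based ranking and a percentage estimate concrete, here is a minimal sketch. It assumes Euclidean distance between the user's PCA coordinates and per-population centroids, which this README does not confirm as OpenGenome's method; the function name, population labels, and coordinates are all hypothetical, and the real closest_pops.json schema is described in docs/output_contracts.md.

```python
import math


def rank_populations(user_coords, centroids):
    """Rank populations by Euclidean distance from the user's PCA coordinates.

    centroids maps a population label to its coordinates in the same PC space.
    Returns (label, distance) pairs, nearest first.
    """
    ranked = [(pop, math.dist(user_coords, center)) for pop, center in centroids.items()]
    return sorted(ranked, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical two-component coordinates, for illustration only.
    example_centroids = {"POP_A": (0.0, 0.0), "POP_B": (3.0, 4.0)}
    for pop, distance in rank_populations((0.0, 0.0), example_centroids):
        print(pop, round(distance, 3))
```

The output is an ordering with distances, not a set of proportions that sum to 100%, and it shifts whenever the reference panel composition or the SNP overlap changes.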
