feat(dataset_metrics): config-driven dataset profiler step for the metric page #135

Draft
d0choa wants to merge 12 commits into main from feat/dataset-metrics-profiler

d0choa commented May 20, 2026

Contributor

Summary

New Polars transformer dataset_metrics (sibling of release_metrics) that profiles every output/* dataset into one parquet per dataset under a root-level metrics/ directory, for the Platform/PPP metric page (opentargets/issues#4328).

Each per-dataset profile row carries:

  • id, count (read from parquet footers — no data scan), file_size, number_of_partitions
  • config-driven breakdowns — SQL-expression groupings (pl.sql_expr), with list columns auto-exploded
  • config-driven filter_counts — SQL filter, optional distinct column

The dataset_metrics step is registered in config.yaml with the metric-page groupings/filters: study (studyType, projectId), credible_set, variant (most-severe consequence), colocalisation (study-type pair), drug_molecule (clinical stage), clinical_report (clinical stage), disease (therapeutic area), l2g_prediction (prioritised genes, L2G>0.5 distinct gene), associations by datasource/datatype (direct & indirect), and an evidence_* pattern key for datatype.

Discovery/count helpers are reused from release_metrics; release_metrics and its HuggingFace pipeline are untouched. A disposable (uncommitted) scripts/metrics_to_jsonl.py converts the output to shareable JSONL.

Test plan

  • uv run pytest test/test_dataset_metrics.py --doctest-modules src/pts/transformers/dataset_metrics.py → 16 passed
  • uv run ruff check src/pts/transformers/dataset_metrics.py test/test_dataset_metrics.py → clean
  • Validated read-only against gs://open-targets-pipeline-runs/ds/26.03-test5/: study count 2,001,827 (+ studyType/projectId breakdowns); colocalisation 217,914,314 (studyTypePair: gwas-gwas 167M, gwas-eqtl 28.9M, …); l2g prioritised_genes (score>0.5 distinct geneId) 15,509
  • Run pts --step dataset_metrics on Dataproc against 26.03-test5 → writes metrics/<dataset>.parquet; export JSONL for stakeholders

d0choa force-pushed the feat/dataset-metrics-profiler branch from 797afed to aa410eb on May 20, 2026 15:20
Spark outputs use part-*.parquet but transformer-written datasets (disease,
clinical_report, go, so, ...) are single non-part-* parquet files. The previous
part-* filter reported 0 file_size/number_of_partitions for those 8 datasets
even though counts were correct. Match *.parquet files instead.
d0choa commented May 20, 2026

Contributor Author

✅ Validation run on Dataproc (26.03-test5)

Ran the new dataset_metrics step end-to-end on Dataproc against the 26.03-test5 run (read-only inputs), with the branch installed from this PR.

Setup

  • Cluster: single-node Dataproc (--single-node, image 2.2, n2d-highmem-16). dataset_metrics is a single-node Polars transformer — no Spark workers needed. Branch installed via init action (pts @ git+…@feat/dataset-metrics-profiler).
  • Command: pts -s dataset_metrics -c dataset_metrics.yaml -r gs://open-targets-pipeline-runs/ds/26.03-test5
  • Inputs (read-only): gs://open-targets-pipeline-runs/ds/26.03-test5/output/* (55 datasets discovered).
  • Outputs: gs://ot-team/ochoa/dataset_metrics/run-002/metrics/ — one parquet per dataset (55 files), scratch location (nothing written to the shared pipeline-runs bucket).

Runtime

  • Profiled all 55 datasets in ~54 s (transform task; step completed in 56.6 s, ~64 s job wall-clock on the warm node).
  • Counts/sizes/partitions are footer + storage-listing (no data scan); only the configured low-cardinality groupings/filters touch column data. The largest dataset, colocalisation (200 files, 17.5 GB, 218 M rows), groups by rightStudyType reading only its ~67 MB column.

Headline counts (from the per-dataset profiles)

| dataset | count |
| --- | ---: |
| target | 78,691 |
| disease | 47,030 |
| drug_molecule | 22,230 |
| study | 2,001,827 |
| credible_set | 3,491,182 |
| variant | 7,432,549 |
| colocalisation | 217,914,314 |
| clinical_report | 285,213 |
| association_overall_direct | 4,508,002 |
| association_overall_indirect | 12,466,856 |
| Σ evidence_* (20 sources) | 34,086,838 |
| l2g_prediction · prioritised_genes (L2G>0.5, distinct gene) | 15,509 |

Example breakdowns

  • study.studyType: eqtl 1,233,604 · tuqtl 364,196 · sqtl 213,952 · gwas 136,245 · sceqtl 50,075 · pqtl 3,755
  • colocalisation.studyTypePair: gwas-gwas 167,048,267 · gwas-eqtl 28,944,408 · gwas-tuqtl 8,500,302 · gwas-pqtl 6,264,043 · gwas-sqtl 4,706,298 · gwas-sceqtl 2,450,996
  • association_by_datatype_indirect.datatype: literature 8,246,446 · genetic_association 2,823,514 · animal_model 2,461,182 · somatic_mutation 406,376 · known_drug 259,530 · rna_expression 166,503 · genetic_literature 123,072 · affected_pathway 80,993
  • disease.therapeuticArea: all 25 therapeutic areas populated.

_dataset_file_stats fix (latest commit)

The first run surfaced that 8 transformer-written datasets (disease, clinical_report, clinical_indication, clinical_target, disease_hpo, disease_phenotype, go, so) reported file_size/number_of_partitions = 0 — their parquet files aren't named part-*. Counts/breakdowns were unaffected. Fixed to count all *.parquet files; re-ran (run-002) and confirmed all 55 datasets now report non-zero file_size/number_of_partitions (e.g. disease 1 file / 7.3 MB, clinical_report 1 file / 74.9 MB).
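
The effect of the pattern change can be sketched with the standard library alone. The helper below is hypothetical: it takes a plain path-to-size mapping in place of the real storage listing, and the file names and sizes in the usage note are made up for illustration.

```python
from fnmatch import fnmatchcase
from posixpath import basename


def dataset_file_stats(listing: dict[str, int], pattern: str = "*.parquet") -> dict[str, int]:
    """Total size and file count for entries whose base name matches `pattern`."""
    sizes = [
        size
        for path, size in listing.items()
        if fnmatchcase(basename(path), pattern)
    ]
    return {"file_size": sum(sizes), "number_of_partitions": len(sizes)}
```

With a listing such as `{"output/disease/disease.parquet": 7_300_000}`, the old `part-*.parquet` pattern yields zero for both fields, while `*.parquet` counts the single file. Marker files such as `_SUCCESS` are ignored by both patterns.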

Sharing

A disposable converter (scripts/metrics_to_jsonl.py, uncommitted) reshapes the per-dataset parquet into one JSON object per dataset (nested breakdowns/filter_counts maps) for stakeholders.

d0choa added 3 commits May 21, 2026 18:16
Move count_parquet_rows, discover_dataset_paths and to_parquet_glob (plus their discovery sub-helpers) into pts.transformers.parquet_helpers. release_metrics and dataset_metrics now import them from the shared module, removing dataset_metrics' coupling to release_metrics' private helpers. No behaviour change; release_metrics tests updated to import and monkeypatch the new location.
…d configs

Redesign the profiler output to a long/tidy table (run, dataset, kind, metric, expression, group_value, value) written as one combined dataset_metrics.parquet instead of one file per dataset. Derive the run identifier from release_uri/work_path. Fail loud on malformed grouping/filter expressions (naming the dataset and metric) while skipping unreadable datasets with a warning. Hoist collect_schema to once per dataset. Declare fsspec as a direct dependency (DEP003) and bump version to 26.06.19.
…-ordered config matching

Address remaining code-review follow-ups: compute_filter_count now excludes nulls from distinct counts (a null is not a distinct value); _config_for_dataset resolves overlapping glob patterns by specificity (longest pattern first, then alphabetical) instead of alphabetical only; and _dataset_file_stats carries a comment noting fsspec and Otter's storage backend both authenticate via ambient ADC, so the listing is not a credential divergence. Adds tests for both behaviours.
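
A minimal sketch of the specificity ordering described in this commit message, assuming `fnmatch`-style globs and a plain dict of pattern to config. The function name drops the leading underscore to signal that it is illustrative and not the PR's `_config_for_dataset`.

```python
from fnmatch import fnmatchcase


def config_for_dataset(dataset: str, configs: dict):
    """Return the config of the most specific pattern matching `dataset`, or None."""
    # Longest pattern first; ties are broken alphabetically so the choice
    # is deterministic regardless of config ordering.
    ordered = sorted(configs, key=lambda pattern: (-len(pattern), pattern))
    for pattern in ordered:
        if fnmatchcase(dataset, pattern):
            return configs[pattern]
    return None
```

Pattern length is only a proxy for specificity, but it is enough to let a narrower key override a broad `evidence_*` entry.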
d0choa commented May 21, 2026

Contributor Author

Re-validated on Dataproc — combined long/tidy output

Re-ran the current branch on a single-node Dataproc cluster against 26.03-test5 (read-only, 55 datasets); the step completed in 57.6 s.

Output is now a single combined dataset_metrics.parquet in long/tidy form — schema run, dataset, kind, metric, expression, group_value, value, 381 rows across 55 datasets, run = 26.03-test5 (165 scalar + 215 grouping + 1 filter).
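
To make the long/tidy shape concrete, the sketch below regroups rows with the schema above into one JSON object per dataset, roughly what a JSONL export for stakeholders might look like. It is a hypothetical stand-in, not the uncommitted scripts/metrics_to_jsonl.py; the nested key names (`scalars`, `groupings`, `filters`) are assumptions, and rows are plain dicts rather than a Polars frame.

```python
import json
from collections import defaultdict


def tidy_to_jsonl(rows) -> str:
    """Regroup long/tidy metric rows into one JSON line per dataset."""
    datasets = defaultdict(lambda: {"scalars": {}, "groupings": {}, "filters": {}})
    for row in rows:
        entry = datasets[row["dataset"]]
        entry["run"] = row["run"]
        if row["kind"] == "scalar":
            entry["scalars"][row["metric"]] = row["value"]
        elif row["kind"] == "grouping":
            # One nested map per grouping metric: group_value -> value.
            entry["groupings"].setdefault(row["metric"], {})[row["group_value"]] = row["value"]
        elif row["kind"] == "filter":
            entry["filters"][row["metric"]] = row["value"]
    # The `expression` column is dropped here to keep the sketch short.
    return "\n".join(
        json.dumps({"dataset": name, **entry}, sort_keys=True)
        for name, entry in sorted(datasets.items())
    )
```

The long form stays the source of truth; the nested objects are only a presentation layer on top of it.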

Counts and breakdowns match the earlier per-dataset run (e.g. colocalisation 217,914,314; study.studyType eqtl 1,233,604 … pqtl 3,755), and the new expression column documents each metric (e.g. prioritised_genes: distinct geneId where score > 0.5 = 15,509). Written to a scratch bucket only; cluster torn down afterward.
