feat(dataset_metrics): config-driven dataset profiler step for the metric page by d0choa · Pull Request #135 · opentargets/pts

d0choa · 2026-05-20T15:15:22Z

Summary

New Polars transformer dataset_metrics (sibling of release_metrics) that profiles every output/* dataset into one parquet per dataset under a root-level metrics/ directory, for the Platform/PPP metric page (opentargets/issues#4328).

Each per-dataset profile row carries:

id, count (read from parquet footers — no data scan), file_size, number_of_partitions
config-driven breakdowns — SQL-expression groupings (pl.sql_expr), with list columns auto-exploded
config-driven filter_counts — SQL filter, optional distinct column

The dataset_metrics step is registered in config.yaml with the metric-page groupings/filters: study (studyType, projectId), credible_set, variant (most-severe consequence), colocalisation (study-type pair), drug_molecule (clinical stage), clinical_report (clinical stage), disease (therapeutic area), l2g_prediction (prioritised genes, L2G>0.5 distinct gene), associations by datasource/datatype (direct & indirect), and an evidence_* pattern key for datatype.

Discovery/count helpers are reused from release_metrics; release_metrics and its HuggingFace pipeline are untouched. A disposable (uncommitted) scripts/metrics_to_jsonl.py converts the output to shareable JSONL.

Test plan

uv run pytest test/test_dataset_metrics.py --doctest-modules src/pts/transformers/dataset_metrics.py → 16 passed
uv run ruff check src/pts/transformers/dataset_metrics.py test/test_dataset_metrics.py → clean
Validated read-only against gs://open-targets-pipeline-runs/ds/26.03-test5/: study count 2,001,827 (+ studyType/projectId breakdowns); colocalisation 217,914,314 (studyTypePair: gwas-gwas 167M, gwas-eqtl 28.9M, …); l2g prioritised_genes (score>0.5 distinct geneId) 15,509
Run pts --step dataset_metrics on Dataproc against 26.03-test5 → writes metrics/<dataset>.parquet; export JSONL for stakeholders


Spark outputs use part-*.parquet but transformer-written datasets (disease, clinical_report, go, so, ...) are single non-part-* parquet files. The previous part-* filter reported 0 file_size/number_of_partitions for those 8 datasets even though counts were correct. Match *.parquet files instead.
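The difference between the two filename filters can be illustrated with a small sketch. It assumes the storage listing is available as a plain mapping from object path to size in bytes; the function name, the `pattern` parameter, and the example paths are hypothetical, and the real `_dataset_file_stats` goes through a storage backend rather than a dict.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def dataset_file_stats(listing: dict[str, int], pattern: str = "*.parquet") -> tuple[int, int]:
    """Return (file_size, number_of_partitions) for one dataset's file listing.

    `listing` maps object path -> size in bytes (a hypothetical stand-in for
    the storage listing).
    """
    sizes = [
        size
        for path, size in listing.items()
        if fnmatch(PurePosixPath(path).name, pattern)
    ]
    return sum(sizes), len(sizes)


# Spark-style output: several part-* files plus a marker file.
spark_listing = {
    "output/study/part-00000.parquet": 100,
    "output/study/part-00001.parquet": 50,
    "output/study/_SUCCESS": 0,
}
# Transformer-style output: a single parquet file not named part-*.
flat_listing = {"output/disease/disease.parquet": 7}

print(dataset_file_stats(flat_listing, "part-*.parquet"))  # old filter misses the file
print(dataset_file_stats(flat_listing))                    # new filter counts it
```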

d0choa · 2026-05-20T15:49:54Z

✅ Validation run on Dataproc (`26.03-test5`)

Ran the new dataset_metrics step end-to-end on Dataproc against the 26.03-test5 run (read-only inputs), with the branch installed from this PR.

Setup

Cluster: single-node Dataproc (--single-node, image 2.2, n2d-highmem-16). dataset_metrics is a single-node Polars transformer — no Spark workers needed. Branch installed via init action (pts @ git+…@feat/dataset-metrics-profiler).
Command: pts -s dataset_metrics -c dataset_metrics.yaml -r gs://open-targets-pipeline-runs/ds/26.03-test5
Inputs (read-only): gs://open-targets-pipeline-runs/ds/26.03-test5/output/* — 55 datasets discovered.
Outputs: gs://ot-team/ochoa/dataset_metrics/run-002/metrics/ — one parquet per dataset (55 files), scratch location (nothing written to the shared pipeline-runs bucket).

Runtime

Profiled all 55 datasets in ~54 s (transform task; step completed in 56.6 s, ~64 s job wall-clock on the warm node).
Counts/sizes/partitions are footer + storage-listing (no data scan); only the configured low-cardinality groupings/filters touch column data. The largest dataset, colocalisation (200 files, 17.5 GB, 218 M rows), groups by rightStudyType reading only its ~67 MB column.

Headline counts (from the per-dataset profiles)

| dataset | count |
| --- | ---: |
| target | 78,691 |
| disease | 47,030 |
| drug_molecule | 22,230 |
| study | 2,001,827 |
| credible_set | 3,491,182 |
| variant | 7,432,549 |
| colocalisation | 217,914,314 |
| clinical_report | 285,213 |
| association_overall_direct | 4,508,002 |
| association_overall_indirect | 12,466,856 |
| Σ evidence_* (20 sources) | 34,086,838 |
| l2g_prediction · prioritised_genes (L2G>0.5, distinct gene) | 15,509 |

Example breakdowns

study.studyType: eqtl 1,233,604 · tuqtl 364,196 · sqtl 213,952 · gwas 136,245 · sceqtl 50,075 · pqtl 3,755
colocalisation.studyTypePair: gwas-gwas 167,048,267 · gwas-eqtl 28,944,408 · gwas-tuqtl 8,500,302 · gwas-pqtl 6,264,043 · gwas-sqtl 4,706,298 · gwas-sceqtl 2,450,996
association_by_datatype_indirect.datatype: literature 8,246,446 · genetic_association 2,823,514 · animal_model 2,461,182 · somatic_mutation 406,376 · known_drug 259,530 · rna_expression 166,503 · genetic_literature 123,072 · affected_pathway 80,993
disease.therapeuticArea: all 25 therapeutic areas populated.

`_dataset_file_stats` fix (latest commit)

The first run surfaced that 8 transformer-written datasets (disease, clinical_report, clinical_indication, clinical_target, disease_hpo, disease_phenotype, go, so) reported file_size/number_of_partitions = 0 — their parquet files aren't named part-*. Counts/breakdowns were unaffected. Fixed to count all *.parquet files; re-ran (run-002) and confirmed all 55 datasets now report non-zero file_size/number_of_partitions (e.g. disease 1 file / 7.3 MB, clinical_report 1 file / 74.9 MB).

Sharing

A disposable converter (scripts/metrics_to_jsonl.py, uncommitted) reshapes the per-dataset parquet into one JSON object per dataset (nested breakdowns/filter_counts maps) for stakeholders.
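The reshaping step might look roughly like this. The converter itself is uncommitted, so everything here is an assumption: the input is taken to be a profile dict whose `breakdowns` and `filter_counts` are lists of structs with hypothetical `metric`, `group_value` and `value` fields.

```python
import json


def profile_to_json_line(profile: dict) -> str:
    """Turn one dataset profile into a single JSON line with nested maps (sketch)."""
    # Nest breakdowns as {metric: {group_value: value}}.
    breakdowns: dict[str, dict[str, int]] = {}
    for entry in profile.get("breakdowns", []):
        breakdowns.setdefault(entry["metric"], {})[entry["group_value"]] = entry["value"]
    # Filter counts have no grouping, so they flatten to {metric: value}.
    filter_counts = {e["metric"]: e["value"] for e in profile.get("filter_counts", [])}
    record = {key: profile[key] for key in ("id", "count", "file_size", "number_of_partitions")}
    record["breakdowns"] = breakdowns
    record["filter_counts"] = filter_counts
    return json.dumps(record)
```

Writing one such line per dataset gives a JSONL file that can be read without parquet tooling.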

Move count_parquet_rows, discover_dataset_paths and to_parquet_glob (plus their discovery sub-helpers) into pts.transformers.parquet_helpers. release_metrics and dataset_metrics now import them from the shared module, removing dataset_metrics' coupling to release_metrics' private helpers. No behaviour change; release_metrics tests updated to import and monkeypatch the new location.

…d configs

Redesign the profiler output to a long/tidy table (run, dataset, kind, metric, expression, group_value, value) written as one combined dataset_metrics.parquet instead of one file per dataset. Derive the run identifier from release_uri/work_path. Fail loud on malformed grouping/filter expressions (naming the dataset and metric) while skipping unreadable datasets with a warning. Hoist collect_schema to once per dataset. Declare fsspec as a direct dependency (DEP003) and bump version to 26.06.19.
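To make the long/tidy layout concrete, here is a small sketch of how one dataset's metrics could be flattened into rows with the seven columns named above. The column names come from the commit message; the helper names, the input shapes, and the rule that the run identifier is the last path segment of the release URI are assumptions for illustration.

```python
from typing import NamedTuple, Optional


class MetricRow(NamedTuple):
    run: str
    dataset: str
    kind: str  # "scalar", "grouping" or "filter"
    metric: str
    expression: Optional[str]
    group_value: Optional[str]
    value: int


def derive_run_id(release_uri: str) -> str:
    # Assumption: the run identifier is the last non-empty path segment.
    return release_uri.rstrip("/").rsplit("/", 1)[-1]


def tidy_rows(
    run: str,
    dataset: str,
    scalars: dict[str, int],
    breakdowns: dict[str, tuple[str, dict[str, int]]],
    filters: dict[str, tuple[str, int]],
) -> list[MetricRow]:
    """Flatten one dataset's metrics into long/tidy rows."""
    rows = [MetricRow(run, dataset, "scalar", name, None, None, value) for name, value in scalars.items()]
    for metric, (expression, groups) in breakdowns.items():
        rows += [
            MetricRow(run, dataset, "grouping", metric, expression, group, n)
            for group, n in groups.items()
        ]
    for metric, (expression, value) in filters.items():
        rows.append(MetricRow(run, dataset, "filter", metric, expression, None, value))
    return rows
```

A tidy layout like this lets every dataset share one schema, so adding a new grouping adds rows rather than columns.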

…-ordered config matching

Address remaining code-review follow-ups: compute_filter_count now excludes nulls from distinct counts (a null is not a distinct value); _config_for_dataset resolves overlapping glob patterns by specificity (longest pattern first, then alphabetical) instead of alphabetical only; and _dataset_file_stats carries a comment noting fsspec and Otter's storage backend both authenticate via ambient ADC, so the listing is not a credential divergence. Adds tests for both behaviours.
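The specificity ordering can be sketched as below. The ordering rule (longest pattern first, then alphabetical) is taken from the commit message; the function name without the leading underscore, the first-match-wins behaviour, and the `evidence_gwas_*` pattern are assumptions made for the example.

```python
from fnmatch import fnmatchcase
from typing import Optional


def config_for_dataset(dataset: str, configs: dict[str, dict]) -> Optional[dict]:
    """Return the config of the most specific pattern matching `dataset` (sketch)."""
    # Longest pattern first; ties are broken alphabetically.
    ordered = sorted(configs, key=lambda pattern: (-len(pattern), pattern))
    for pattern in ordered:
        if fnmatchcase(dataset, pattern):
            return configs[pattern]
    return None


configs = {
    "evidence_*": {"tag": "generic"},
    "evidence_gwas_*": {"tag": "specific"},  # hypothetical overlapping pattern
    "study": {"tag": "study"},
}
print(config_for_dataset("evidence_gwas_credible_sets", configs))
```

With alphabetical ordering alone, `evidence_*` would sort before `evidence_gwas_*` and shadow it; sorting by length first lets the narrower pattern win.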

d0choa · 2026-05-21T18:27:11Z

Re-validated on Dataproc — combined long/tidy output

Re-ran the current branch on a single-node Dataproc cluster against 26.03-test5 (read-only, 55 datasets); the step completed in 57.6 s.

Output is now a single combined dataset_metrics.parquet in long/tidy form with schema (run, dataset, kind, metric, expression, group_value, value): 381 rows across 55 datasets (165 scalar + 215 grouping + 1 filter), all with run = 26.03-test5.

Counts and breakdowns match the earlier per-dataset run (e.g. colocalisation 217,914,314; study.studyType eqtl 1,233,604 … pqtl 3,755), and the new expression column documents each metric (e.g. prioritised_genes → distinct geneId where score > 0.5 = 15,509). Written to a scratch bucket only; cluster torn down afterward.

d0choa added 8 commits May 20, 2026 16:19

feat(dataset_metrics): add compute_breakdown grouping helper

ed089d4

feat(dataset_metrics): add compute_filter_count helper

2d79600

feat(dataset_metrics): add storage-agnostic file size and partition count

2d1fb96

feat(dataset_metrics): add output schema and list[struct] shaping

09c3d68

feat(dataset_metrics): add profile_dataset row assembly

254d9fd

feat(dataset_metrics): add transformer entry point with discovery

79184d8

refactor(dataset_metrics): address code review (flat-dataset note, log exc detail, distinct guard)

7aa7fc4

feat(dataset_metrics): register dataset_metrics step in config

aa410eb

d0choa force-pushed the feat/dataset-metrics-profiler branch from 797afed to aa410eb on May 20, 2026 15:20

d0choa added 3 commits May 21, 2026 18:16

d0choa wants to merge 12 commits into main from feat/dataset-metrics-profiler
