feat(dataset_metrics): config-driven dataset profiler step for the metric page#135
d0choa wants to merge 12 commits
…g exc detail, distinct guard)
797afed to aa410eb
Spark outputs use part-*.parquet but transformer-written datasets (disease, clinical_report, go, so, ...) are single non-part-* parquet files. The previous part-* filter reported 0 file_size/number_of_partitions for those 8 datasets even though counts were correct. Match *.parquet files instead.
✅ Validation run on Dataproc (
| dataset | count |
|---|---|
| target | 78,691 |
| disease | 47,030 |
| drug_molecule | 22,230 |
| study | 2,001,827 |
| credible_set | 3,491,182 |
| variant | 7,432,549 |
| colocalisation | 217,914,314 |
| clinical_report | 285,213 |
| association_overall_direct | 4,508,002 |
| association_overall_indirect | 12,466,856 |
| Σ evidence_* (20 sources) | 34,086,838 |
| l2g_prediction · prioritised_genes (L2G>0.5, distinct gene) | 15,509 |
Example breakdowns
- study.studyType: eqtl 1,233,604 · tuqtl 364,196 · sqtl 213,952 · gwas 136,245 · sceqtl 50,075 · pqtl 3,755
- colocalisation.studyTypePair: gwas-gwas 167,048,267 · gwas-eqtl 28,944,408 · gwas-tuqtl 8,500,302 · gwas-pqtl 6,264,043 · gwas-sqtl 4,706,298 · gwas-sceqtl 2,450,996
- association_by_datatype_indirect.datatype: literature 8,246,446 · genetic_association 2,823,514 · animal_model 2,461,182 · somatic_mutation 406,376 · known_drug 259,530 · rna_expression 166,503 · genetic_literature 123,072 · affected_pathway 80,993
- disease.therapeuticArea: all 25 therapeutic areas populated.
_dataset_file_stats fix (latest commit)
The first run surfaced that 8 transformer-written datasets (disease, clinical_report, clinical_indication, clinical_target, disease_hpo, disease_phenotype, go, so) reported file_size/number_of_partitions = 0 — their parquet files aren't named part-*. Counts/breakdowns were unaffected. Fixed to count all *.parquet files; re-ran (run-002) and confirmed all 55 datasets now report non-zero file_size/number_of_partitions (e.g. disease 1 file / 7.3 MB, clinical_report 1 file / 74.9 MB).
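The fix described above can be illustrated with a minimal sketch. It assumes a local directory and uses `pathlib` for the listing, whereas the PR lists remote storage through fsspec; the function name `dataset_file_stats` and the recursive search are illustrative assumptions, not the PR's code.

```python
from pathlib import Path


def dataset_file_stats(dataset_dir: Path) -> dict[str, int]:
    """Return total byte size and file count of all parquet files in a dataset.

    Hypothetical local stand-in for the PR's _dataset_file_stats helper.
    """
    # Match every *.parquet file, not only Spark-style part-*.parquet outputs,
    # so single-file transformer-written datasets are counted too.
    files = [p for p in dataset_dir.rglob("*.parquet") if p.is_file()]
    return {
        "file_size": sum(p.stat().st_size for p in files),
        "number_of_partitions": len(files),
    }
```

With a `part-*` filter, a dataset holding only `disease.parquet` would report zero for both fields; the `*.parquet` pattern covers both naming schemes while still ignoring marker files such as `_SUCCESS`.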
Sharing
A disposable converter (scripts/metrics_to_jsonl.py, uncommitted) reshapes the per-dataset parquet into one JSON object per dataset (nested breakdowns/filter_counts maps) for stakeholders.
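Since the converter itself is uncommitted, the following is only a guess at its shape: a sketch that takes already-loaded per-dataset profiles as plain dicts and emits one JSON object per line. The function name `profiles_to_jsonl` and the dict layout are assumptions; reading the parquet files is left out.

```python
import json
from typing import Any, Iterable


def profiles_to_jsonl(profiles: Iterable[dict[str, Any]]) -> str:
    """Serialise per-dataset profiles as JSON Lines (one object per dataset)."""
    lines = []
    for profile in profiles:
        record = {
            "id": profile["id"],
            "count": profile["count"],
            # Nested maps: metric name -> {group value -> count} / metric name -> count.
            "breakdowns": profile.get("breakdowns") or {},
            "filter_counts": profile.get("filter_counts") or {},
        }
        lines.append(json.dumps(record, sort_keys=True))
    return "\n".join(lines) + ("\n" if lines else "")


# Example using counts reported in the validation run above.
example = profiles_to_jsonl(
    [
        {
            "id": "study",
            "count": 2001827,
            "breakdowns": {"studyType": {"gwas": 136245, "pqtl": 3755}},
        }
    ]
)
```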
Move count_parquet_rows, discover_dataset_paths and to_parquet_glob (plus their discovery sub-helpers) into pts.transformers.parquet_helpers. release_metrics and dataset_metrics now import them from the shared module, removing dataset_metrics' coupling to release_metrics' private helpers. No behaviour change; release_metrics tests updated to import and monkeypatch the new location.
…d configs

Redesign the profiler output to a long/tidy table (run, dataset, kind, metric, expression, group_value, value) written as one combined dataset_metrics.parquet instead of one file per dataset. Derive the run identifier from release_uri/work_path. Fail loud on malformed grouping/filter expressions (naming the dataset and metric) while skipping unreadable datasets with a warning. Hoist collect_schema to once per dataset. Declare fsspec as a direct dependency (DEP003) and bump version to 26.06.19.
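To make the long/tidy layout concrete, here is a small sketch that flattens one dataset's profile into rows with the seven columns named in the commit message. The `kind` labels ("count", "breakdown", "filter_count"), the input shapes, and the function name `tidy_rows` are assumptions for illustration; the commit does not spell them out.

```python
from typing import Any, Optional

COLUMNS = ("run", "dataset", "kind", "metric", "expression", "group_value", "value")


def tidy_rows(
    run: str,
    dataset: str,
    count: int,
    breakdowns: dict[str, tuple[str, dict[str, int]]],
    filter_counts: dict[str, tuple[str, int]],
) -> list[dict[str, Any]]:
    """Flatten one dataset profile into long/tidy rows (one value per row)."""

    def row(kind: str, metric: str, expression: Optional[str],
            group_value: Optional[str], value: int) -> dict[str, Any]:
        return dict(zip(COLUMNS, (run, dataset, kind, metric, expression, group_value, value)))

    rows = [row("count", "count", None, None, count)]
    # One row per group value of each breakdown.
    for metric, (expression, groups) in breakdowns.items():
        for group_value, value in groups.items():
            rows.append(row("breakdown", metric, expression, group_value, value))
    # One row per filter count; there is no group value.
    for metric, (expression, value) in filter_counts.items():
        rows.append(row("filter_count", metric, expression, None, value))
    return rows


rows = tidy_rows(
    run="26.03-test5",
    dataset="study",
    count=2001827,
    breakdowns={"studyType": ("studyType", {"gwas": 136245, "pqtl": 3755})},
    filter_counts={},
)
```

A single table of this shape lets every dataset share one schema regardless of which breakdowns it configures, which is what makes one combined file practical.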
…-ordered config matching

Address remaining code-review follow-ups: compute_filter_count now excludes nulls from distinct counts (a null is not a distinct value); _config_for_dataset resolves overlapping glob patterns by specificity (longest pattern first, then alphabetical) instead of alphabetical only; and _dataset_file_stats carries a comment noting fsspec and Otter's storage backend both authenticate via ambient ADC, so the listing is not a credential divergence. Adds tests for both behaviours.
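Both behaviours can be sketched in a few lines of standard-library Python. The function names and the use of `fnmatch` for glob matching are assumptions made for illustration; the PR's `_config_for_dataset` and `compute_filter_count` may differ in detail (the latter operates on Polars data rather than Python lists).

```python
from fnmatch import fnmatchcase
from typing import Any, Iterable, Optional


def config_for_dataset(dataset: str, configs: dict[str, Any]) -> Optional[Any]:
    """Return the config of the most specific glob pattern matching `dataset`."""
    # Longest pattern first, ties broken alphabetically, so a narrow pattern
    # wins over a broad one when both match.
    for pattern in sorted(configs, key=lambda p: (-len(p), p)):
        if fnmatchcase(dataset, pattern):
            return configs[pattern]
    return None


def count_distinct_non_null(values: Iterable[Any]) -> int:
    """Count distinct values, treating None as missing rather than as a value."""
    return len({v for v in values if v is not None})


configs = {"evidence_*": "generic", "evidence_c*": "narrower", "evidence_chembl": "exact"}
```

Sorting alphabetically alone would try `evidence_*` before `evidence_chembl`, so the broad pattern would shadow the specific one; ordering by length first avoids that.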
Re-validated on Dataproc: combined long/tidy output
Re-ran the current branch on a single-node Dataproc cluster against […]. Output is now a single combined […]. Counts and breakdowns match the earlier per-dataset run (e.g. […]
Summary
New Polars transformer dataset_metrics (sibling of release_metrics) that profiles every output/* dataset into one parquet per dataset under a root-level metrics/ directory, for the Platform/PPP metric page (opentargets/issues#4328).

Each per-dataset profile row carries:
- id, count (read from parquet footers, no data scan), file_size, number_of_partitions
- breakdowns: SQL-expression groupings (pl.sql_expr), with list columns auto-exploded
- filter_counts: SQL filter, optional distinct column

The dataset_metrics step is registered in config.yaml with the metric-page groupings/filters: study (studyType, projectId), credible_set, variant (most-severe consequence), colocalisation (study-type pair), drug_molecule (clinical stage), clinical_report (clinical stage), disease (therapeutic area), l2g_prediction (prioritised genes, L2G>0.5 distinct gene), associations by datasource/datatype (direct & indirect), and an evidence_* pattern key for datatype.

Discovery/count helpers are reused from release_metrics; release_metrics and its HuggingFace pipeline are untouched. A disposable (uncommitted) scripts/metrics_to_jsonl.py converts the output to shareable JSONL.

Test plan
- uv run pytest test/test_dataset_metrics.py --doctest-modules src/pts/transformers/dataset_metrics.py → 16 passed
- uv run ruff check src/pts/transformers/dataset_metrics.py test/test_dataset_metrics.py → clean
- gs://open-targets-pipeline-runs/ds/26.03-test5/: study count 2,001,827 (+ studyType/projectId breakdowns); colocalisation 217,914,314 (studyTypePair: gwas-gwas 167M, gwas-eqtl 28.9M, …); l2g prioritised_genes (score>0.5 distinct geneId) 15,509
- pts --step dataset_metrics on Dataproc against 26.03-test5 → writes metrics/<dataset>.parquet; export JSONL for stakeholders