Skip to content

WIP: feat(metrics): implementation of metric tasks#134

Open
project-defiant wants to merge 32 commits into
mainfrom
metrics
Open

WIP: feat(metrics): implementation of metric tasks#134
project-defiant wants to merge 32 commits into
mainfrom
metrics

Conversation

@project-defiant

@project-defiant project-defiant commented May 20, 2026

Copy link
Copy Markdown
Contributor

Motivation

We need a lightweight, reproducible way to compute and record key counts for every
dataset produced by the pipeline — total rows, distinct values, group distributions —
so we can track data quality and catch regressions across releases. This PR introduces
the pts.metrics framework and a standalone metrics.yaml config that covers all 32
datasets.


What's new

pts.metrics — metric definition framework

A Polars-based framework for declaring, computing, and serialising dataset metrics.

  • Metric (abstract) — Pydantic base; subclasses declare config fields and implement
    compute(df). run() guards against empty DataFrames and wraps with debug logging.
  • MetricResult (abstract) — to_unified_record() serialises metric-specific fields as a
    JSON blob into a flat UnifiedMetricRecord, which is written as one Parquet row per
    metric.
  • Built-in types — count, distinct_count, grouped_count, grouped_count_explode,
    grouped_sum.
  • Custom types — l2g_significant_gene (genes with L2G score ≥ threshold).
  • MetricType enum — member name = YAML type field, value = implementer class.
    MetricType.load(cfg) is the single constructor used throughout; adding a new metric
    type is a one-liner in the enum.
  • Null keys are preserved as their own group in all grouped metrics (no silent data
    loss).
  • Parquet is read with per-metric column projection to minimise memory use; recursive
    **/*.parquet glob supports nested partitions.

CollectMetrics — otter task

  • CollectMetricsSpec validates metric definitions at config-load time.
  • Metrics are run concurrently via ThreadPoolExecutor — one thread per metric, so
    parquet reads and Polars computations overlap.
  • Output is a single Parquet file with one row per metric, stamped with release, run,
    dataset, source, and destination.

Config

  • metrics.yaml — standalone otter config covering all 32 datasets. Run any step with
    pts -s -c metrics.yaml.
  • config.yaml — collect_metrics steps added alongside the existing transform steps.

Testing

  • Unit tests for EmptyDatasetError, abstract-class enforcement, spec validation,
    scratchpad key checks, YAML config validity, end-to-end Parquet output, relative path
    resolution, and bad-column error propagation.
  • Doctests in count.py and grouped.py covering single- and multi-column cases with null
    groups.
  • All 356 tests pass. All 32 metrics.yaml steps verified against real data.

Notable design decisions

  • MetricType enum over a registry dict — adding a new type is one line; no separate
    registration step.
  • field_serializer on CollectMetricsSpec.metrics — necessary because Pydantic v2
    serialises list[Metric] using only base-class fields, stripping subclass fields during
    otter's internal model_dump round-trip.
  • ThreadPoolExecutor over asyncio — Polars has no async API; threading gives real
    concurrency without the complexity of wrapping sync code in asyncio.to_thread.
  • pts CLI, not otter — the otter CLI only registers built-in task types; pts also
    registers pts.tasks, which is where CollectMetrics lives.

SzymonSzyszkowski and others added 27 commits May 20, 2026 12:18
Add standalone pts.metrics framework with abstract base classes and
four concrete metric types: CountMetric, DistinctCountMetric,
DistributionMetric, GroupedCountMetric, GroupedSumMetric.
All 31 tests pass (TDD — tests written before implementation).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DiseaseByTherapeuticAreaMetric explodes the therapeuticAreas array and
counts diseases per area. L2GSignificantGeneMetric counts distinct genes
with score >= threshold. All 6 custom-metric tests pass (TDD).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reads parquet from dataset_path, computes each metric, stamps release/run
via model_copy, and writes JSON to metrics_root/{dataset}/{name}.json.
All 4 runner tests pass; 41/41 across the full metrics suite.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TransformSpec gains a metrics: list[Metric] field validated at construction
time via _load_metric (discriminated union for standard types, importlib for
custom). Transform.run() calls MetricRunner after the transformer completes
when metrics are configured. Unknown types raise ValueError early. 46/46 tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add release/run to scratchpad and metrics declarations to the disease_efo
transform step. Target, drug, association, and GWAS datasets are pyspark
steps and need separate pyspark task integration before metrics can be
added to those config entries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Allows running metric collection on already-produced datasets without
re-running a transformer. Reads release/run from scratchpad. 4 TDD tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nfig

MetricRunner now reads per-metric with scan_parquet + column projection,
dropping memory usage from full-dataset loads (3-17 GB OOM) to only the
needed columns. Each Metric declares required_columns; empty list = row
count only. Also adds metrics.yaml and scripts/collect_metrics.py for
standalone collection. Verified across 32 datasets, 62 JSON files. 50 tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e 9 metrics

- Add UnifiedMetricRecord with flat envelope + result JSON blob so all metrics
  share a single parquet-compatible JSONL schema per dataset
- Add pts.metrics.loader with MetricType enum and load_metric dispatcher
- Add type: Literal[...] to each Metric subclass and model_dump() override on
  both Specs so otter's double-parse in TaskRegistry.build() round-trips cleanly
- Add CollectMetrics task (standalone metrics-only pipeline step)
- Complete Issue 9: add count_by_datasource to all 20 evidence datasets in
  config.yaml and metrics.yaml; rewrite metrics.yaml as proper otter config
- Remove scripts/collect_metrics.py (superseded by pts -c metrics.yaml -s <step>)
- Trim 28 redundant tests: pydantic-default assertions, duplicate loader tests,
  and overlapping unified-output tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CollectMetrics now handles all post-transform metric collection, so the
metrics field, validator, model_dump override, and MetricRunner call in
Transform are redundant. Also removes the inline metrics block from the
transform disease_efo config step and deletes the now-obsolete test file.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Distribution metrics are not needed in the current pipeline. Removes
the distribution.py module, its MetricType/loader registration, all 52
distribution blocks from config.yaml and metrics.yaml, and the test file.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- GroupedCountExplodeMetric: explodes list columns in group_by then counts
  per group; replaces DiseaseByTherapeuticAreaMetric custom class
- CollectMetrics.__init__ now validates that 'release' and 'run' are
  present in the scratchpad and raises ValueError early if either is absent
- metrics.yaml scratchpad requires explicit release/run values at run time
- config.yaml gains a collect_metrics disease step using the new type
- Removed pts.metrics.custom.disease (no longer needed)
- Parametrized noisy validation and empty-result tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- test_metrics_base: 3→1 test (parametrize Metric + subclass abstractness)
- test_metrics_count: 6→5 tests (parametrize row-count cases, drop isinstance)
- test_metrics_grouped: 9→7 tests (merge same-data pairs for count and explode)
- test_metrics_custom: 4→2 tests (parametrize L2G value cases)
- test_collect_metrics: unify _make_context helpers, drop _run_task wrapper

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- D102: add missing docstrings to GroupedCountExplodeMetric.required_columns and .compute
- PT006: use tuple form for pytest.mark.parametrize first arg in test_metrics_count and test_metrics_custom
- I001: sort imports in test_metrics_base and test_metrics_grouped

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…_record()

release, run, dataset, source, destination are envelope fields that belong
to the runner, not the result. Removing them from MetricResult eliminates
the empty-string placeholders that compute() methods were forced to pass.
MetricRunner now supplies these directly to to_unified_record() instead of
using model_copy(). No empty strings can leak through.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@project-defiant project-defiant changed the title Metrics feat(metrics): implementation of metric tasks May 21, 2026
@project-defiant project-defiant marked this pull request as ready for review May 21, 2026 00:14
total_count (row count) is now always prepended automatically by
CollectMetricsSpec — no need to declare it in config.yaml or metrics.yaml.
Removed all 32 explicit total_count entries from both configs; datasets
that only had total_count now omit the metrics key entirely.

Uses a model_validator(mode='after') so the injection fires even when
metrics defaults to [].

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SzymonSzyszkowski and others added 4 commits May 21, 2026 10:43
…emove L2G custom metric

- Add `filter_expr: str | None` to `Metric` base class; applied via `pl.sql_expr()` in
  `run()` before `compute()`, with all columns loaded when set
- Add `filter_expr` column to `UnifiedMetricRecord` and thread it through `to_unified_record()`
- Support SQL expressions with AS aliases in `group_by` lists for all grouped metrics
  (e.g. `"concat('gwas-', studyType) as label"`); resolved via `_resolve_group_by()`
- Remove `required_columns` from grouped metrics (always read all columns)
- Delete `L2GSignificantGeneMetric` and the `custom/` directory; replace its YAML entry
  with a `distinct_count` + `filter_expr: "score >= 0.5"` on both config.yaml and metrics.yaml
- Fix GCS path compatibility in `CollectMetrics.run()`: drop `pathlib.Path` wrapping so
  `gs://` URIs are not mangled; extract dataset name with string operations
- Consolidate tests: move spec validation and filter_expr cases to doctests in source
  modules; remove obsolete `test_bad_column_raises` (tests Polars, not our logic)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@project-defiant project-defiant changed the title feat(metrics): implementation of metric tasks WIP: feat(metrics): implementation of metric tasks May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants