WIP: feat(metrics): implementation of metric tasks#134
Open
project-defiant wants to merge 32 commits into
Open
WIP: feat(metrics): implementation of metric tasks#134project-defiant wants to merge 32 commits into
project-defiant wants to merge 32 commits into
Conversation
Add standalone pts.metrics framework with abstract base classes and four concrete metric types: CountMetric, DistinctCountMetric, DistributionMetric, GroupedCountMetric, GroupedSumMetric. All 31 tests pass (TDD — tests written before implementation). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DiseaseByTherapeuticAreaMetric explodes the therapeuticAreas array and counts diseases per area. L2GSignificantGeneMetric counts distinct genes with score >= threshold. All 6 custom-metric tests pass (TDD). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reads parquet from dataset_path, computes each metric, stamps release/run
via model_copy, and writes JSON to metrics_root/{dataset}/{name}.json.
All 4 runner tests pass; 41/41 across the full metrics suite.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TransformSpec gains a metrics: list[Metric] field validated at construction time via _load_metric (discriminated union for standard types, importlib for custom). Transform.run() calls MetricRunner after the transformer completes when metrics are configured. Unknown types raise ValueError early. 46/46 tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add release/run to scratchpad and metrics declarations to the disease_efo transform step. Target, drug, association, and GWAS datasets are pyspark steps and need separate pyspark task integration before metrics can be added to those config entries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Allows running metric collection on already-produced datasets without re-running a transformer. Reads release/run from scratchpad. 4 TDD tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nfig MetricRunner now reads per-metric with scan_parquet + column projection, dropping memory usage from full-dataset loads (3-17 GB OOM) to only the needed columns. Each Metric declares required_columns; empty list = row count only. Also adds metrics.yaml and scripts/collect_metrics.py for standalone collection. Verified across 32 datasets, 62 JSON files. 50 tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e 9 metrics - Add UnifiedMetricRecord with flat envelope + result JSON blob so all metrics share a single parquet-compatible JSONL schema per dataset - Add pts.metrics.loader with MetricType enum and load_metric dispatcher - Add type: Literal[...] to each Metric subclass and model_dump() override on both Specs so otter's double-parse in TaskRegistry.build() round-trips cleanly - Add CollectMetrics task (standalone metrics-only pipeline step) - Complete Issue 9: add count_by_datasource to all 20 evidence datasets in config.yaml and metrics.yaml; rewrite metrics.yaml as proper otter config - Remove scripts/collect_metrics.py (superseded by pts -c metrics.yaml -s <step>) - Trim 28 redundant tests: pydantic-default assertions, duplicate loader tests, and overlapping unified-output tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CollectMetrics now handles all post-transform metric collection, so the metrics field, validator, model_dump override, and MetricRunner call in Transform are redundant. Also removes the inline metrics block from the transform disease_efo config step and deletes the now-obsolete test file. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Distribution metrics are not needed in the current pipeline. Removes the distribution.py module, its MetricType/loader registration, all 52 distribution blocks from config.yaml and metrics.yaml, and the test file. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- GroupedCountExplodeMetric: explodes list columns in group_by then counts per group; replaces DiseaseByTherapeuticAreaMetric custom class - CollectMetrics.__init__ now validates that 'release' and 'run' are present in the scratchpad and raises ValueError early if either is absent - metrics.yaml scratchpad requires explicit release/run values at run time - config.yaml gains a collect_metrics disease step using the new type - Removed pts.metrics.custom.disease (no longer needed) - Parametrized noisy validation and empty-result tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- test_metrics_base: 3→1 test (parametrize Metric + subclass abstractness) - test_metrics_count: 6→5 tests (parametrize row-count cases, drop isinstance) - test_metrics_grouped: 9→7 tests (merge same-data pairs for count and explode) - test_metrics_custom: 4→2 tests (parametrize L2G value cases) - test_collect_metrics: unify _make_context helpers, drop _run_task wrapper Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- D102: add missing docstrings to GroupedCountExplodeMetric.required_columns and .compute - PT006: use tuple form for pytest.mark.parametrize first arg in test_metrics_count and test_metrics_custom - I001: sort imports in test_metrics_base and test_metrics_grouped Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…_record() release, run, dataset, source, destination are envelope fields that belong to the runner, not the result. Removing them from MetricResult eliminates the empty-string placeholders that compute() methods were forced to pass. MetricRunner now supplies these directly to to_unified_record() instead of using model_copy(). No empty strings can leak through. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
total_count (row count) is now always prepended automatically by CollectMetricsSpec — no need to declare it in config.yaml or metrics.yaml. Removed all 32 explicit total_count entries from both configs; datasets that only had total_count now omit the metrics key entirely. Uses a model_validator(mode='after') so the injection fires even when metrics defaults to []. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…emove L2G custom metric
- Add `filter_expr: str | None` to `Metric` base class; applied via `pl.sql_expr()` in
`run()` before `compute()`, with all columns loaded when set
- Add `filter_expr` column to `UnifiedMetricRecord` and thread it through `to_unified_record()`
- Support SQL expressions with AS aliases in `group_by` lists for all grouped metrics
(e.g. `"concat('gwas-', studyType) as label"`); resolved via `_resolve_group_by()`
- Remove `required_columns` from grouped metrics (always read all columns)
- Delete `L2GSignificantGeneMetric` and the `custom/` directory; replace its YAML entry
with a `distinct_count` + `filter_expr: "score >= 0.5"` on both config.yaml and metrics.yaml
- Fix GCS path compatibility in `CollectMetrics.run()`: drop `pathlib.Path` wrapping so
`gs://` URIs are not mangled; extract dataset name with string operations
- Consolidate tests: move spec validation and filter_expr cases to doctests in source
modules; remove obsolete `test_bad_column_raises` (tests Polars, not our logic)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
We need a lightweight, reproducible way to compute and record key counts for every
dataset produced by the pipeline — total rows, distinct values, group distributions —
so we can track data quality and catch regressions across releases. This PR introduces
the pts.metrics framework and a standalone metrics.yaml config that covers all 32
datasets.
What's new
pts.metrics — metric definition framework
A Polars-based framework for declaring, computing, and serialising dataset metrics.
compute(df). run() guards against empty DataFrames and wraps with debug logging.
JSON blob into a flat UnifiedMetricRecord, which is written as one Parquet row per
metric.
grouped_sum.
MetricType.load(cfg) is the single constructor used throughout; adding a new metric
type is a one-liner in the enum.
loss).
**/*.parquet glob supports nested partitions.
CollectMetrics — otter task
parquet reads and Polars computations overlap.
dataset, source, and destination.
Config
pts -s -c metrics.yaml.
Testing
scratchpad key checks, YAML config validity, end-to-end Parquet output, relative path
resolution, and bad-column error propagation.
groups.
Notable design decisions
registration step.
serialises list[Metric] using only base-class fields, stripping subclass fields during
otter's internal model_dump round-trip.
concurrency without the complexity of wrapping sync code in asyncio.to_thread.
registers pts.tasks, which is where CollectMetrics lives.