
perf(summary): use dataset-record fields when /document-class-counts times out #104

Open. audriB wants to merge 1 commit into `main` from `perf/use-record-fields-on-counts-timeout`

audriB commented Apr 27, 2026

Why

Smoke-test pass after PRs #101/#102/#103 found that for the largest production datasets (Jess Haley 78k docs, Sophie Griswold 101k docs) the `/document-class-counts` endpoint still exhausts its 20s stage-1 deadline on every synthesis attempt. The cloud's Mongo aggregation on `className` without proper indexes simply doesn't fit in the budget.

Net effect: every cron warm cycle landed a degraded summary with all zero counts AND null per-class facts (because `counts.subjects=0` short-circuited stage 2's openminds_subject fanout). The DatasetSummaryCard rendered "0 sessions, 0 subjects, …, Not applicable" everywhere, even though the dataset record itself reports `numberOfSubjects=1656` and `documentCount=78687`.

The frontend band-aid (Waltham-Data-Science/ndi-cloud-app#91) papers over this on the rendering side by enriching the degraded summary with record fields. This PR is the real fix at the backend: when stage-1 counts times out, synthesize a counts envelope from the dataset record's raw fields and gate stage 2 from those fields directly. Stage 2 still attempts per-class fanouts; the species/brainRegions/strains/sexes ndiqueries succeed in isolation (only `/document-class-counts` is pathologically slow on these datasets).
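A minimal sketch of the fallback described above, not the code in this PR. The function names `counts_from_record` and `should_fan_out_subjects` are hypothetical, and the sketch assumes the record fields are already integers or missing.

```python
from typing import Any, Dict, Mapping

COUNT_KEYS = ("sessions", "subjects", "probes", "elements", "epochs", "totalDocuments")


def counts_from_record(record: Mapping[str, Any]) -> Dict[str, int]:
    """Build a counts envelope from the dataset record (hypothetical helper)."""
    counts = dict.fromkeys(COUNT_KEYS, 0)
    # The record only exposes these two totals; per-class counts stay 0.
    counts["subjects"] = int(record.get("numberOfSubjects") or 0)
    counts["totalDocuments"] = int(record.get("documentCount") or 0)
    return counts


def should_fan_out_subjects(counts: Mapping[str, int]) -> bool:
    """Gate the stage-2 subject fanout on the (possibly synthesized) counts."""
    return counts.get("subjects", 0) > 0


# Record fields quoted in this PR for the Jess Haley dataset.
record = {"numberOfSubjects": 1656, "documentCount": 78687}
fallback = counts_from_record(record)
run_subject_fanout = should_fan_out_subjects(fallback)
```

With the synthesized envelope, `subjects` is non-zero, so the gate that previously short-circuited stage 2 now lets the fanout run.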

Before vs after

```
Before:
counts: { sessions: 0, subjects: 0, probes: 0, elements: 0,
epochs: 0, totalDocuments: 0 }
species: null ← stage 2 short-circuited
brainRegions: null ← stage 2 short-circuited
extractionWarnings: ["class counts query failed: ..."]

After (Jess Haley):
counts: { sessions: 0, subjects: 1656, probes: 0, elements: 0,
epochs: 0, totalDocuments: 78687 }
species: [Caenorhabditis elegans, Escherichia coli] ← REAL stage-2 data
brainRegions: [whole nervous system] ← REAL stage-2 data
strains/sexes: real data when available
extractionWarnings: ["class counts query failed: ..."]
```

Worst-case timing budget

| Stage | Time |
| --- | --- |
| stage 1 (counts + dataset metadata, parallel) | 20s (counts hits deadline; metadata succeeds fast) |
| stage 2 (3 per-class ndiqueries in parallel) | 25s (each bounded by per-class deadline) |
| ontology resolution | ~2s |
| total | ~47s (well under Railway's 88s ceiling) |
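The budget above can be sketched as follows. This is an illustration under assumptions, not the backend's actual code: `with_deadline`, `run_stage_1`, and the constant names are hypothetical, and the real stages may be wired differently. It shows how a stage-1 timeout can surface as `None` while the parallel record fetch still returns.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional, Tuple

# Figures taken from the worst-case budget above.
STAGE_1_DEADLINE_S = 20.0
STAGE_2_DEADLINE_S = 25.0
ONTOLOGY_S = 2.0
PLATFORM_CEILING_S = 88.0


def worst_case_seconds() -> float:
    """Stages run one after another, so the worst case is their sum."""
    return STAGE_1_DEADLINE_S + STAGE_2_DEADLINE_S + ONTOLOGY_S


async def with_deadline(aw: Awaitable[Any], seconds: float) -> Optional[Any]:
    """Await `aw`; return None instead of raising when the deadline passes."""
    try:
        return await asyncio.wait_for(aw, timeout=seconds)
    except asyncio.TimeoutError:
        return None


async def run_stage_1(
    fetch_counts: Callable[[], Awaitable[Any]],
    fetch_record: Callable[[], Awaitable[Any]],
    deadline_s: float = STAGE_1_DEADLINE_S,
) -> Tuple[Optional[Any], Optional[Any]]:
    """Run both stage-1 calls in parallel, each bounded by the same deadline."""
    counts, record = await asyncio.gather(
        with_deadline(fetch_counts(), deadline_s),
        with_deadline(fetch_record(), deadline_s),
    )
    return counts, record
```

Because both calls share one deadline and run concurrently, a slow counts call costs at most the stage-1 deadline rather than adding to the record fetch time.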

Caveat

Per-class counts (sessions/probes/elements/epochs) stay 0 when stage-1 counts times out, because the dataset record doesn't expose them. They display as 0, with the "X warnings" tooltip explaining the underlying cause. A future iteration could read them from the `total` field of `/dataset/:id/documents?class=...&pageSize=1` response envelopes, at a cost of 4 extra cloud calls per degraded synthesis. That isn't worth it for fields users rarely notice, compared with the now-restored species/brainRegions facts.
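For reference, the deferred follow-up might look roughly like the sketch below. It is not part of this PR. The `fetch_json` callable, the function names, and the class names passed in are all hypothetical; the only details taken from this PR are the URL shape and the envelope's `total` field (named in the commit message).

```python
from typing import Any, Callable, Dict, Mapping
from urllib.parse import quote, urlencode


def documents_path(dataset_id: str, class_name: str) -> str:
    """Path for a one-row page of documents of a single class."""
    query = urlencode({"class": class_name, "pageSize": 1})
    return f"/dataset/{quote(dataset_id, safe='')}/documents?{query}"


def total_from_envelope(envelope: Any) -> int:
    """Read the envelope's `total`; anything unexpected degrades to 0."""
    total = envelope.get("total") if isinstance(envelope, Mapping) else None
    if isinstance(total, bool) or not isinstance(total, int) or total < 0:
        return 0
    return total


def per_class_counts(
    fetch_json: Callable[[str], Any],
    dataset_id: str,
    class_by_field: Mapping[str, str],
) -> Dict[str, int]:
    """One extra call per count field; the caller supplies the class names."""
    return {
        field: total_from_envelope(fetch_json(documents_path(dataset_id, cls)))
        for field, cls in class_by_field.items()
    }
```

Each field costs one request, which is where the 4 extra calls per degraded synthesis come from.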

Coverage

  • `test_stage_1_counts_timeout_still_runs_stage_2_via_record_fields`: end-to-end pin asserting stage 2 attempts and succeeds despite counts timeout
  • `test_safe_record_int_handles_all_input_shapes`: defensive helper accepts any input shape and degrades to 0
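For illustration, a defensive helper with the behaviour that second test pins could look like this. The name `safe_record_int` and its signature are assumptions, as is the choice to treat every string (including numeric ones) as 0; the real helper may differ.

```python
from typing import Any


def safe_record_int(record: Any, key: str) -> int:
    """Return record[key] as a non-negative int, or 0 for any other shape."""
    if not isinstance(record, dict):
        return 0  # None, lists, strings, ...
    value = record.get(key)  # a missing key yields None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return 0
    return value if value >= 0 else 0


# The input shapes listed in the commit message, plus one valid value.
cases = [
    (None, 0),
    ({"numberOfSubjects": None}, 0),
    ({"numberOfSubjects": "many"}, 0),
    ({"numberOfSubjects": -5}, 0),
    ({}, 0),
    ({"numberOfSubjects": 1656}, 1656),
]
results = [safe_record_int(record, "numberOfSubjects") for record, _ in cases]
```

Degrading to 0 instead of raising keeps a malformed record from failing the whole synthesis; the cost is that a bad value is indistinguishable from a genuine zero.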

Test plan

  • `pytest backend/tests`: 557 passed, 1 skipped (opentelemetry)
  • `ruff check backend/`: clean
  • `mypy backend/`: 56 source files, no issues
  • Smoke-test on Railway post-deploy: confirm Jess Haley summary returns species + brainRegions populated despite counts-timeout warning

🤖 Generated with Claude Code

…times out

Smoke-test pass after PRs #101/#102/#103 found that for the largest
production datasets (Jess Haley 78k docs, Sophie Griswold 101k docs)
the /document-class-counts endpoint still exhausts its 20s deadline
on every synthesis attempt — the cloud's Mongo aggregation on
className without proper indexes simply doesn't fit in the budget.

Net effect: every cron warm cycle landed a DEGRADED summary with all
zero counts AND null per-class facts (because counts.subjects=0
short-circuited stage 2's openminds_subject fanout). The
DatasetSummaryCard rendered "0 sessions, 0 subjects, ..., Not
applicable" everywhere, even though the dataset record itself
reports `numberOfSubjects=1656` and `documentCount=78687`.

The frontend band-aid (Waltham-Data-Science/ndi-cloud-app#91) hides
this UX-side by enriching the degraded summary with record fields
client-side. This PR is the real fix at the BACKEND: when stage-1
counts times out, synthesize a counts envelope from the dataset
record's `numberOfSubjects` + `documentCount` and gate stage 2 from
those record fields directly. Stage 2 still attempts the per-class
fanouts (openminds_subject for species/strains/sexes,
probe_location for brainRegions, element for probeTypes), bounded
by their existing 25s per-class deadlines.

## Worst-case timing

  stage 1: 20s  (counts + dataset record in parallel — dataset
                 succeeds fast; counts hits the deadline)
  stage 2: 25s  (3 classes in parallel — each bounded by per-class
                 deadline; ndiquery for openminds_subject can succeed
                 even when class-counts is slow because they hit
                 different cloud endpoints)
  ontology: ~2s
  total:    ~47s   (well under Railway's 88s ceiling)

## What the user sees, before vs after

  Before:
    counts: { sessions: 0, subjects: 0, probes: 0, elements: 0,
              epochs: 0, totalDocuments: 0 }
    species: null
    brainRegions: null
    extractionWarnings: ["class counts query failed: ..."]

  After (Jess Haley):
    counts: { sessions: 0, subjects: 1656, probes: 0, elements: 0,
              epochs: 0, totalDocuments: 78687 }
    species: [Caenorhabditis elegans, Escherichia coli]  ← FROM cloud
    brainRegions: [whole nervous system]                  ← FROM cloud
    strains/sexes: real data when available
    extractionWarnings: ["class counts query failed: ..."]
    ↑ subjects + totalDocuments come from dataset record;
      species/brainRegions/strains/sexes come from REAL stage-2
      ndiquery+bulk_fetch (the per-class queries succeed in
      isolation; only /document-class-counts is pathologically slow).

## Caveat: per-class counts (sessions/probes/elements/epochs) stay 0

The dataset record doesn't expose these fields, so when stage-1
counts times out we can't populate them. They display as 0 with the
"X warnings" tooltip explaining the underlying cause.

A future iteration could optionally compute these from
`/dataset/:id/documents?class=...&pageSize=1` reading the response
envelope's `total` field, but that's 4 extra cloud calls per
degraded synthesis — not worth it for fields the user rarely
notices vs the now-restored species/brainRegions facts.

## Coverage

  - test_stage_1_counts_timeout_still_runs_stage_2_via_record_fields:
    end-to-end pin asserting that with counts timed out + record
    fields populated, stage 2 attempts and succeeds, species comes
    back populated.
  - test_safe_record_int_handles_all_input_shapes: defensive helper
    accepts any input shape (None, dict-with-null, dict-with-string,
    negative int, missing key) and degrades to 0.

## Pairs with frontend

Combined with ndi-cloud-app#91 (record-fallback enrichment) +
ndi-cloud-app#92 (progressive document loading), the user-perceived
loading experience for large datasets is now:

  • Hero band renders instantly (raw record fields)
  • Summary card renders with real subjects + totalDocuments + species
    + brainRegions within ~25-45s of cold synthesis
  • Document Explorer shows first 50 rows immediately, more as user
    scrolls
  • Subsequent viewers within 24h get sub-second cache-hit responses
    (PR #103 differential TTL holds full successes longer)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>