perf(refine): consume _labels.json + lazy numeric probes#19
Merged
Conversation
…T index `ete manifest refine` rebuilt a full label+numeric index over the entire ground truth on every run (buildIndex), even though it only ever inspects numerics on a matched label's own row. On big models the bulk of that work indexed giant *unlabeled* grids (e.g. a PP&E schedule) the refiner never consults. Now buildIndex: - sources labels from the Rust parser's chunked/_labels.json when present (O(labels), no GT scan), falling back to buildLabelIndex(gt) for legacy engines that predate the index; - resolves same-row numerics lazily by probing the row's columns on demand (numericsForRow), memoized per row, stopping after a long empty-column run. Behavior-preserving: ranking/dedup/value-range logic is untouched, so the existing manifest/ship-ready suites stay green. The remaining full pass is the unavoidable JSON parse of the ground truth (a follow-up could lift that with a parser-emitted row-values artifact; see ROADMAP). New tests/cli/test-refine-label-index.mjs (14): correctness off _labels.json, parity between the index path and the GT-scan fallback, lazy-probe far/gapped columns + value ranges, and a consumption proof (a label present only in the index — not as a GT string — is still resolved; the fallback provably cannot). Wired into `npm test`. Measured (synthetic giant-grid GT): the eliminated buildIndex pass alone was ~1.4s on 1.4M cells / ~7.9s on 6.4M cells; new refine completes end-to-end in less time than the old index build took, and the skipped work scales with total cell count. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CHANGELOG + PLAN entries for the _labels.json consumption + lazy numeric probes. ROADMAP: mark the pre-indexed label->cell item done for refine, with Tier B (parser-emitted row-values artifact) and the searchByLabel / init single-index follow-ups called out. SKILL: note refine is faster on big models (transparent). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
You asked me to dig into the label→cell pre-index and verify how much
refinealready leverages_labels.json. The answer was: zero.manifest refinerebuilt a full label+numeric index over the entire ground truth on every run (buildIndex), even though it only ever inspects numerics on a matched label's own row. On big models the bulk of that work indexed giant unlabeled grids (e.g. a 190 MB PP&E depreciation schedule) the refiner never consults — pure waste. And_labels.json, which has existed since V4, wasn't consumed by refine at all.This PR fixes that, behavior-preserving.
What changed (
cli/commands/manifest-refine.mjs)chunked/_labels.jsonwhen present — anO(labels)read instead of scanning every cell. Legacy engines without the index fall back to a one-time GT scan (buildLabelIndex), so nothing breaks.numericsForRow, memoized): probe that row's columns on demand, stopping after a long empty-column run — instead of bucketing every numeric in a multi-million-cell workbook up front. The giant unlabeled grids are never touched.Why this is the right scope (and what it isn't)
The naive idea — "just wire
_labels.jsonin" — doesn't work alone, because_labels.jsonhas labels only, no numbers, and refine needs same-row numerics for value-range validation + base-case values. So the win comes from also making numerics lazy. The remaining cost floor is the unavoidable JSON parse of the ground truth.Two deliberate follow-ups (in ROADMAP, not this PR):
searchByLabel(query/carry), and build the GT index once perinit(today generate → refine → doctor → maps each re-parse the GT — the real source of the "~2.5 min" loop).Impact
The eliminated
buildIndexpass scales with total cell count; the new probe cost scales with matched label rows (a few dozen). Measured on a synthetic giant-grid GT: the removed pass alone was ~1.4 s (1.4 M cells) / ~7.9 s (6.4 M cells); end-to-end refine now finishes in less time than the old index build took, projecting ~11 s → ~3.6 s on a 200 MB GT and a larger relative win as the unlabeled grid grows.Tests —
tests/cli/test-refine-label-index.mjs(14), wired intonpm test_labels.json;The existing
test-artifact-slimmingsuite already runsrunInitend-to-end through the real parser, so refine consuming a real_labels.jsonis covered there too.Validation (local)
npm test(all 8 JS suites incl. the new 14) +smoke78/78 +test:depgraph11 +test:engine21 +test:slimming13 — all green. CI (added in #18) will re-run the matrix on this PR.🤖 Generated with Claude Code