perf(refine): consume _labels.json + lazy numeric probes by ebootheee · Pull Request #19 · ebootheee/excel-to-engine

ebootheee · 2026-05-28T20:49:43Z

Summary

You asked me to dig into the label→cell pre-index and verify how much refine already leverages _labels.json. The answer: not at all. `manifest refine` rebuilt a full label+numeric index over the entire ground truth (GT) on every run (buildIndex), even though it only ever inspects numerics on a matched label's own row. On big models, most of that work went into indexing giant unlabeled grids (e.g. a 190 MB PP&E depreciation schedule) that the refiner never consults. And _labels.json, which has existed since V4, wasn't consumed by refine at all.

This PR fixes both problems without changing behavior.

What changed (`cli/commands/manifest-refine.mjs`)

Labels are read from chunked/_labels.json when present: an O(labels) read instead of a scan of every cell. Legacy engines without the index fall back to a one-time GT scan (buildLabelIndex), so nothing breaks.
Numerics are resolved lazily, per matched row (numericsForRow, memoized): the row's columns are probed on demand, stopping after a long run of empty columns, instead of bucketing every numeric in a multi-million-cell workbook up front. The giant unlabeled grids are never touched.
Candidate ranking, dedup, value-range, and summary/rollup/hint logic are untouched.
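
A minimal sketch of the lazy, memoized row probe described above. The GT shape (a flat object keyed by "Sheet!A1"-style addresses), the `makeNumericsForRow` factory, `columnLetters`, and the `MAX_EMPTY_RUN` threshold are all assumptions made for illustration; the actual numericsForRow in `cli/commands/manifest-refine.mjs` may differ.

```javascript
// Hypothetical sketch: the GT shape ("Sheet!A1" -> value) and the
// MAX_EMPTY_RUN threshold are assumptions, not the PR's actual code.
const MAX_EMPTY_RUN = 50;

// Convert a 1-based column number to spreadsheet letters (1 -> A, 27 -> AA).
function columnLetters(n) {
  let s = '';
  while (n > 0) {
    const rem = (n - 1) % 26;
    s = String.fromCharCode(65 + rem) + s;
    n = Math.floor((n - 1) / 26);
  }
  return s;
}

function makeNumericsForRow(gt, maxEmptyRun = MAX_EMPTY_RUN) {
  const cache = new Map(); // memo: "Sheet!row" -> [{ col, value }]
  let probes = 0;          // counts cell lookups, to show memoization works

  function numericsForRow(sheet, row) {
    const key = `${sheet}!${row}`;
    if (cache.has(key)) return cache.get(key);

    const found = [];
    let emptyRun = 0;
    // Walk the row left to right; give up after a long run of empty cells.
    for (let c = 1; emptyRun < maxEmptyRun; c++) {
      probes++;
      const col = columnLetters(c);
      const v = gt[`${sheet}!${col}${row}`];
      if (v === undefined || v === null || v === '') {
        emptyRun++;
        continue;
      }
      emptyRun = 0; // any non-empty cell (label or number) resets the run
      if (typeof v === 'number' && Number.isFinite(v)) found.push({ col, value: v });
    }
    cache.set(key, found);
    return found;
  }

  numericsForRow.probeCount = () => probes;
  return numericsForRow;
}

// Usage: only row 5 of "PnL" is probed; the "Grid" sheet is never touched.
const gt = { 'PnL!A5': 'Revenue', 'PnL!C5': 100, 'PnL!H5': 250, 'Grid!B9000': 1 };
const numericsForRow = makeNumericsForRow(gt, 5);
console.log(numericsForRow('PnL', 5)); // the numerics in columns C and H
```

The empty-run cutoff is what keeps the cost proportional to the matched row rather than the sheet width. Its trade-off is that a value sitting beyond a gap longer than the threshold would be missed, so the threshold has to be generous.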

Why this is the right scope (and what it isn't)

The naive idea ("just wire _labels.json in") doesn't work on its own, because _labels.json holds labels only, no numbers, and refine needs same-row numerics for value-range validation and base-case values. So the win comes from also making numerics lazy. The remaining cost floor is the unavoidable JSON parse of the ground truth.
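
To illustrate why the label index alone is not enough, here is a hedged sketch of value-range filtering over label matches. The entry shape (`label`, `sheet`, `row`), the `candidatesInRange` helper, the `baseCase` field name, and the range object are hypothetical; the point is only that each matched label needs numbers from its own row before it can be validated.

```javascript
// Hypothetical shapes: label entries and the range object are illustrative,
// not the actual _labels.json schema.
function candidatesInRange(labelEntries, rowNumerics, wanted, range) {
  const needle = wanted.toLowerCase();
  const out = [];
  for (const entry of labelEntries) {
    if (!entry.label.toLowerCase().includes(needle)) continue;
    // The label index carries no numbers, so values come from the row itself.
    const values = rowNumerics(entry.sheet, entry.row)
      .filter((v) => v >= range.min && v <= range.max);
    // Keep the candidate only if its row holds a plausible value.
    if (values.length > 0) out.push({ ...entry, baseCase: values[0] });
  }
  return out;
}

// Usage with a stubbed row lookup (a stand-in for a lazy probe).
const labels = [
  { label: 'Exit Multiple', sheet: 'Assumptions', row: 12 },
  { label: 'Exit Multiple (header)', sheet: 'Summary', row: 3 },
];
const rows = new Map([['Assumptions!12', [9.5, 10.0]], ['Summary!3', []]]);
const rowNumerics = (sheet, row) => rows.get(`${sheet}!${row}`) ?? [];

const hits = candidatesInRange(labels, rowNumerics, 'exit multiple', { min: 1, max: 30 });
console.log(hits); // one candidate: the Assumptions row, baseCase 9.5
```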

Two deliberate follow-ups (in ROADMAP, not this PR):

Tier B: a parser-emitted row-values artifact (numerics for label-bearing rows only) would let refine skip the GT parse entirely. That is a big win on giant-grid models, but on dense-label models the artifact would be roughly GT-sized (no win), so it should be gated on a size measurement from a real model.
Apply the same lazy-numerics path to searchByLabel (query/carry), and build the GT index once per init. Today generate → refine → doctor → maps each re-parse the GT, which is the real source of the "~2.5 min" loop.
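
The "build the GT index once per init" follow-up could look roughly like the sketch below. It is not part of this PR: `makeIndexCache`, the injected builder, and the path are hypothetical, and the real init pipeline may share state differently.

```javascript
// Hypothetical sketch of a parse-once cache shared by the init steps.
function makeIndexCache(buildIndexFromPath) {
  const cache = new Map(); // GT path -> built index
  return function getIndex(gtPath) {
    if (!cache.has(gtPath)) cache.set(gtPath, buildIndexFromPath(gtPath));
    return cache.get(gtPath);
  };
}

// Usage: four steps ask for the index, but the (stubbed) build runs once.
let builds = 0;
const getIndex = makeIndexCache((path) => {
  builds++; // stands in for the expensive JSON parse + index build
  return { path, labels: [] };
});

for (const step of ['generate', 'refine', 'doctor', 'maps']) {
  const index = getIndex('model/ground-truth.json'); // illustrative path
  console.log(step, index.path);
}
console.log(builds); // 1
```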

Impact

The eliminated buildIndex pass scales with total cell count; the new probe cost scales with the number of matched label rows (a few dozen). Measured on a synthetic giant-grid GT: the removed pass alone took ~1.4 s at 1.4 M cells and ~7.9 s at 6.4 M cells, and end-to-end refine now finishes in less time than the old index build took. That projects to ~11 s → ~3.6 s on a 200 MB GT, with a larger relative win as the unlabeled grid grows.
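
As a quick sanity check on the figures above, the arithmetic below derives the per-cell cost of the removed pass and the projected speedup. It only uses the numbers quoted in this section; the variable names are illustrative.

```javascript
// Figures quoted above for the removed buildIndex pass (synthetic GT).
const measurements = [
  { cells: 1.4e6, seconds: 1.4 },
  { cells: 6.4e6, seconds: 7.9 },
];

// Cost per cell in microseconds: about 1.00 and about 1.23, so the removed
// pass grew at least linearly with total cell count across the two sizes.
const perCellMicros = measurements.map((m) => (m.seconds / m.cells) * 1e6);

// Projected end-to-end change on a 200 MB GT: ~11 s -> ~3.6 s, about 3.1x.
const projectedSpeedup = 11 / 3.6;

console.log(perCellMicros.map((x) => x.toFixed(2)), projectedSpeedup.toFixed(1));
```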

Tests — `tests/cli/test-refine-label-index.mjs` (14), wired into `npm test`

correctness when labels are read from _labels.json;
parity: identical mappings via the index path vs. the GT-scan fallback (the optimization changes speed, not results);
lazy probing of far/gapped columns, plus value-range filtering;
consumption proof: a label present only in the index (not as a GT string) is still resolved, which the fallback provably cannot do.
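
The parity and consumption-proof ideas can be sketched as follows. This is not the code in `tests/cli/test-refine-label-index.mjs`: the data shapes, `scanLabelsFromGt` (a stand-in for the buildLabelIndex fallback), `loadLabels`, and `resolveLabel` are hypothetical.

```javascript
// Hypothetical fallback: scan every GT cell and keep non-empty strings as labels.
function scanLabelsFromGt(gt) {
  const out = [];
  for (const [addr, v] of Object.entries(gt)) {
    if (typeof v === 'string' && v.trim() !== '') out.push({ label: v, addr });
  }
  return out;
}

// Prefer the pre-built index when present; otherwise fall back to the scan.
function loadLabels(labelIndex, gt) {
  return labelIndex ?? scanLabelsFromGt(gt);
}

// Resolve a label to the sorted list of addresses where it appears.
function resolveLabel(labels, wanted) {
  const needle = wanted.trim().toLowerCase();
  return labels
    .filter((l) => l.label.trim().toLowerCase() === needle)
    .map((l) => l.addr)
    .sort();
}

const gt = { 'PnL!A5': 'Revenue', 'PnL!C5': 100, 'PnL!A6': 'EBITDA', 'PnL!C6': 40 };
const index = [
  { label: 'Revenue', addr: 'PnL!A5' },
  { label: 'EBITDA', addr: 'PnL!A6' },
  { label: 'Net Debt', addr: 'BS!A9' }, // present only in the index
];

const viaIndex = loadLabels(index, gt);
const viaScan = loadLabels(null, gt);

// Parity: labels known to both paths resolve to the same addresses.
for (const name of ['Revenue', 'EBITDA']) {
  const same =
    JSON.stringify(resolveLabel(viaIndex, name)) === JSON.stringify(resolveLabel(viaScan, name));
  console.log(name, same); // true for both
}

// Consumption proof: only the index path can see an index-only label.
console.log(resolveLabel(viaIndex, 'Net Debt').length, resolveLabel(viaScan, 'Net Debt').length); // 1 0
```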

The existing test-artifact-slimming suite already runs runInit end-to-end through the real parser, so refine consuming a real _labels.json is covered there too.

Validation (local)

npm test (all 8 JS suites incl. the new 14) + smoke 78/78 + test:depgraph 11 + test:engine 21 + test:slimming 13 — all green. CI (added in #18) will re-run the matrix on this PR.

🤖 Generated with Claude Code

…T index

`ete manifest refine` rebuilt a full label+numeric index over the entire ground truth on every run (buildIndex), even though it only ever inspects numerics on a matched label's own row. On big models the bulk of that work indexed giant *unlabeled* grids (e.g. a PP&E schedule) the refiner never consults.

Now buildIndex:
- sources labels from the Rust parser's chunked/_labels.json when present (O(labels), no GT scan), falling back to buildLabelIndex(gt) for legacy engines that predate the index;
- resolves same-row numerics lazily by probing the row's columns on demand (numericsForRow), memoized per row, stopping after a long empty-column run.

Behavior-preserving: ranking/dedup/value-range logic is untouched, so the existing manifest/ship-ready suites stay green. The remaining full pass is the unavoidable JSON parse of the ground truth (a follow-up could lift that with a parser-emitted row-values artifact; see ROADMAP).

New tests/cli/test-refine-label-index.mjs (14): correctness off _labels.json, parity between the index path and the GT-scan fallback, lazy-probe far/gapped columns + value ranges, and a consumption proof (a label present only in the index, not as a GT string, is still resolved; the fallback provably cannot). Wired into `npm test`.

Measured (synthetic giant-grid GT): the eliminated buildIndex pass alone was ~1.4s on 1.4M cells / ~7.9s on 6.4M cells; new refine completes end-to-end in less time than the old index build took, and the skipped work scales with total cell count.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

CHANGELOG + PLAN entries for the _labels.json consumption + lazy numeric probes.
ROADMAP: mark the pre-indexed label->cell item done for refine, with Tier B (parser-emitted row-values artifact) and the searchByLabel / init single-index follow-ups called out.
SKILL: note refine is faster on big models (transparent).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ebootheee and others added 3 commits May 28, 2026 14:46

Merge remote-tracking branch 'origin/main' into feat/refine-label-index

c1ff338

ebootheee merged commit 480d27e into main May 28, 2026
2 checks passed

ebootheee merged 3 commits into main from feat/refine-label-index
