42 changes: 42 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,47 @@
# excel-to-engine — Changelog

## 2026-05-28 — refine consumes `_labels.json` + lazy numeric probes

`ete manifest refine` rebuilt a full label+numeric index over the **entire**
ground truth on every run (`buildIndex`), even though it only ever inspects
numerics on a *matched label's own row*. On big models the bulk of that work
indexed giant **unlabeled** grids (e.g. a 190 MB PP&E depreciation schedule)
that the refiner never consults — pure waste. (Investigation also found refine
did **not** consume the parser's `_labels.json` at all, despite that index
existing since V4.)

### What changed

- **Labels now come from `chunked/_labels.json`** when the parser emitted it —
an O(labels) read instead of scanning every cell. Legacy engines without the
index fall back to a one-time GT scan (`buildLabelIndex`), so nothing breaks.
- **Numerics are resolved lazily, per matched row**, by probing that row's
columns on demand (`numericsForRow`, memoized) — instead of bucketing every
numeric in a multi-million-cell workbook up front. The giant unlabeled grids
are never touched.
- **Behavior-preserving:** the candidate ranking, dedup, value-range, and
summary/rollup/hint logic are untouched. The full manifest + ship-ready
suites stay green.
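
The lazy, memoized row probe described above can be sketched on a toy ground truth. This is a minimal illustration, not the shipped code: the `MAX_COL` / `MAX_GAP` limits are deliberately tiny hypothetical values (the real constants are far larger), and the `lookups` counter exists only to make the cost visible.

```javascript
// Toy ground truth: one labeled row plus a small "unlabeled grid" that the
// probe should never visit.
const gt = {
  'Summary!A5': 'Net IRR',
  'Summary!C5': 0.182,
  'Summary!F5': 0.175,
  'Grid!B2': 1, 'Grid!C2': 2, 'Grid!B3': 3,
};

// 1 -> "A", 26 -> "Z", 27 -> "AA"
function numToCol(num) {
  let col = '';
  while (num > 0) {
    const rem = (num - 1) % 26;
    col = String.fromCharCode(65 + rem) + col;
    num = Math.floor((num - 1) / 26);
  }
  return col;
}

// Hypothetical small limits so the example is easy to trace by hand.
const MAX_COL = 50;
const MAX_GAP = 4;

let lookups = 0;            // instrumentation only
const rowCache = new Map(); // "sheet!row" -> [{ addr, value, col }]

function numericsForRow(sheet, row) {
  const key = `${sheet}!${row}`;
  if (rowCache.has(key)) return rowCache.get(key);
  const nums = [];
  let gap = 0;
  // Walk columns left to right; stop after MAX_GAP consecutive non-numerics.
  for (let c = 1; c <= MAX_COL && gap < MAX_GAP; c++) {
    const col = numToCol(c);
    const addr = `${sheet}!${col}${row}`;
    lookups++;
    const v = gt[addr];
    if (typeof v === 'number') {
      nums.push({ addr, value: v, col });
      gap = 0;
    } else {
      gap++;
    }
  }
  rowCache.set(key, nums);
  return nums;
}

const first = numericsForRow('Summary', 5);  // probes columns A..J
const lookupsAfterFirst = lookups;
const second = numericsForRow('Summary', 5); // served from the cache
```

Only addresses on `Summary` row 5 are ever looked up; the `Grid!…` cells are never read, and the second call adds no lookups.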

### Impact

The eliminated `buildIndex` pass scales with *total* cell count; the new probe
cost scales with *matched label rows* (a few dozen). On a synthetic giant-grid
ground truth the removed pass alone was ~1.4 s (1.4 M cells) / ~7.9 s (6.4 M
cells); end-to-end refine now finishes in less time than the old index build
took. The remaining floor is the unavoidable JSON parse of the ground truth — a
follow-up could lift that with a parser-emitted row-values artifact (see
ROADMAP), and the same lazy-numerics treatment could be extended to
`searchByLabel` (the `query` / `carry` path).
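
As a rough back-of-the-envelope check on that scaling claim, the probe's lookup count can be bounded from the two limits in the refine source. The sketch below is an estimate under stated assumptions: `matchedRows = 36` is a hypothetical stand-in for "a few dozen", and hash lookups are not directly comparable to the per-cell work of the old pass, so treat the numbers as orders of magnitude only.

```javascript
// Limits as defined in cli/commands/manifest-refine.mjs in this change.
const MAX_PROBE_COL = 16384;
const MAX_PROBE_GAP = 256;

// Upper bound on lookups for one row whose right-most numeric sits in column
// `lastNumericCol` (0 for a label-only row). If an internal gap reaches
// MAX_PROBE_GAP earlier, the probe stops sooner, so this remains an upper bound.
function maxLookupsForRow(lastNumericCol) {
  return Math.min(MAX_PROBE_COL, lastNumericCol + MAX_PROBE_GAP);
}

const matchedRows = 36; // hypothetical: "a few dozen" matched label rows
const totalCells = 6.4e6; // the larger synthetic ground truth quoted above

const labelOnlyLookups = matchedRows * maxLookupsForRow(0);             // 9216
const worstCaseLookups = matchedRows * maxLookupsForRow(MAX_PROBE_COL); // 589824

console.log({ labelOnlyLookups, worstCaseLookups, totalCells });
```

Even the worst case (every matched row populated out to the last column) stays well under the total cell count of the 6.4 M-cell example.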

### Tests

- `tests/cli/test-refine-label-index.mjs` (14), wired into `npm test`:
correctness off `_labels.json`; **parity** between the index path and the
GT-scan fallback; lazy-probe far/gapped columns + value ranges; and a
**consumption proof** — a label present only in the index (not as a GT
string) is still resolved, which the fallback provably cannot do.
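
The idea behind the consumption proof can be illustrated with a self-contained toy. This is not the actual test file: `labelsFromIndex` and `scanLabels` are hypothetical stand-ins, the index key (`'net irr'`) is an assumption, and only the entry shape (`{ sheet, col, row, text }`) is taken from how `buildIndex` iterates the index.

```javascript
// Hypothetical label index: values are arrays of { sheet, col, row, text }.
// The key format is an assumption made for this sketch.
const labelIndex = {
  'net irr': [{ sheet: 'Summary', col: 'A', row: 5, text: 'Net IRR' }],
};

// Ground truth in which the label cell is NOT present as a string,
// but the same row still carries a numeric value.
const gt = { 'Summary!C5': 0.182 };

// Stand-in for the index path: flatten index entries into label records.
function labelsFromIndex(index) {
  const out = [];
  for (const entries of Object.values(index)) {
    for (const e of entries) {
      out.push({ addr: `${e.sheet}!${e.col}${e.row}`, text: e.text });
    }
  }
  return out;
}

// Stand-in for the GT-scan fallback: collect string cells of plausible
// label length (the same length filter the old buildIndex used).
function scanLabels(groundTruth) {
  const out = [];
  for (const [addr, val] of Object.entries(groundTruth)) {
    if (typeof val === 'string' && val.length > 2 && val.length < 200) {
      out.push({ addr, text: val });
    }
  }
  return out;
}

const viaIndex = labelsFromIndex(labelIndex); // finds Summary!A5
const viaScan = scanLabels(gt);               // finds nothing
```

If refine resolves the field in this setup, the label can only have come from the index, which is what makes the check a proof of consumption rather than of mere parity.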

## 2026-05-28 — Continuous integration (GitHub Actions)

The test suite is now substantial (132 JS assertions across 7 suites, plus the
14 changes: 14 additions & 0 deletions PLAN.md
@@ -1,5 +1,19 @@
# excel-to-engine — Plan

## Status: refine label-index optimization — landed 2026-05-28

`ete manifest refine` now sources labels from the parser's `_labels.json`
(O(labels), no full GT scan) and resolves same-row numerics lazily by probing,
instead of bucketing every numeric in the workbook up front (`buildIndex`). The
giant unlabeled grids that dominate big models — the very thing that made refine
slow — are no longer touched. Behavior-preserving (rankings unchanged; suites
green). New `tests/cli/test-refine-label-index.mjs` (14) proves consumption +
parity. The remaining cost floor is the ground-truth JSON parse; lifting that
would need a parser-emitted row-values artifact (Tier B). The same lazy-numerics
treatment is still open for `searchByLabel` (the `query`/`carry` path), and the
per-command GT re-parse multiplier in `init` (generate → refine → doctor → maps
each reload the GT) remains a separate follow-up.

## Status: Continuous integration — landed 2026-05-28

`.github/workflows/ci.yml` runs the full test matrix (Rust build + 11 unit
18 changes: 15 additions & 3 deletions ROADMAP.md
@@ -76,9 +76,21 @@ when we next touch the monitor server or auth surface.
### Manifest Refinement (continuing)
- Model-family templates — recognize a family by its sheet signature and pick
known cells directly (summary tabs, promote tab, etc.).
- Pre-indexed label→cell map built once during parsing (the session log noted
`manifest refine` took 2.5 min CPU on a 200 MB ground truth; a pre-index
from the Rust parser would cut this 10–100×).
- Pre-indexed label→cell map.
- **Done (2026-05-28):** `ete manifest refine` now consumes the parser's
`chunked/_labels.json` for labels (it previously ignored it and rebuilt a
full label+numeric index over the whole GT) and resolves same-row numerics
lazily by probing — so it no longer indexes the giant unlabeled grids that
dominate big models. The removed `buildIndex` pass was ~7.9 s on a 6.4 M-cell
GT; the work skipped scales with total cell count. `test-refine-label-index`.
- **Still open (Tier B):** the remaining floor is the ground-truth JSON parse.
A parser-emitted *row-values* artifact (numerics for label-bearing rows
only) would let refine skip the GT entirely. That is a large win on
giant-grid models, but on dense-label models the artifact would be roughly
GT-sized (no win), so gate it on a real-model size measurement first.
- **Still open:** apply the same lazy-numerics path to `searchByLabel`
(`query` / `carry`), and build the GT index *once* per `init` so
generate → refine → doctor → maps stop each re-parsing it.
- Manifest migration tooling for model updates (vN → vN+1 shape diff).

---
109 changes: 82 additions & 27 deletions cli/commands/manifest-refine.mjs
@@ -11,7 +11,10 @@

import { readFileSync, writeFileSync, existsSync } from 'fs';
import { join } from 'path';
import { loadManifest, loadGroundTruth, resolveCell, MANIFEST_VERSION } from '../../lib/manifest.mjs';
import {
loadManifest, loadGroundTruth, resolveCell, MANIFEST_VERSION,
loadLabelIndex, buildLabelIndex,
} from '../../lib/manifest.mjs';

// ---------------------------------------------------------------------------
// Required fields and their search strategies
@@ -100,34 +103,73 @@ const REQUIRED_FIELDS = [
},
];

// MAX_PROBE_COL is Excel's hard column ceiling (XFD = 16384). numericsForRow
// probes a row's columns left-to-right and stops after MAX_PROBE_GAP
// consecutive columns that hold no numeric. That gap is generous enough to span
// any realistic financial layout (far-right restated copies lose to the
// canonical leftmost cell in ranking anyway), and it bounds the probe cost on a
// label-only row to MAX_PROBE_GAP (256) hash lookups.
const MAX_PROBE_COL = 16384;
const MAX_PROBE_GAP = 256;

/**
* Build a pre-index of the ground truth for fast searching.
* Groups string labels by sheet+row and numeric values by sheet+row.
* Build a search index over the ground truth.
*
* Labels come from the Rust parser's pre-built index (`chunked/_labels.json`)
* when present — an O(labels) read instead of scanning every cell — and fall
* back to a one-time ground-truth scan (`buildLabelIndex`) for legacy engines
* that predate the index.
*
* Numeric values are resolved **lazily, per matched row**, by direct probing
* (see `numericsForRow`). The refiner only ever inspects numerics on a label's
* own row, so the old approach — bucketing every numeric in a multi-million-cell
* workbook up front — was almost entirely wasted: on a big model the bulk of
* those cells live in giant *unlabeled* grids (e.g. a PP&E depreciation
* schedule) the refiner never consults. Skipping that build is the win; the
* one remaining full pass is the unavoidable JSON parse of the ground truth.
*
* @param {Object} gt - Ground truth { addr: value }
* @param {string} [modelDir] - Model dir, for loading `_labels.json`
* @returns {{ labels: Array, numericsForRow: (sheet: string, row: number) => Array }}
*/
function buildIndex(gt) {
const labels = []; // { addr, text, sheet, col, row }
const numsByRow = {}; // "sheet!row" → [{ addr, value, col }]

for (const [addr, val] of Object.entries(gt)) {
const bang = addr.lastIndexOf('!');
if (bang < 0) continue;
const sheet = addr.substring(0, bang);
const cellPart = addr.substring(bang + 1);
const match = cellPart.match(/^([A-Z]+)(\d+)$/);
if (!match) continue;
const col = match[1];
const row = parseInt(match[2], 10);
const rowKey = `${sheet}!${row}`;

if (typeof val === 'string' && val.length > 2 && val.length < 200) {
labels.push({ addr, text: val, sheet, col, row, rowKey });
} else if (typeof val === 'number') {
if (!numsByRow[rowKey]) numsByRow[rowKey] = [];
numsByRow[rowKey].push({ addr, value: val, col });
function buildIndex(gt, modelDir) {
const labelIndex = (modelDir && loadLabelIndex(modelDir)) || buildLabelIndex(gt);
const labels = [];
for (const entries of Object.values(labelIndex)) {
for (const e of entries) {
labels.push({
addr: `${e.sheet}!${e.col}${e.row}`,
text: e.text,
sheet: e.sheet,
col: e.col,
row: e.row,
rowKey: `${e.sheet}!${e.row}`,
});
}
}

return { labels, numsByRow };
const rowCache = new Map(); // "sheet!row" → [{ addr, value, col }]
function numericsForRow(sheet, row) {
const key = `${sheet}!${row}`;
const cached = rowCache.get(key);
if (cached) return cached;
const nums = [];
let gap = 0;
for (let c = 1; c <= MAX_PROBE_COL && gap < MAX_PROBE_GAP; c++) {
const col = numToCol(c);
const addr = `${sheet}!${col}${row}`;
const v = gt[addr];
if (typeof v === 'number') {
nums.push({ addr, value: v, col });
gap = 0;
} else {
gap++;
}
}
rowCache.set(key, nums);
return nums;
}

return { labels, numericsForRow };
}

/**
@@ -141,8 +183,9 @@ export function runManifestRefine(modelDir, args) {
const manifest = loadManifest(modelDir);
const gt = loadGroundTruth(manifest, modelDir);

// Pre-index for fast searching (single pass over GT)
const index = buildIndex(gt);
// Pre-index for fast searching. Labels come from `_labels.json` when the
// parser emitted it (no GT scan); numerics are probed lazily per matched row.
const index = buildIndex(gt, modelDir);

// Resolve refinement hints: either passed in via args.hints (used by init
// when a template has been applied), or read from a hand-edited manifest
@@ -279,7 +322,7 @@ function searchForFieldIndexed(index, field, opts = {}) {

// Pass 2: For each matching label, select the best same-row numeric cell.
for (const lm of labelMatches) {
const rowNums = index.numsByRow[lm.rowKey] || [];
const rowNums = index.numericsForRow(lm.sheet, lm.row);
const labelColNum = colToNum(lm.col);

const inRange = rowNums.filter(n => {
@@ -443,3 +486,15 @@ function colToNum(col) {
}
return num;
}

// Inverse of colToNum: 1 → "A", 26 → "Z", 27 → "AA". Used by numericsForRow to
// reconstruct cell addresses when probing a row's columns.
function numToCol(num) {
let col = '';
while (num > 0) {
const rem = (num - 1) % 26;
col = String.fromCharCode(65 + rem) + col;
num = Math.floor((num - 1) / 26);
}
return col;
}
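
A quick round-trip check of the column helpers, as a hedged sketch: `numToCol` is copied from the hunk above, while the `colToNum` here is an assumed base-26 implementation written for this example (the real function's body is only partly visible in the diff).

```javascript
// Assumed implementation of colToNum for this sketch: "A" -> 1, "AA" -> 27.
function colToNum(col) {
  let num = 0;
  for (let i = 0; i < col.length; i++) {
    num = num * 26 + (col.charCodeAt(i) - 64);
  }
  return num;
}

// Copied from the change above: 1 -> "A", 26 -> "Z", 27 -> "AA".
function numToCol(num) {
  let col = '';
  while (num > 0) {
    const rem = (num - 1) % 26;
    col = String.fromCharCode(65 + rem) + col;
    num = Math.floor((num - 1) / 26);
  }
  return col;
}

// Boundary columns, including Excel's last column (16384 -> "XFD").
const samples = [1, 26, 27, 702, 703, 16384];
const letters = samples.map(numToCol); // A, Z, AA, ZZ, AAA, XFD
const roundTrip = samples.every(n => colToNum(numToCol(n)) === n);
```

The interesting cases are the rollovers (26 to 27, 702 to 703), where an off-by-one in the `num - 1` adjustment would surface.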
2 changes: 1 addition & 1 deletion package.json
@@ -41,7 +41,7 @@
"test:engine": "node pipelines/rust/tests/test-engine-runtime.mjs",
"test:depgraph": "node pipelines/rust/tests/test-dependency-graph.mjs",
"test:slimming": "node tests/cli/test-artifact-slimming.mjs",
"test": "node tests/cli/test-cli.mjs && node tests/cli/test-manifest-improvements.mjs && node tests/cli/test-manifest-maps.mjs && node tests/cli/test-ai-interface.mjs && node tests/cli/test-e2e4-fixes.mjs && node tests/cli/test-ship-ready.mjs && node tests/cli/use-case-suite.mjs"
"test": "node tests/cli/test-cli.mjs && node tests/cli/test-manifest-improvements.mjs && node tests/cli/test-manifest-maps.mjs && node tests/cli/test-refine-label-index.mjs && node tests/cli/test-ai-interface.mjs && node tests/cli/test-e2e4-fixes.mjs && node tests/cli/test-ship-ready.mjs && node tests/cli/use-case-suite.mjs"
},
"devDependencies": {}
}
6 changes: 6 additions & 0 deletions skill/SKILL.md
@@ -108,6 +108,12 @@ Silently falls through to a normal parse if `chunked/_ground-truth.json` is
missing — safe to default on when iterating. Turns the tighten-the-manifest
loop from minutes to seconds.

The refine step inside that loop is also faster on big models: it reads labels
from the parser's `chunked/_labels.json` and probes only the matched rows for
values, instead of indexing every cell (it used to scan the whole ground truth,
including giant unlabeled grids it never consults). Transparent — same command,
same result.

**Default output is slim.** `ete init` drops the large debug/intermediate
artifacts (`dependency-graph.json`, `_graph.json`, root `model-map.json`) once
the dependency closures are baked into `named-outputs.json` / `named-inputs.json`.