feat(classifier): ship model trained on real Gemma 4 26B measurements #1
Replaces the 1000-row synthetic-large.csv training source with a 50-row measurements.csv produced by collect_measurements.py against gemma4:26b (num_ctx=40000) running locally via Ollama. Each vault is graded on 8 cells (sizes 16K/32K × structured/shuffled × 2 needles) at the middle position; observed_loss = 1 - accuracy. Status moves 🟡 → 🟢.

Random Forest wins on real data (R²=0.5827, MAE=0.1386 on a held-out 10-row split). Ridge collapses from 0.85 on the synthetic baseline to 0.14 here, confirming the 5-factor → loss surface has interactions a linear model cannot capture.

Reference correlations across all 50 rows:
- semantic_redundancy +0.56
- token_volume +0.43
- structural_coherence +0.30 (Chroma's structuring effect, on a real model)
- distractor_density +0.21
- freshness_spread −0.17 (no signal, as expected: the grader has no notion of mtime)

model.json now records grader_models and offline_rows so every published model is auditable. ONNX round-trip max delta 1.6e-08.

Pipeline robustness:
- replication/run.py: MMB_LLM_TIMEOUT_S env var (default 600s) + retry once on TimeoutError/URLError. Required for ≥32K-token prompts on a local 26B model where prompt_eval can exceed the prior 120s ceiling.
- collect_measurements.py: CSV flush after every variant + try/except per variant. A single failed variant no longer aborts a multi-hour run.
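The two pipeline-robustness changes could look roughly like the sketch below. It is a minimal illustration and not the code in this PR: `llm_timeout_s`, `call_with_one_retry`, `collect`, and the two-column CSV layout are hypothetical names chosen for the example, and the retry helper re-raises after the second failure, whereas the PR's `grade_with_llm` returns an `llm-error` note.

```python
import csv
import os
import urllib.error
import urllib.request

DEFAULT_TIMEOUT_S = 600.0  # default quoted in the PR description


def llm_timeout_s(env=None):
    """Read MMB_LLM_TIMEOUT_S, falling back to the 600 s default."""
    env = os.environ if env is None else env
    return float(env.get("MMB_LLM_TIMEOUT_S", DEFAULT_TIMEOUT_S))


def call_with_one_retry(request, opener=urllib.request.urlopen, env=None):
    """Hypothetical helper: one attempt plus one retry on timeout/URL errors."""
    timeout = llm_timeout_s(env)
    last_err = None
    for _attempt in range(2):  # first try + a single retry
        try:
            return opener(request, timeout=timeout)
        except (TimeoutError, urllib.error.URLError) as e:
            last_err = e
    # Assumption of this sketch: surface the failure instead of scoring it.
    raise last_err


def collect(variants, measure, out_file):
    """Write one CSV row per variant, flushing each row and isolating failures."""
    writer = csv.DictWriter(out_file, fieldnames=["variant", "observed_loss"])
    writer.writeheader()
    failed = []
    for variant in variants:
        try:
            row = measure(variant)
        except Exception as e:  # a single bad variant must not abort the run
            failed.append((variant, repr(e)))
            continue
        writer.writerow(row)
        out_file.flush()  # rows written so far survive a later crash
    return failed
```

Passing `opener` and `measure` as parameters keeps the sketch testable without a running Ollama instance.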
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bb235b6a66
                last_err = e
                continue
        else:
            return False, f"llm-error: {last_err!r}"
Raise retry-exhausted LLM errors instead of scoring them as misses
When both LLM attempts fail, grade_with_llm now returns False with an llm-error note, which is then treated by collect_measurements._measure_variant as a normal incorrect answer and folded into observed_loss. In runs with transient/network/auth/API failures (e.g., 429/5xx/invalid endpoint), this silently corrupts labels by inflating loss instead of skipping/failing the measurement, so the exported measurements.csv can look valid while containing transport errors as training targets.
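One way to act on this suggestion is sketched below, under stated assumptions: a cell whose LLM call failed is represented as `None` rather than `False`, and `GraderTransportError` and `observed_loss` are hypothetical names for the example, not identifiers from this repository. The loss itself follows the PR's definition, observed_loss = 1 - accuracy.

```python
class GraderTransportError(RuntimeError):
    """Hypothetical exception: the grader could not be reached for a cell."""


def observed_loss(cell_results):
    """Return 1 - accuracy over a variant's graded cells.

    cell_results holds True/False for graded cells and None for cells whose
    LLM call failed after the retry. A None is treated as a transport error,
    never as an incorrect answer.
    """
    if not cell_results:
        raise ValueError("no cells to score")
    errored = sum(1 for r in cell_results if r is None)
    if errored:
        raise GraderTransportError(f"{errored} cell(s) failed to grade")
    accuracy = sum(cell_results) / len(cell_results)
    return 1.0 - accuracy
```

A caller that wraps each variant in try/except can then skip the row, so transport failures never reach `measurements.csv` as training targets.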
…ruff B007

The retry loop in `grade_with_llm` only branches on success/failure of the urlopen call; the `attempt` index is intentionally unused. Renaming to `_attempt` documents the intent and silences ruff's B007 finding, which was the lone remaining failure on the python CI job for this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lestone

The PR #1 merge brings the classifier from a synthetic-only baseline to a 50-row dataset of real (signature, observed_loss) pairs graded by gemma4:26b. README's scientific-posture table now describes the calibrated state instead of the planned-calibration state, and the CHANGELOG's [Unreleased] section captures the data-collection win, the model-family ranking flip, and the pipeline-robustness fixes that shipped alongside the data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ens) in model.json
Walking the audit trail back from a published model.onnx, you used to
hit a wall at the model.json's `data_source: measurements.csv` --
nothing said how that CSV had been produced, how long it took, or how
many LLM tokens it cost. Adding that surface so future reviewers can
verify reproducibility without reconstructing it from terminal logs.
What ships:
research/replication/run.py
grade_with_llm now returns (correct, notes, tokens_in, tokens_out)
by reading the OpenAI-compat usage block on the response. The Cell
dataclass gains optional tokens_in / tokens_out fields. Offline
cells leave them at None.
research/classifier/collect_measurements.py
Aggregates per-variant wall_clock_ms + tokens_in + tokens_out from
the underlying cells and adds those three columns to the CSV.
Tracks grand totals across the run (wall_clock_s overall vs
grading-only, total_cells, total_tokens_input/output, variant
bucket counts). Writes a sibling <out>-meta.json next to the CSV
capturing all of that plus model / endpoint / config knobs.
research/classifier/train.py
New --collection-meta flag (with auto-discovery of <data>-meta.json
sitting next to the CSV) that embeds the JSON under
model.json#dataset_collection. Reviewers now see the data-collection
cost alongside the trained model.
research/classifier/measurements-meta.json (NEW, hand-recorded)
The current measurements.csv was collected before this audit
surface existed. This file records what we actually know from the
PR #1 description (gemma4:26b Q4_K_M, ctx=40000, 50 variants × 8
cells = 400 cells, multi-hour local Ollama run) and explicitly
nulls fields that were not captured (wall-clock seconds, total
LLM tokens). Future runs will populate every field automatically.
Includes a `training_cost` block making it clear that the
sklearn step itself uses zero LLM tokens.
research/classifier/model.json
Re-trained on measurements.csv; now embeds dataset_collection. RF
still wins with unchanged metrics (R² = 0.5827, MAE = 0.1386),
confirming the run is deterministic against a fixed seed.
research/classifier/DOGFOOD.md
"Ground rules" section says to commit the *-meta.json alongside
the CSV; new "Audit trail per run" section documents the two
artifacts the collector emits and how train.py picks the meta up.
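As an illustration of the token accounting described for run.py, a minimal sketch of reading the usage block from an OpenAI-compatible chat-completion response follows. `read_usage` is a hypothetical helper, not the function in this commit; the `usage.prompt_tokens` and `usage.completion_tokens` field names come from the OpenAI response format, and whether a given endpoint always fills them in is an assumption.

```python
import json


def read_usage(response_body):
    """Return (tokens_in, tokens_out) from an OpenAI-compatible response body.

    Both values are None when the server omits the usage block, which mirrors
    how offline cells leave tokens_in / tokens_out unset.
    """
    payload = json.loads(response_body)
    usage = payload.get("usage") or {}
    return usage.get("prompt_tokens"), usage.get("completion_tokens")
```

Returning `None` instead of 0 keeps "not measured" distinguishable from "zero tokens" when the per-variant totals are aggregated.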
Lint + 14 Python tests + 47 Node tests still green at HEAD.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
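The meta auto-discovery that this commit describes for train.py could be sketched as follows. This is an assumption-laden illustration: `discover_meta` and `embed_collection_meta` are hypothetical names, and the only behaviour taken from the commit message is the `<data>-meta.json` sibling naming and the `dataset_collection` key.

```python
import json
from pathlib import Path


def discover_meta(data_csv, explicit=None):
    """Return the meta JSON path: the explicit flag wins, else the sibling file."""
    if explicit is not None:
        return Path(explicit)
    csv_path = Path(data_csv)
    # measurements.csv -> measurements-meta.json in the same directory
    candidate = csv_path.with_name(f"{csv_path.stem}-meta.json")
    return candidate if candidate.exists() else None


def embed_collection_meta(model_info, meta_path):
    """Return a copy of model_info with the meta JSON under dataset_collection."""
    merged = dict(model_info)
    if meta_path is not None:
        text = Path(meta_path).read_text(encoding="utf-8")
        merged["dataset_collection"] = json.loads(text)
    return merged
```

When no meta file is found, the model info is returned unchanged, so CSVs collected before the audit surface existed still train.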
Summary
- Training data: `measurements.csv`, produced by `collect_measurements.py` against `gemma4:26b` (Q4_K_M, `num_ctx=40000`) running locally via Ollama. Each vault is graded on 8 cells (sizes 16K/32K × structured/shuffled × 2 needles).
- Pipeline robustness: `MMB_LLM_TIMEOUT_S` env var + retry-once on timeout in `replication/run.py`; per-variant CSV flush + per-variant try/except in `classifier/collect_measurements.py`.

Reference correlations across all 50 rows
| Factor | Correlation with `observed_loss` |
| --- | --- |
| `semantic_redundancy` | +0.56 |
| `token_volume` | +0.43 |
| `structural_coherence` | +0.30 |
| `distractor_density` | +0.21 |
| `freshness_spread` | −0.17 |

The positive sign on `structural_coherence` is notable: it's the first time the Chroma 2025 finding (structured haystacks underperform shuffled ones) is observed on real Markdown vaults graded by an actual LLM, not on synthetic NIAH.

`model.json` now records `grader_models` and `offline_rows` so every published model is auditable. ONNX round-trip max delta `1.6e-08`.

Test plan
- `train.py` runs end-to-end on `measurements.csv`, reports per-family metrics, exports ONNX, round-trip verified
- `collect_measurements.py` survives a single-cell timeout without aborting the run (verified: smoke3 originally crashed at variant 2; smoke5 with patches completed all 10/10)
- CSV is flushed after every variant (verified: `wc -l` grew incrementally)
- `model.json` includes `grader_models: ["gemma4-ctx:latest"]` and `offline_rows: 0`
- `model.onnx` integration into `@mnemoscope/core` via `onnxruntime-node`, so that `predict_rot` actually uses this model: out of scope for this PR

🤖 Generated with Claude Code