feat(classifier): ship model trained on real Gemma 4 26B measurements by toonight · Pull Request #1 · toonight/Mnemoscope

toonight · 2026-04-30T10:07:43Z

Summary

Replaces the 1000-row synthetic training source with a 50-row measurements.csv produced by collect_measurements.py against gemma4:26b (Q4_K_M, num_ctx=40000) running locally via Ollama. Each vault is graded on 8 cells (sizes 16K/32K × structured/shuffled × 2 needles).
Status moves 🟡 → 🟢. Random Forest wins on real data: R²=0.5827, MAE=0.1386 on a held-out 10-row split. Ridge collapses from 0.85 (synthetic) to 0.14 here — confirming the 5-factor → loss surface has interactions a linear model cannot capture.
Pipeline robustness fixes so multi-hour runs don't lose work: MMB_LLM_TIMEOUT_S env var + retry-once on timeout in replication/run.py; per-variant CSV flush + per-variant try/except in classifier/collect_measurements.py.
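The timeout-plus-retry behaviour described for replication/run.py can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `llm_timeout_s`, `fetch_with_retry`, and the injectable `opener` parameter are hypothetical names, and the 600 s default is taken from the commit message. Only the `MMB_LLM_TIMEOUT_S` variable and the retry-once policy come from the PR itself.

```python
import os
import urllib.error
import urllib.request

DEFAULT_TIMEOUT_S = 600.0  # default quoted in the commit message


def llm_timeout_s():
    """Read MMB_LLM_TIMEOUT_S; fall back to the default if unset or invalid."""
    raw = os.environ.get("MMB_LLM_TIMEOUT_S", "")
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_TIMEOUT_S
    return value if value > 0 else DEFAULT_TIMEOUT_S


def fetch_with_retry(request, opener=urllib.request.urlopen, attempts=2):
    """Call the endpoint, retrying once on timeout or URL errors.

    `opener` is a hypothetical injection point so the loop can be
    exercised without a running Ollama server.
    """
    last_err = None
    for _attempt in range(attempts):  # index unused on purpose
        try:
            with opener(request, timeout=llm_timeout_s()) as resp:
                return resp.read()
        except (TimeoutError, urllib.error.URLError) as err:
            last_err = err
    # Assumption of this sketch: surface exhaustion as an exception.
    raise RuntimeError(f"LLM call failed after {attempts} attempts: {last_err!r}")
```

Reading the timeout on every attempt keeps the sketch simple; a long-running collector could equally read it once at start-up.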

Reference correlations across all 50 rows

| Feature | r vs `observed_loss` | Note |
| --- | --- | --- |
| `semantic_redundancy` | +0.56 | |
| `token_volume` | +0.43 | |
| `structural_coherence` | +0.30 | Chroma's structuring effect, on a real model |
| `distractor_density` | +0.21 | |
| `freshness_spread` | −0.17 | no signal, as expected: the grader has no notion of mtime |

The positive sign on structural_coherence is notable — it's the first time the Chroma 2025 finding (structured haystacks underperform shuffled ones) is observed on real Markdown vaults graded by an actual LLM, not on synthetic NIAH.
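For readers who want to reproduce this kind of table, a per-feature Pearson r against `observed_loss` can be computed along these lines. The sketch assumes rows are dictionaries keyed by column name (as `csv.DictReader` would yield); `pearson_r` and `correlations` are illustrative helpers, not functions from the repository, and the sample rows in the usage note are invented toy values. A positive r means the feature rises together with loss, which is why the positive sign on `structural_coherence` reads as "more structure, worse retrieval".

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlations(rows, features, target="observed_loss"):
    """Map each feature name to its r against the target column."""
    ys = [float(row[target]) for row in rows]
    return {
        name: pearson_r([float(row[name]) for row in rows], ys)
        for name in features
    }
```

Usage, with made-up rows: `correlations([{"structural_coherence": 0, "observed_loss": 0.1}, {"structural_coherence": 1, "observed_loss": 0.2}, {"structural_coherence": 2, "observed_loss": 0.3}], ["structural_coherence"])` gives an r close to +1 for that toy data.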

model.json now records grader_models and offline_rows so every published model is auditable. ONNX round-trip max delta 1.6e-08.

Test plan

train.py runs end-to-end on measurements.csv, reports per-family metrics, exports ONNX, round-trip verified
collect_measurements.py survives a single-cell timeout without aborting the run (verified: smoke3 originally crashed at variant 2; smoke5 with patches completed all 10/10)
CSV flushes after every variant — verified on smoke5 (wc -l grew incrementally)
model.json includes grader_models: ["gemma4-ctx:latest"] and offline_rows: 0
(follow-up) wire model.onnx into @mnemoscope/core via onnxruntime-node so predict_rot actually uses this model — out of scope for this PR
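The two collector properties checked in this test plan (a failed variant does not abort the run, and the CSV grows after every variant) can be illustrated with a sketch like the one below. It is an assumption-laden outline rather than the code in collect_measurements.py: `collect`, `measure`, and the returned `failed` list are hypothetical names.

```python
import csv
import os


def collect(variants, measure, out_path, fieldnames):
    """Write one CSV row per variant, flushing after each one.

    `measure(variant)` is a hypothetical callable returning a dict row.
    Any exception it raises is recorded and the run moves on.
    """
    failed = []
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        fh.flush()
        for variant in variants:
            try:
                row = measure(variant)
            except Exception as exc:  # keep a multi-hour run alive
                failed.append((variant, repr(exc)))
                continue
            writer.writerow(row)
            fh.flush()             # push Python's buffer to the OS
            os.fsync(fh.fileno())  # ask the OS to persist it to disk
    return failed
```

Returning the failures instead of swallowing them lets the caller report which variants need a re-run.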

🤖 Generated with Claude Code

Replaces the 1000-row synthetic-large.csv training source with a 50-row measurements.csv produced by collect_measurements.py against gemma4:26b (num_ctx=40000) running locally via Ollama. Each vault is graded on 8 cells (sizes 16K/32K × structured/shuffled × 2 needles) at the middle position; observed_loss = 1 - accuracy.

Status moves 🟡 → 🟢. Random Forest wins on real data (R²=0.5827, MAE=0.1386 on a held-out 10-row split). Ridge collapses from 0.85 on the synthetic baseline to 0.14 here, confirming the 5-factor → loss surface has interactions a linear model cannot capture.

Reference correlations across all 50 rows:
- semantic_redundancy +0.56
- token_volume +0.43
- structural_coherence +0.30 (Chroma's structuring effect, on a real model)
- distractor_density +0.21
- freshness_spread −0.17 (no signal, as expected: the grader has no notion of mtime)

model.json now records grader_models and offline_rows so every published model is auditable. ONNX round-trip max delta 1.6e-08.

Pipeline robustness:
- replication/run.py: MMB_LLM_TIMEOUT_S env var (default 600s) + retry once on TimeoutError/URLError. Required for ≥32K-token prompts on a local 26B model where prompt_eval can exceed the prior 120s ceiling.
- collect_measurements.py: CSV flush after every variant + try/except per variant. A single failed variant no longer aborts a multi-hour run.
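The grading grid in this commit message (2 sizes × 2 layouts × 2 needles = 8 cells, with observed_loss = 1 - accuracy) works out as in the short sketch below. The cell keys and the `cell_grid` / `observed_loss` helpers are illustrative assumptions; the repository may represent cells differently.

```python
from itertools import product

SIZES = ("16K", "32K")
LAYOUTS = ("structured", "shuffled")
NEEDLES = (0, 1)  # two needles per size/layout combination


def cell_grid():
    """Enumerate the 8 (size, layout, needle) cells for one vault."""
    return list(product(SIZES, LAYOUTS, NEEDLES))


def observed_loss(results):
    """Return 1 - accuracy over the full grid.

    `results` maps each cell tuple to True (needle retrieved) or False.
    """
    cells = cell_grid()
    missing = [cell for cell in cells if cell not in results]
    if missing:
        raise ValueError(f"ungraded cells: {missing}")
    correct = sum(1 for cell in cells if results[cell])
    return 1 - correct / len(cells)
```

With 8 cells per vault, observed_loss can only take values in steps of 0.125, which is worth keeping in mind when reading an MAE of 0.1386.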

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bb235b6a66


chatgpt-codex-connector · 2026-04-30T10:10:00Z

+            last_err = e
+            continue
+    else:
+        return False, f"llm-error: {last_err!r}"


Raise retry-exhausted LLM errors instead of scoring them as misses

When both LLM attempts fail, grade_with_llm now returns False with an llm-error note, which is then treated by collect_measurements._measure_variant as a normal incorrect answer and folded into observed_loss. In runs with transient/network/auth/API failures (e.g., 429/5xx/invalid endpoint), this silently corrupts labels by inflating loss instead of skipping/failing the measurement, so the exported measurements.csv can look valid while containing transport errors as training targets.
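One way to act on this review comment is to keep transport failures out of band, so the collector can tell "the model answered wrongly" apart from "the model never answered". The sketch below is a hypothetical shape for that fix, not what the repository does: `LLMTransportError`, `measure_variant`, and `collect_rows` are invented names, and `grade` stands in for `grade_with_llm`.

```python
class LLMTransportError(RuntimeError):
    """Raised when every attempt to reach the grader failed."""


def measure_variant(cells, grade):
    """Return observed_loss for one variant.

    `grade(cell)` returns True/False for a real answer and raises
    LLMTransportError when no answer was obtained at all.
    """
    if not cells:
        raise ValueError("variant has no cells")
    correct = 0
    for cell in cells:
        correct += bool(grade(cell))  # transport errors propagate
    return 1 - correct / len(cells)


def collect_rows(variants, grade):
    """Skip variants with transport failures instead of labelling them."""
    rows, skipped = [], []
    for name, cells in variants.items():
        try:
            loss = measure_variant(cells, grade)
        except LLMTransportError as exc:
            skipped.append((name, str(exc)))
            continue
        rows.append({"variant": name, "observed_loss": loss})
    return rows, skipped
```

Skipping the whole variant is the conservative choice here; dropping only the failed cell would change the denominator of the accuracy and make rows less comparable.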


…ruff B007

The retry loop in `grade_with_llm` only branches on success/failure of the urlopen call; the `attempt` index is intentionally unused. Renaming to `_attempt` documents the intent and silences ruff's B007 finding, which was the lone remaining failure on the python CI job for this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…lestone

The PR #1 merge brings the classifier from a synthetic-only baseline to a 50-row dataset of real (signature, observed_loss) pairs graded by gemma4:26b. README's scientific-posture table now describes the calibrated state instead of the planned-calibration state, and the CHANGELOG's [Unreleased] section captures the data-collection win, the model-family ranking flip, and the pipeline-robustness fixes that shipped alongside the data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ens) in model.json

Walking the audit trail back from a published model.onnx, you used to hit a wall at the model.json's `data_source: measurements.csv` -- nothing said how that CSV had been produced, how long it took, or how many LLM tokens it cost. Adding that surface so future reviewers can verify reproducibility without reconstructing it from terminal logs.

What ships:

research/replication/run.py
  grade_with_llm now returns (correct, notes, tokens_in, tokens_out) by reading the OpenAI-compat usage block on the response. The Cell dataclass gains optional tokens_in / tokens_out fields. Offline cells leave them at None.

research/classifier/collect_measurements.py
  Aggregates per-variant wall_clock_ms + tokens_in + tokens_out from the underlying cells and adds those three columns to the CSV. Tracks grand totals across the run (wall_clock_s overall vs grading-only, total_cells, total_tokens_input/output, variant bucket counts). Writes a sibling <out>-meta.json next to the CSV capturing all of that plus model / endpoint / config knobs.

research/classifier/train.py
  New --collection-meta flag (with auto-discovery of <data>-meta.json sitting next to the CSV) that embeds the JSON under model.json#dataset_collection. Reviewers now see the data-collection cost alongside the trained model.

research/classifier/measurements-meta.json (NEW, hand-recorded)
  The current measurements.csv was collected before this audit surface existed. This file records what we actually know from the PR #1 description (gemma4:26b Q4_K_M, ctx=40000, 50 variants × 8 cells = 400 cells, multi-hour local Ollama run) and explicitly nulls fields that were not captured (wall-clock seconds, total LLM tokens). Future runs will populate every field automatically. Includes a `training_cost` block making it clear that the sklearn step itself uses zero LLM tokens.

research/classifier/model.json
  Re-trained on measurements.csv now embeds dataset_collection. RF still wins (R² = 0.5827, MAE = 0.1386), confirming the run is deterministic against a fixed seed.

research/classifier/DOGFOOD.md
  "Ground rules" section says to commit the *-meta.json alongside the CSV; new "Audit trail per run" section documents the two artifacts the collector emits and how train.py picks the meta up.

Lint + 14 Python tests + 47 Node tests still green at HEAD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
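The audit surface described in this commit has two mechanical pieces that are easy to sketch: pulling token counts out of an OpenAI-compatible `usage` block, and locating the sibling `-meta.json` file next to a CSV. The helper names below are hypothetical and the real functions in run.py and train.py may differ; the `prompt_tokens` / `completion_tokens` keys are the standard fields of an OpenAI-style response, and the file naming follows the measurements.csv → measurements-meta.json pairing named in the commit.

```python
import json
from pathlib import Path


def usage_tokens(response):
    """Return (tokens_in, tokens_out) from a parsed response dict.

    Either value is None when the server omitted the usage block.
    """
    usage = response.get("usage") or {}
    return usage.get("prompt_tokens"), usage.get("completion_tokens")


def sum_tokens(values):
    """Sum known counts; None if nothing was recorded (e.g. offline cells)."""
    known = [v for v in values if v is not None]
    return sum(known) if known else None


def meta_path_for(data_path):
    """measurements.csv -> measurements-meta.json in the same directory."""
    path = Path(data_path)
    return path.with_name(f"{path.stem}-meta.json")


def load_collection_meta(data_path):
    """Auto-discover the sibling meta file; None if it does not exist."""
    meta = meta_path_for(data_path)
    if not meta.is_file():
        return None
    return json.loads(meta.read_text(encoding="utf-8"))
```

Keeping missing counts as `None` rather than 0 matches the commit's stated intent of explicitly nulling fields that were never captured.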

chatgpt-codex-connector Bot reviewed Apr 30, 2026


toonight merged commit b687d9d into main Apr 30, 2026
2 checks passed

toonight deleted the feat/real-measurements-v0.3 branch April 30, 2026 10:14

toonight merged 2 commits into main from feat/real-measurements-v0.3
