Skip to content

feat(classifier): ship model trained on real Gemma 4 26B measurements#1

Merged
toonight merged 2 commits into
mainfrom
feat/real-measurements-v0.3
Apr 30, 2026
Merged

feat(classifier): ship model trained on real Gemma 4 26B measurements#1
toonight merged 2 commits into
mainfrom
feat/real-measurements-v0.3

Conversation

@toonight
Copy link
Copy Markdown
Owner

Summary

  • Replaces the 1000-row synthetic training source with a 50-row measurements.csv produced by collect_measurements.py against gemma4:26b (Q4_K_M, num_ctx=40000) running locally via Ollama. Each vault is graded on 8 cells (sizes 16K/32K × structured/shuffled × 2 needles).
  • Status moves 🟡 → 🟢. Random Forest wins on real data: R²=0.5827, MAE=0.1386 on a held-out 10-row split. Ridge collapses from 0.85 (synthetic) to 0.14 here — confirming the 5-factor → loss surface has interactions a linear model cannot capture.
  • Pipeline robustness fixes so multi-hour runs don't lose work: MMB_LLM_TIMEOUT_S env var + retry-once on timeout in replication/run.py; per-variant CSV flush + per-variant try/except in classifier/collect_measurements.py.

Reference correlations across all 50 rows

Feature r vs observed_loss
semantic_redundancy +0.56
token_volume +0.43
structural_coherence +0.30 (Chroma's structuring effect, on a real model)
distractor_density +0.21
freshness_spread −0.17 (no signal — the grader has no notion of mtime, expected)

The positive sign on structural_coherence is notable — it's the first time the Chroma 2025 finding (structured haystacks underperform shuffled ones) is observed on real Markdown vaults graded by an actual LLM, not on synthetic NIAH.

model.json now records grader_models and offline_rows so every published model is auditable. ONNX round-trip max delta 1.6e-08.

Test plan

  • train.py runs end-to-end on measurements.csv, reports per-family metrics, exports ONNX, round-trip verified
  • collect_measurements.py survives a single-cell timeout without aborting the run (verified: smoke3 originally crashed at variant 2; smoke5 with patches completed all 10/10)
  • CSV flushes after every variant — verified on smoke5 (wc -l grew incrementally)
  • model.json includes grader_models: ["gemma4-ctx:latest"] and offline_rows: 0
  • (follow-up) wire model.onnx into @mnemoscope/core via onnxruntime-node so predict_rot actually uses this model — out of scope for this PR

🤖 Generated with Claude Code

Replaces the 1000-row synthetic-large.csv training source with a 50-row
measurements.csv produced by collect_measurements.py against gemma4:26b
(num_ctx=40000) running locally via Ollama. Each vault is graded on 8
cells (sizes 16K/32K × structured/shuffled × 2 needles) at the middle
position; observed_loss = 1 - accuracy.

Status moves 🟡 → 🟢. Random Forest wins on real data
(R²=0.5827, MAE=0.1386 on a held-out 10-row split) — Ridge collapses
from 0.85 on the synthetic baseline to 0.14 here, confirming the 5-factor
→ loss surface has interactions a linear model cannot capture. Reference
correlations across all 50 rows: semantic_redundancy +0.56,
token_volume +0.43, structural_coherence +0.30 (Chroma's structuring
effect, on a real model), distractor_density +0.21, freshness_spread
−0.17 (no signal — the grader has no notion of mtime, expected).

model.json now records grader_models and offline_rows so every published
model is auditable. ONNX round-trip max delta 1.6e-08.

Pipeline robustness:
- replication/run.py: MMB_LLM_TIMEOUT_S env var (default 600s) + retry
  once on TimeoutError/URLError. Required for ≥32K-token prompts on a
  local 26B model where prompt_eval can exceed the prior 120s ceiling.
- collect_measurements.py: CSV flush after every variant + try/except
  per variant. A single failed variant no longer aborts a multi-hour run.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bb235b6a66

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

last_err = e
continue
else:
return False, f"llm-error: {last_err!r}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Raise retry-exhausted LLM errors instead of scoring them as misses

When both LLM attempts fail, grade_with_llm now returns False with an llm-error note, which is then treated by collect_measurements._measure_variant as a normal incorrect answer and folded into observed_loss. In runs with transient/network/auth/API failures (e.g., 429/5xx/invalid endpoint), this silently corrupts labels by inflating loss instead of skipping/failing the measurement, so the exported measurements.csv can look valid while containing transport errors as training targets.

Useful? React with 👍 / 👎.

…ruff B007

The retry loop in `grade_with_llm` only branches on success/failure of
the urlopen call; the `attempt` index is intentionally unused. Renaming
to `_attempt` documents the intent and silences ruff's B007 finding,
which was the lone remaining failure on the python CI job for this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@toonight toonight merged commit b687d9d into main Apr 30, 2026
2 checks passed
@toonight toonight deleted the feat/real-measurements-v0.3 branch April 30, 2026 10:14
toonight pushed a commit that referenced this pull request Apr 30, 2026
…lestone

The PR #1 merge brings the classifier from a synthetic-only baseline
to a 50-row dataset of real (signature, observed_loss) pairs graded by
gemma4:26b. README's scientific-posture table now describes the
calibrated state instead of the planned-calibration state, and the
CHANGELOG's [Unreleased] section captures the data-collection win, the
model-family ranking flip, and the pipeline-robustness fixes that
shipped alongside the data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
toonight pushed a commit that referenced this pull request Apr 30, 2026
…ens) in model.json

Walking the audit trail back from a published model.onnx, you used to
hit a wall at the model.json's `data_source: measurements.csv` --
nothing said how that CSV had been produced, how long it took, or how
many LLM tokens it cost. Adding that surface so future reviewers can
verify reproducibility without reconstructing it from terminal logs.

What ships:

  research/replication/run.py
    grade_with_llm now returns (correct, notes, tokens_in, tokens_out)
    by reading the OpenAI-compat usage block on the response. The Cell
    dataclass gains optional tokens_in / tokens_out fields. Offline
    cells leave them at None.

  research/classifier/collect_measurements.py
    Aggregates per-variant wall_clock_ms + tokens_in + tokens_out from
    the underlying cells and adds those three columns to the CSV.
    Tracks grand totals across the run (wall_clock_s overall vs
    grading-only, total_cells, total_tokens_input/output, variant
    bucket counts). Writes a sibling <out>-meta.json next to the CSV
    capturing all of that plus model / endpoint / config knobs.

  research/classifier/train.py
    New --collection-meta flag (with auto-discovery of <data>-meta.json
    sitting next to the CSV) that embeds the JSON under
    model.json#dataset_collection. Reviewers now see the data-collection
    cost alongside the trained model.

  research/classifier/measurements-meta.json (NEW, hand-recorded)
    The current measurements.csv was collected before this audit
    surface existed. This file records what we actually know from the
    PR #1 description (gemma4:26b Q4_K_M, ctx=40000, 50 variants × 8
    cells = 400 cells, multi-hour local Ollama run) and explicitly
    nulls fields that were not captured (wall-clock seconds, total
    LLM tokens). Future runs will populate every field automatically.
    Includes a `training_cost` block making it clear that the
    sklearn step itself uses zero LLM tokens.

  research/classifier/model.json
    Re-trained on measurements.csv now embeds dataset_collection. RF
    still wins (R² = 0.5827, MAE = 0.1386), confirming the run is
    deterministic against a fixed seed.

  research/classifier/DOGFOOD.md
    "Ground rules" section says to commit the *-meta.json alongside
    the CSV; new "Audit trail per run" section documents the two
    artifacts the collector emits and how train.py picks the meta up.

Lint + 14 Python tests + 47 Node tests still green at HEAD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant