lewm_hillclimb_guided: held-out test + task-specialization ceiling experiment by cgpadwick · Pull Request #35 · cgpadwick/saage

cgpadwick · 2026-06-29T01:51:12Z

Adds a held-out test to the lewm_hillclimb_guided flow (so the hill-climb no longer reports the headline on the same episodes it selected on) and builds the task-specialization ceiling experiment on top of it. All in saage — le-wm is not modified.

The bug this fixes

The guided hill-climb selected the winning recipe (keep_or_revert over N iterations) and reported the final number on the same 50-episode eval (cube.yaml, seed 42). That's test-set selection bias — the same thing flows/greenfield_ml avoids with a val/test split.
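To see why reporting on the selection episodes inflates the headline, here is a minimal simulation sketch. It assumes every candidate recipe has the same true success rate (`TRUE_RATE`, a made-up value) and that keep_or_revert amounts to keeping the best score seen across iterations. Neither assumption is taken from the saage code, and the constants are illustrative only.

```python
import random

TRUE_RATE = 0.6   # hypothetical true success rate, identical for every candidate
N_ITERS = 8       # hill-climb iterations, one noisy selection eval each
N_VAL = 50        # episodes in the selection eval
N_TEST = 200      # episodes in the held-out headline eval
N_TRIALS = 2000   # independent simulated hill-climbs


def eval_score(rng, n_episodes, rate=TRUE_RATE):
    """Fraction of successes over n_episodes independent Bernoulli(rate) episodes."""
    return sum(rng.random() < rate for _ in range(n_episodes)) / n_episodes


def simulate(seed=0):
    """Return (mean best-of-N selection score, mean held-out score) over all trials."""
    rng = random.Random(seed)
    selected_total = 0.0
    heldout_total = 0.0
    for _ in range(N_TRIALS):
        # Selection: keep the best of N_ITERS noisy scores on the small eval.
        selected_total += max(eval_score(rng, N_VAL) for _ in range(N_ITERS))
        # Headline: score the winner once on episodes the selection never saw.
        heldout_total += eval_score(rng, N_TEST)
    return selected_total / N_TRIALS, heldout_total / N_TRIALS


biased, heldout = simulate()
print(f"mean selected score: {biased:.3f}")
print(f"mean held-out score: {heldout:.3f}")
```

Even though no candidate is actually better than another here, the best-of-N selection score sits well above the true rate, while the single held-out eval stays close to it.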

Held-out test (Option A — saage-only eval seeds)

Stock le-wm eval.py already samples eval episodes by seed, so:

selection evals (baseline + in-loop) → seed=val_seed (42), 50 episodes
headline evals → a different seed=test_seed (1234) + heavy eval.num_eval (200)

The headline is scored on an episode sample the selection loop never saw → essentially no selection bias, with zero changes to the third-party le-wm repo. (Same-pool draws can overlap by ~1–2 episodes — negligible vs the 50-episode sampling noise. We deliberately chose this over a guaranteed-disjoint split in le-wm to avoid maintaining a fork we'd never upstream.)
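The overlap caveat can be sanity-checked with a short sketch. It assumes both evals draw episodes uniformly without replacement from one shared pool, in which case the expected overlap is `n_val * n_test / pool_size`. The pool size below is a hypothetical placeholder (the real value depends on the dataset), and Python's `random` stands in for whatever sampler stock eval.py actually uses.

```python
import random

POOL_SIZE = 10_000  # hypothetical number of candidate eval episodes, not from the source


def expected_overlap(pool_size, n_val, n_test):
    """Expected count of shared episodes between two independent uniform draws."""
    return n_val * n_test / pool_size


def sampled_overlap(pool_size, n_val, n_test, val_seed=42, test_seed=1234):
    """Overlap of one concrete pair of seeded draws (illustrative sampler only)."""
    val = set(random.Random(val_seed).sample(range(pool_size), n_val))
    test = set(random.Random(test_seed).sample(range(pool_size), n_test))
    return len(val & test)


print("expected overlap:", expected_overlap(POOL_SIZE, 50, 200))
print("one sampled overlap:", sampled_overlap(POOL_SIZE, 50, 200))
```

A smaller pool raises the expected overlap proportionally, so the "negligible" claim is worth re-checking against the actual pool size before relying on it.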

Specialization-ceiling experiment (approach B)

Measures how much per-task recipe tuning buys vs the paper's transferable recipe:

paper-recipe headline — retrain the paper recipe (pristine tree, runs first) at the paper's budget, score on the held-out test → reference low.
specialized headline — retrain the hill-climb winner at the same budget + same held-out test.
gain = specialized − paper (held-out), positioned against DINO-WM = 86.
gain_record.py computes it; the report skill leads with the layered headline + an honesty box (n=1 task, single seed, noisy 50-ep selection metric).
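The gain computation described above could look roughly like the sketch below. This is not the actual gain_record.py. The function name and the refusal to run on unset scores are assumptions, motivated only by flow.yaml seeding its score variables with -1.0.

```python
UNSET = -1.0  # flow.yaml seeds score vars with -1.0 until an eval fills them in


def specialization_gain(specialized_score, paper_score):
    """Return specialized minus paper, refusing to run on unset (negative) scores."""
    for label, score in (("specialized", specialized_score), ("paper", paper_score)):
        if score < 0:
            raise ValueError(f"{label} score is unset: {score}")
    return specialized_score - paper_score


# Made-up scores, purely to show the call shape.
print(specialization_gain(72.5, 70.0))
```

Failing loudly on an unset score avoids logging a "gain" computed from a placeholder when one of the headline evals did not run.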

Files

flow.yaml — paper-recipe headline steps, val_seed/test_seed/test_num_eval/paper_test_score/specialization_gain vars, gain_record step.
gain_record.py (new) — deterministic gain helper.
clean_ckpt.py — allow lewm_cube_confirm (fixes a pre-existing latent bug) + lewm_cube_paper.
report/skill.md — layered headline, recipe-diff, honesty box.
propose/proposal_critic skills — forbid touching the held-out eval seed/size.
tests/ — test_clean_ckpt_guided, test_gain_record, test_lewm_specialization_flow.
docs/superpowers/{specs,plans}/2026-06-27-lewm-specialization* — design spec + implementation plan (with the Option A banner).

Testing

Full suite 400 passed, 7 skipped (offline). The flow itself runs on GPU (spark-c0c0) — not exercised in CI; pilot run is the next step.

🤖 Generated with Claude Code

Hill-climb was selecting (keep_or_revert across 8 iterations) and reporting the headline on the SAME eval set (cube.yaml, 50 episodes, seed 42): test-set selection bias, the issue flows/greenfield_ml avoids with --split val/test.

Now: baseline_eval + in-loop eval pass split=val (selection only); confirm_eval passes split=test, evaluated once for the headline. val/test draw from disjoint halves of the eval episodes (partitioned in le-wm eval.py by a fixed master seed).

- flow.yaml: split=val on selection evals, split=test on confirm; header + confirm_record log line relabeled val vs held-out test.
- report/skill.md: headline is now the held-out test score (confirm_score); ledger candidate/best scores relabeled as validation.

Requires the companion le-wm eval.py change (split knob).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Design for measuring how much success-rate headroom per-task recipe tuning buys on OGBench-Cube vs the paper's transferable fixed recipe, positioned against DINO-WM (86). Layered headline (paper-recipe vs specialized, same budget + same held-out test), cheap val-50 selection / heavy held-out test headline, recipe-knobs-at-fixed-budget search space. Pilot then full run on spark-c0c0 (GB10). Builds on the held-out split work (this branch + le-wm feat/eval-holdout-split).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

8 tasks: clean_ckpt ALLOWED fix, gain_record helper, approach-B flow wiring (paper-recipe headline + heavy held-out test + gain), report skill layered story, proposer forbidden-list clarification, le-wm split verification script, spark-c0c0 sync, pilot+full run. TDD where unit-testable; explicit run/ops steps elsewhere.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…wm change (Option A)

Replace the le-wm eval.py val/test split with a saage-only approach: stock eval.py already samples episodes by `seed`, so selection evals use seed=val_seed (42) and the two headline evals use a different seed=test_seed (1234) + heavy eval.num_eval. The headline is scored on an episode sample the selection loop never saw (essentially no test-set selection bias), with no modification to the third-party le-wm repo (which we were never going to upstream). Same-pool draws can overlap by ~1-2 episodes, negligible vs 50-episode sampling noise.

- flow.yaml: val_seed/test_seed shared vars; 4 eval invocations switched from split=val/test to seed=; comments + confirm_record reworded.
- report skill, propose/proposal_critic skills: split -> eval-seed wording; honesty box notes the small-overlap caveat.
- test_lewm_specialization_flow: assert seed-based wiring, no split= keyword.

le-wm reverted to stock main (feat/eval-holdout-split branch dropped). Full suite 400 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Banner on the spec + plan: the held-out test no longer uses a le-wm eval.py split; it uses val_seed/test_seed on stock eval.py. Task 6 (le-wm verify script) obsolete; Tasks 7-8 need no le-wm sync. Authoritative impl: f36a6ee.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cgpadwick · 2026-06-29T01:53:26Z

-   a clear description of the winning experiment(s) — the configuration/approach and
-   key details that ended up working (read the two config YAMLs for the REAL winning
-   settings).
+1. **Headline (one line, up front).** "Specializing LeWM on OGBench-Cube:


I don't think any of these references to the paper's best are needed. This should be independent.

cgpadwick · 2026-06-29T01:54:19Z


 # explicit allow-list: the only directories this script may ever delete
-ALLOWED = {"lewm_cube_exp", "lewm_cube_best", "lewm_smoke"}
+ALLOWED = {"lewm_cube_exp", "lewm_cube_best", "lewm_smoke",


What are these directories, who creates them, and why?

cgpadwick · 2026-06-29T01:55:05Z

  confirm_epochs: 10             # final confirmation run at the paper's budget
  confirm_name: lewm_cube_confirm
  confirm_score: -1.0
+  paper_test_score: -1.0         # paper-recipe headline on the held-out TEST eval


paper_test_score shouldn't be anywhere in this code. Get rid of all instances of it.

cgpadwick · 2026-06-29T01:55:28Z

+                                 # -> a sample the selection loop never scored
+  test_num_eval: 200             # held-out TEST size for the headline evals (heavy);
+                                 # in-loop val stays at cube.yaml's 50 (cheap selection)
+  paper_name: lewm_cube_paper    # checkpoint dir for the paper-recipe headline retrain


get rid of this

cgpadwick · 2026-06-29T01:55:38Z

+  test_num_eval: 200             # held-out TEST size for the headline evals (heavy);
+                                 # in-loop val stays at cube.yaml's 50 (cheap selection)
+  paper_name: lewm_cube_paper    # checkpoint dir for the paper-recipe headline retrain
+  specialization_gain: -999.0    # seeded so the key always exists; set by gain_record


get rid of this

cgpadwick · 2026-06-29T01:57:07Z

@@ -0,0 +1,34 @@
+#!/usr/bin/env python3


cgpadwick · 2026-06-29T01:57:29Z

@@ -0,0 +1,622 @@
+# LeWM Task-Specialization Ceiling — Implementation Plan


delete this file

cgpadwick · 2026-06-29T01:57:54Z

@@ -0,0 +1,206 @@
+# LeWM Task-Specialization Ceiling — Experiment Design


delete this file

cgpadwick · 2026-06-29T01:58:12Z

@@ -0,0 +1,32 @@
+"""gain_record helper: computes specialized-minus-paper gain, logs it."""


cgpadwick · 2026-06-29T01:58:23Z

@@ -0,0 +1,58 @@
+"""Structural checks on the specialization flow.yaml."""
+from pathlib import Path


update after changes

…, keep held-out test only

Per PR #35 review: the flow should stand on its own numbers, not a paper-recipe-vs-specialized comparison. Remove the entire approach-B layer and keep only the held-out test bug fix.

Removed:
- paper-recipe headline steps (paper_headline_clean/train/eval) + paper_test_score, paper_name shared vars
- gain_record step + gain_record.py + test_gain_record.py + specialization_gain var
- lewm_cube_paper from clean_ckpt ALLOWED
- the spec + plan docs (specialization-ceiling experiment)
- report skill: paper-recipe / DINO-WM / gain framing -> report is now independent (run's own best-val, held-out-test final, vs target)

Kept (the actual fix): selection evals run seed=val_seed (42); the final confirm eval runs seed=test_seed (1234) + heavy eval.num_eval, so the reported number is on a sample selection never scored. le-wm unchanged.

Renamed the flow test to test_lewm_holdout_flow.py (asserts seed wiring + that the removed experiment bits stay gone). Full suite 397 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ument clean_ckpt dirs

Remove paper/DINO-WM/PLDM/random comparison numbers from the task prose and the research-log header so the run stands on its own metric. Add a comment in clean_ckpt explaining the ALLOWED dirs are saage's own per-run checkpoint dirs (created by train.py/eval.py), kept apart from the user's real checkpoints.
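The allow-list guard discussed in the review can be sketched as follows. The `ALLOWED` names come from the diff and commit messages above (with lewm_cube_paper dropped again). The `clean_ckpt` function name, its signature, and its return value are assumptions for illustration, not the actual script.

```python
import shutil
from pathlib import Path

# Explicit allow-list: the only per-run checkpoint dirs this helper may delete.
ALLOWED = {"lewm_cube_exp", "lewm_cube_best", "lewm_smoke", "lewm_cube_confirm"}


def clean_ckpt(root, name):
    """Delete root/name only if name is allow-listed. Returns True if a dir was removed."""
    if name not in ALLOWED:
        # Anything else (including paths like "../x") is refused outright.
        raise ValueError(f"refusing to delete non-allow-listed dir: {name!r}")
    target = Path(root) / name
    if not target.is_dir():
        return False  # nothing to clean, e.g. first run
    shutil.rmtree(target)
    return True
```

Matching on the bare directory name, rather than on a resolved path, keeps the check simple: a caller cannot reach a user's real checkpoints by passing a different name or a relative path.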

cgpadwick and others added 14 commits June 27, 2026 10:03

fix(lewm_hillclimb_guided): allow confirm + paper checkpoint dirs in clean_ckpt

e894a95

feat(lewm_hillclimb_guided): gain_record helper for specialization gain

ea40bea

feat(lewm_hillclimb_guided): symmetric paper-recipe headline + gain (approach B)

552b6f2

docs(lewm_hillclimb_guided): report skill tells the layered specialization story

2cb04b5

docs(lewm_hillclimb_guided): de-duplicate report Outcome section vs new Headline

2d80ab6

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs(lewm_hillclimb_guided): forbid tuning the held-out test set

c8c88d0

chore(lewm_hillclimb_guided): final-review minors (unused import, test strictness, doc accuracy, DINO dedup)

60ad64b

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test(lewm_hillclimb_guided): assert in-loop eval also uses val seed (Option A review minor)

de99a37

chore: drop stray files accidentally committed (demo note.md, worktree diff dump)

28f675f


cgpadwick and others added 2 commits June 28, 2026 19:07

cgpadwick wants to merge 16 commits into master from fix/lewm-hillclimb-holdout-test
