lewm_hillclimb_guided: held-out test + task-specialization ceiling experiment #35

Open
cgpadwick wants to merge 16 commits into master from fix/lewm-hillclimb-holdout-test

Conversation

@cgpadwick

Owner

Adds a held-out test to the lewm_hillclimb_guided flow (so the hill-climb no longer reports the headline on the same episodes it selected on) and builds the task-specialization ceiling experiment on top of it. All in saage — le-wm is not modified.

The bug this fixes

The guided hill-climb selected the winning recipe (keep_or_revert over N iterations) and reported the final number on the same 50-episode eval (cube.yaml, seed 42). That's test-set selection bias — the same thing flows/greenfield_ml avoids with a val/test split.
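The size of this bias can be illustrated with a small simulation. This is a minimal sketch, not code from the flow: it assumes every candidate recipe has the same true success rate (a hypothetical 0.70), so any gap between the selected score and a fresh evaluation comes purely from picking the maximum of noisy 50-episode evals. The function names and the 2000-trial count are illustrative assumptions.

```python
import random


def eval_success_rate(true_p, n_episodes, rng):
    """Fraction of episodes that succeed when each succeeds with probability true_p."""
    return sum(rng.random() < true_p for _ in range(n_episodes)) / n_episodes


def simulate(true_p=0.70, n_iters=8, n_val=50, n_test=200, n_trials=2000, seed=0):
    """Compare the score kept by a keep-or-revert loop with a fresh held-out eval.

    Assumption: all candidates share the same true skill, so the loop has
    nothing real to find and any "improvement" is selection noise.
    """
    rng = random.Random(seed)
    selected, held_out = [], []
    for _ in range(n_trials):
        # Baseline plus n_iters candidates, each scored on n_val episodes.
        scores = [eval_success_rate(true_p, n_val, rng) for _ in range(n_iters + 1)]
        # Keep-or-revert retains the best score seen so far, i.e. the running max.
        selected.append(max(scores))
        # A fresh evaluation that played no part in the selection.
        held_out.append(eval_success_rate(true_p, n_test, rng))
    return sum(selected) / n_trials, sum(held_out) / n_trials


mean_selected, mean_held_out = simulate()
print(f"mean selected score: {mean_selected:.3f}")
print(f"mean held-out score: {mean_held_out:.3f}")
```

Under these assumptions the selected score sits well above the true rate while the held-out score stays close to it, which is the gap the held-out test is meant to remove from the headline.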

Held-out test (Option A — saage-only eval seeds)

Stock le-wm eval.py already samples eval episodes by seed, so:

  • selection evals (baseline + in-loop) → seed=val_seed (42), 50 episodes
  • headline evals → a different seed, seed=test_seed (1234), plus a larger eval.num_eval (200)

The headline is scored on an episode sample the selection loop never saw → essentially no selection bias, with zero changes to the third-party le-wm repo. (Same-pool draws can overlap by ~1–2 episodes — negligible vs the 50-episode sampling noise. We deliberately chose this over a guaranteed-disjoint split in le-wm to avoid maintaining a fork we'd never upstream.)
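The overlap argument can be checked with a short sketch. It assumes that eval episodes are drawn uniformly without replacement from a shared pool and that the draw is fully determined by the seed; how stock le-wm eval.py actually samples is not shown here, and `POOL_SIZE`, `sample_episodes`, and `expected_overlap` are hypothetical names chosen for illustration.

```python
import random

POOL_SIZE = 10_000  # hypothetical pool size; the real value is not stated in this PR


def sample_episodes(pool_size, num_eval, seed):
    """Draw num_eval distinct episode indices; the draw depends only on the seed."""
    return set(random.Random(seed).sample(range(pool_size), num_eval))


def expected_overlap(pool_size, n_val, n_test):
    """Expected number of shared episodes between two independent uniform draws.

    Each of the n_val episodes lands in the test draw with probability
    n_test / pool_size, so the expectation is n_val * n_test / pool_size.
    """
    return n_val * n_test / pool_size


val_episodes = sample_episodes(POOL_SIZE, 50, seed=42)      # selection evals
test_episodes = sample_episodes(POOL_SIZE, 200, seed=1234)  # headline evals

print("shared episodes in this draw:", len(val_episodes & test_episodes))
print("expected shared episodes:", expected_overlap(POOL_SIZE, 50, 200))
```

The expected overlap scales inversely with the pool size, so the "~1–2 episodes" figure corresponds to a pool in the thousands under this model.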

Specialization-ceiling experiment (approach B)

Measures how much per-task recipe tuning buys vs the paper's transferable recipe:

  • paper-recipe headline — retrain the paper recipe (pristine tree, runs first) at the paper's budget, score on the held-out test → reference low.
  • specialized headline — retrain the hill-climb winner at the same budget + same held-out test.
  • gain = specialized − paper (held-out), positioned against DINO-WM = 86.
  • gain_record.py computes it; the report skill leads with the layered headline + an honesty box (n=1 task, single seed, noisy 50-ep selection metric).
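The gain arithmetic above can be sketched as follows. This is not the contents of gain_record.py; it assumes scores are success rates in percent (as "DINO-WM = 86" suggests) and that a negative score means "not yet recorded", mirroring the -1.0 defaults in flow.yaml. The standard-error helper is an added illustration of why a single-seed gain should be read with caution.

```python
import math


def specialization_gain(specialized_score, paper_score):
    """Return specialized minus paper, both measured on the same held-out test.

    Assumption: a negative score is a sentinel for "eval has not run yet",
    so the gain is refused rather than computed from a placeholder.
    """
    if specialized_score < 0 or paper_score < 0:
        raise ValueError("both held-out scores must be recorded before computing the gain")
    return specialized_score - paper_score


def binomial_stderr(score_pct, n_episodes):
    """Standard error (in percentage points) of a success rate over n_episodes."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_episodes)


# Hypothetical numbers, for illustration only.
gain = specialization_gain(90.0, 86.0)
print(f"gain = {gain:+.1f} points")
print(f"stderr at 200 episodes = {binomial_stderr(90.0, 200):.1f} points")
```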

Files

  • flow.yaml — paper-recipe headline steps, val_seed/test_seed/test_num_eval/paper_test_score/specialization_gain vars, gain_record step.
  • gain_record.py (new) — deterministic gain helper.
  • clean_ckpt.py — allow lewm_cube_confirm (fixes a pre-existing latent bug) + lewm_cube_paper.
  • report/skill.md — layered headline, recipe-diff, honesty box.
  • propose/proposal_critic skills — forbid touching the held-out eval seed/size.
  • tests/test_clean_ckpt_guided, test_gain_record, test_lewm_specialization_flow.
  • docs/superpowers/{specs,plans}/2026-06-27-lewm-specialization* — design spec + implementation plan (with the Option A banner).
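The allow-list idea behind clean_ckpt.py can be sketched like this. It is a minimal illustration rather than the real script: the `clean_ckpt` function signature and the `root` argument are assumptions, and only the directory names come from this PR.

```python
import shutil
from pathlib import Path

# Explicit allow-list: the only directory names this helper may ever delete.
# lewm_cube_confirm is the entry this PR adds to the original three.
ALLOWED = {"lewm_cube_exp", "lewm_cube_best", "lewm_smoke", "lewm_cube_confirm"}


def clean_ckpt(root, name):
    """Delete root/name only if name is on the allow-list.

    Returns True if a directory was removed, False if it did not exist.
    Raises ValueError for any name that is not explicitly allowed, so a
    typo or an unexpected value can never delete a real checkpoint.
    """
    if name not in ALLOWED:
        raise ValueError(f"refusing to delete {name!r}: not in ALLOWED")
    target = Path(root) / name
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

Because the check is an exact membership test on the name, values containing path separators or `..` are rejected as well.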

Testing

Full suite 400 passed, 7 skipped (offline). The flow itself runs on GPU (spark-c0c0) — not exercised in CI; pilot run is the next step.

🤖 Generated with Claude Code

cgpadwick and others added 14 commits June 27, 2026 10:03
Hill-climb was selecting (keep_or_revert across 8 iterations) and reporting
the headline on the SAME eval set (cube.yaml, 50 episodes, seed 42) — test-set
selection bias, the issue flows/greenfield_ml avoids with --split val/test.

Now: baseline_eval + in-loop eval pass split=val (selection only); confirm_eval
passes split=test, evaluated once for the headline. val/test draw from disjoint
halves of the eval episodes (partitioned in le-wm eval.py by a fixed master seed).

- flow.yaml: split=val on selection evals, split=test on confirm; header +
  confirm_record log line relabeled val vs held-out test.
- report/skill.md: headline is now the held-out test score (confirm_score);
  ledger candidate/best scores relabeled as validation.

Requires the companion le-wm eval.py change (split knob).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Design for measuring how much success-rate headroom per-task recipe tuning
buys on OGBench-Cube vs the paper's transferable fixed recipe, positioned
against DINO-WM (86). Layered headline (paper-recipe vs specialized, same
budget + same held-out test), cheap val-50 selection / heavy held-out test
headline, recipe-knobs-at-fixed-budget search space. Pilot then full run on
spark-c0c0 (GB10). Builds on the held-out split work (this branch +
le-wm feat/eval-holdout-split).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
8 tasks: clean_ckpt ALLOWED fix, gain_record helper, approach-B flow wiring
(paper-recipe headline + heavy held-out test + gain), report skill layered
story, proposer forbidden-list clarification, le-wm split verification script,
spark-c0c0 sync, pilot+full run. TDD where unit-testable; explicit run/ops
steps elsewhere.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ew Headline

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t strictness, doc accuracy, DINO dedup)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…wm change (Option A)

Replace the le-wm eval.py val/test split with a saage-only approach: stock
eval.py already samples episodes by `seed`, so selection evals use seed=val_seed
(42) and the two headline evals use a different seed=test_seed (1234) + heavy
eval.num_eval. The headline is scored on an episode sample the selection loop
never saw — essentially no test-set selection bias — with no modification to the
third-party le-wm repo (which we were never going to upstream). Same-pool draws
can overlap by ~1-2 episodes, negligible vs 50-episode sampling noise.

- flow.yaml: val_seed/test_seed shared vars; 4 eval invocations switched from
  split=val/test to seed=; comments + confirm_record reworded.
- report skill, propose/proposal_critic skills: split -> eval-seed wording;
  honesty box notes the small-overlap caveat.
- test_lewm_specialization_flow: assert seed-based wiring, no split= keyword.

le-wm reverted to stock main (feat/eval-holdout-split branch dropped).
Full suite 400 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
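A structural check of the kind described in this commit could look like the sketch below. It is not the actual test file: the real flow.yaml schema is not shown in this PR, so the check works on raw text, and `check_holdout_wiring` and the sample fragments are hypothetical.

```python
import re


def check_holdout_wiring(flow_text):
    """Return a list of problems found in the flow text (empty list means OK).

    Assumptions: seeds are passed as seed=<var> and the old approach used a
    split=<val|test> keyword, as described in the commit message.
    """
    problems = []
    if re.search(r"\bsplit\s*=", flow_text):
        problems.append("found a split= keyword; evals should be wired by seed")
    for var in ("val_seed", "test_seed"):
        if var not in flow_text:
            problems.append(f"missing {var}")
    return problems


# Hypothetical fragments, for illustration only.
GOOD = "vars: {val_seed: 42, test_seed: 1234}\nrun: eval seed=val_seed\nrun: eval seed=test_seed\n"
BAD = "run: eval split=val\nrun: eval split=test\n"

print(check_holdout_wiring(GOOD))
print(check_holdout_wiring(BAD))
```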
Banner on the spec + plan: the held-out test no longer uses a le-wm eval.py
split; it uses val_seed/test_seed on stock eval.py. Task 6 (le-wm verify
script) obsolete; Tasks 7-8 need no le-wm sync. Authoritative impl: f36a6ee.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
a clear description of the winning experiment(s) — the configuration/approach and
key details that ended up working (read the two config YAMLs for the REAL winning
settings).
1. **Headline (one line, up front).** "Specializing LeWM on OGBench-Cube:

Owner Author

I don't think there's a need for any of this reference to the paper's best. This should be independent.


# explicit allow-list: the only directories this script may ever delete
ALLOWED = {"lewm_cube_exp", "lewm_cube_best", "lewm_smoke"}
ALLOWED = {"lewm_cube_exp", "lewm_cube_best", "lewm_smoke",

Owner Author

what are these directories and who creates them and why?

Comment thread contrib/lewm_hillclimb_guided/flow.yaml Outdated
confirm_epochs: 10 # final confirmation run at the paper's budget
confirm_name: lewm_cube_confirm
confirm_score: -1.0
paper_test_score: -1.0 # paper-recipe headline on the held-out TEST eval

Owner Author

paper_test_score shouldn't be anywhere in this code; get rid of all instances of it

Comment thread contrib/lewm_hillclimb_guided/flow.yaml Outdated
# -> a sample the selection loop never scored
test_num_eval: 200 # held-out TEST size for the headline evals (heavy);
# in-loop val stays at cube.yaml's 50 (cheap selection)
paper_name: lewm_cube_paper # checkpoint dir for the paper-recipe headline retrain

Owner Author

get rid of this

Comment thread contrib/lewm_hillclimb_guided/flow.yaml Outdated
test_num_eval: 200 # held-out TEST size for the headline evals (heavy);
# in-loop val stays at cube.yaml's 50 (cheap selection)
paper_name: lewm_cube_paper # checkpoint dir for the paper-recipe headline retrain
specialization_gain: -999.0 # seeded so the key always exists; set by gain_record

Owner Author

get rid of this

@@ -0,0 +1,34 @@
#!/usr/bin/env python3

Owner Author

kill this

@@ -0,0 +1,622 @@
# LeWM Task-Specialization Ceiling — Implementation Plan

Owner Author

delete this file

@@ -0,0 +1,206 @@
# LeWM Task-Specialization Ceiling — Experiment Design

Owner Author

delete this file

Comment thread tests/test_gain_record.py Outdated
@@ -0,0 +1,32 @@
"""gain_record helper: computes specialized-minus-paper gain, logs it."""

Owner Author

delete

Comment thread tests/test_lewm_specialization_flow.py Outdated
@@ -0,0 +1,58 @@
"""Structural checks on the specialization flow.yaml."""
from pathlib import Path

Owner Author

update after changes

cgpadwick and others added 2 commits June 28, 2026 19:07
…, keep held-out test only

Per PR #35 review: the flow should stand on its own numbers, not a
paper-recipe-vs-specialized comparison. Remove the entire approach-B layer and
keep only the held-out test bug fix.

Removed:
- paper-recipe headline steps (paper_headline_clean/train/eval) + paper_test_score,
  paper_name shared vars
- gain_record step + gain_record.py + test_gain_record.py + specialization_gain var
- lewm_cube_paper from clean_ckpt ALLOWED
- the spec + plan docs (specialization-ceiling experiment)
- report skill: paper-recipe / DINO-WM / gain framing -> report is now independent
  (run's own best-val, held-out-test final, vs target)

Kept (the actual fix): selection evals run seed=val_seed (42); the final
confirm eval runs seed=test_seed (1234) + heavy eval.num_eval, so the reported
number is on a sample selection never scored. le-wm unchanged.

Renamed the flow test to test_lewm_holdout_flow.py (asserts seed wiring + that the
removed experiment bits stay gone). Full suite 397 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ument clean_ckpt dirs

Remove paper/DINO-WM/PLDM/random comparison numbers from the task prose and the
research-log header so the run stands on its own metric. Add a comment in
clean_ckpt explaining the ALLOWED dirs are saage's own per-run checkpoint dirs
(created by train.py/eval.py), kept apart from the user's real checkpoints.