lewm_hillclimb_guided: held-out test + task-specialization ceiling experiment #35
Open
cgpadwick wants to merge 16 commits into
Hill-climb was selecting (keep_or_revert across 8 iterations) and reporting the headline on the SAME eval set (cube.yaml, 50 episodes, seed 42): test-set selection bias, the issue flows/greenfield_ml avoids with --split val/test.

Now: baseline_eval + in-loop eval pass split=val (selection only); confirm_eval passes split=test, evaluated once for the headline. val/test draw from disjoint halves of the eval episodes (partitioned in le-wm eval.py by a fixed master seed).

- flow.yaml: split=val on selection evals, split=test on confirm; header + confirm_record log line relabeled val vs held-out test.
- report/skill.md: headline is now the held-out test score (confirm_score); ledger candidate/best scores relabeled as validation.

Requires the companion le-wm eval.py change (split knob).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
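A minimal sketch of the disjoint-halves idea this commit describes, not the actual le-wm eval.py change: one fixed master seed defines a permutation of the eval episodes, and val and test take opposite halves of it. The function name `split_episodes` and the `master_seed` parameter are hypothetical.

```python
import random


def split_episodes(episode_ids, split, master_seed=0):
    """Return the val or test half of episode_ids; the two halves never overlap."""
    if split not in ("val", "test"):
        raise ValueError(f"unknown split: {split!r}")
    ids = sorted(episode_ids)                # canonical order before shuffling
    random.Random(master_seed).shuffle(ids)  # same permutation on every call
    half = len(ids) // 2
    return ids[:half] if split == "val" else ids[half:]


val = split_episodes(range(100), "val")
test = split_episodes(range(100), "test")
print(len(val), len(test), len(set(val) & set(test)))
```

Because both calls rebuild the same permutation from `master_seed`, the halves are disjoint by construction and together cover every episode.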
Design for measuring how much success-rate headroom per-task recipe tuning buys on OGBench-Cube vs the paper's transferable fixed recipe, positioned against DINO-WM (86). Layered headline (paper-recipe vs specialized, same budget + same held-out test), cheap val-50 selection / heavy held-out test headline, recipe-knobs-at-fixed-budget search space. Pilot then full run on spark-c0c0 (GB10). Builds on the held-out split work (this branch + le-wm feat/eval-holdout-split). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
8 tasks: clean_ckpt ALLOWED fix, gain_record helper, approach-B flow wiring (paper-recipe headline + heavy held-out test + gain), report skill layered story, proposer forbidden-list clarification, le-wm split verification script, spark-c0c0 sync, pilot+full run. TDD where unit-testable; explicit run/ops steps elsewhere. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ew Headline Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t strictness, doc accuracy, DINO dedup) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…wm change (Option A)

Replace the le-wm eval.py val/test split with a saage-only approach: stock eval.py already samples episodes by `seed`, so selection evals use seed=val_seed (42) and the two headline evals use a different seed=test_seed (1234) + heavy eval.num_eval. The headline is scored on an episode sample the selection loop never saw (essentially no test-set selection bias), with no modification to the third-party le-wm repo (which we were never going to upstream). Same-pool draws can overlap by ~1-2 episodes, negligible vs 50-episode sampling noise.

- flow.yaml: val_seed/test_seed shared vars; 4 eval invocations switched from split=val/test to seed=; comments + confirm_record reworded.
- report skill, propose/proposal_critic skills: split -> eval-seed wording; honesty box notes the small-overlap caveat.
- test_lewm_specialization_flow: assert seed-based wiring, no split= keyword.

le-wm reverted to stock main (feat/eval-holdout-split branch dropped). Full suite 400 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
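The seed-wiring assertion mentioned in this commit could look roughly like the sketch below. It is an illustration under stated assumptions, not the real `test_lewm_specialization_flow`: the `EVAL_STEPS` dict is a hypothetical stand-in for the four eval invocations in flow.yaml (whose actual schema is not shown here), and `loop_eval` is an invented name for the in-loop eval step.

```python
VAL_SEED, TEST_SEED, TEST_NUM_EVAL = 42, 1234, 200

# Hypothetical stand-in for the eval steps and their overrides in flow.yaml.
EVAL_STEPS = {
    "baseline_eval": [f"seed={VAL_SEED}"],
    "loop_eval": [f"seed={VAL_SEED}"],
    "paper_headline_eval": [f"seed={TEST_SEED}", f"eval.num_eval={TEST_NUM_EVAL}"],
    "confirm_eval": [f"seed={TEST_SEED}", f"eval.num_eval={TEST_NUM_EVAL}"],
}
HEADLINE_STEPS = {"paper_headline_eval", "confirm_eval"}


def check_seed_wiring(steps, headline_steps):
    """Return a list of wiring problems; an empty list means the wiring is sound."""
    problems = []
    for name, overrides in steps.items():
        # The old split= knob must be gone everywhere.
        if any(o.startswith("split=") for o in overrides):
            problems.append(f"{name}: leftover split= override")
        # Selection steps use the val seed, headline steps the test seed.
        want = TEST_SEED if name in headline_steps else VAL_SEED
        if f"seed={want}" not in overrides:
            problems.append(f"{name}: expected seed={want}")
        # Headline steps also need the heavy eval size.
        if name in headline_steps and f"eval.num_eval={TEST_NUM_EVAL}" not in overrides:
            problems.append(f"{name}: missing heavy eval.num_eval")
    return problems


print(check_seed_wiring(EVAL_STEPS, HEADLINE_STEPS))
```

Returning a list of problems rather than raising on the first one makes a failing structural test report every miswired step at once.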
Banner on the spec + plan: the held-out test no longer uses a le-wm eval.py split; it uses val_seed/test_seed on stock eval.py. Task 6 (le-wm verify script) obsolete; Tasks 7-8 need no le-wm sync. Authoritative impl: f36a6ee. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Option A review minor)
cgpadwick
commented
Jun 29, 2026
a clear description of the winning experiment(s) - the configuration/approach and
key details that ended up working (read the two config YAMLs for the REAL winning
settings).
1. **Headline (one line, up front).** "Specializing LeWM on OGBench-Cube:
Owner
Author
i don't think there's a need for any of this reference to the paper's best. this should be independent.
  # explicit allow-list: the only directories this script may ever delete
- ALLOWED = {"lewm_cube_exp", "lewm_cube_best", "lewm_smoke"}
+ ALLOWED = {"lewm_cube_exp", "lewm_cube_best", "lewm_smoke",
Owner
Author
what are these directories and who creates them and why?
confirm_epochs: 10 # final confirmation run at the paper's budget
confirm_name: lewm_cube_confirm
confirm_score: -1.0
paper_test_score: -1.0 # paper-recipe headline on the held-out TEST eval
Owner
Author
`paper_test_score` shouldn't be anywhere in this code; get rid of all instances of it.
# -> a sample the selection loop never scored
test_num_eval: 200 # held-out TEST size for the headline evals (heavy);
# in-loop val stays at cube.yaml's 50 (cheap selection)
paper_name: lewm_cube_paper # checkpoint dir for the paper-recipe headline retrain
specialization_gain: -999.0 # seeded so the key always exists; set by gain_record
@@ -0,0 +1,34 @@
#!/usr/bin/env python3
@@ -0,0 +1,622 @@
# LeWM Task-Specialization Ceiling - Implementation Plan
@@ -0,0 +1,206 @@
# LeWM Task-Specialization Ceiling - Experiment Design
@@ -0,0 +1,32 @@
"""gain_record helper: computes specialized-minus-paper gain, logs it."""
@@ -0,0 +1,58 @@
"""Structural checks on the specialization flow.yaml."""
from pathlib import Path
…, keep held-out test only

Per PR #35 review: the flow should stand on its own numbers, not a paper-recipe-vs-specialized comparison. Remove the entire approach-B layer and keep only the held-out test bug fix.

Removed:
- paper-recipe headline steps (paper_headline_clean/train/eval) + paper_test_score, paper_name shared vars
- gain_record step + gain_record.py + test_gain_record.py + specialization_gain var
- lewm_cube_paper from clean_ckpt ALLOWED
- the spec + plan docs (specialization-ceiling experiment)
- report skill: paper-recipe / DINO-WM / gain framing -> report is now independent (run's own best-val, held-out-test final, vs target)

Kept (the actual fix): selection evals run seed=val_seed (42); the final confirm eval runs seed=test_seed (1234) + heavy eval.num_eval, so the reported number is on a sample selection never scored. le-wm unchanged.

Renamed the flow test to test_lewm_holdout_flow.py (asserts seed wiring + that the removed experiment bits stay gone). Full suite 397 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ument clean_ckpt dirs Remove paper/DINO-WM/PLDM/random comparison numbers from the task prose and the research-log header so the run stands on its own metric. Add a comment in clean_ckpt explaining the ALLOWED dirs are saage's own per-run checkpoint dirs (created by train.py/eval.py), kept apart from the user's real checkpoints.
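To make the allow-list guard concrete, here is a hedged sketch of how a script like clean_ckpt might refuse to delete anything outside ALLOWED. It is not the actual clean_ckpt.py: the function signature and the `root` argument are assumptions, and the set below assumes the final list is the three original names plus `lewm_cube_confirm`, as the PR description suggests.

```python
import shutil
import tempfile
from pathlib import Path

# Assumed final allow-list: saage's own per-run checkpoint dirs only.
ALLOWED = {"lewm_cube_exp", "lewm_cube_best", "lewm_smoke", "lewm_cube_confirm"}


def clean_ckpt(root, name):
    """Delete root/name only if name is on the allow-list; return True if removed."""
    if name not in ALLOWED:
        # Anything else may be a user's real checkpoint, so never touch it.
        raise ValueError(f"refusing to delete {name!r}: not in ALLOWED")
    target = Path(root) / name
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False  # nothing to clean; not an error


with tempfile.TemporaryDirectory() as root:
    (Path(root) / "lewm_smoke").mkdir()
    removed_first = clean_ckpt(root, "lewm_smoke")   # directory exists
    removed_again = clean_ckpt(root, "lewm_smoke")   # already gone
    print(removed_first, removed_again)
```

Matching on the bare directory name (rather than an arbitrary path) keeps the check simple: a name containing `..` or a path separator can never equal an entry in the set.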
Adds a held-out test to the `lewm_hillclimb_guided` flow (so the hill-climb no longer reports the headline on the same episodes it selected on) and builds the task-specialization ceiling experiment on top of it. All in saage; le-wm is not modified.

The bug this fixes
The guided hill-climb selected the winning recipe (`keep_or_revert` over N iterations) and reported the final number on the same 50-episode eval (cube.yaml, seed 42). That's test-set selection bias, the same thing `flows/greenfield_ml` avoids with a val/test split.

Held-out test (Option A: saage-only eval seeds)
Stock le-wm `eval.py` already samples eval episodes by `seed`, so:
- selection evals: `seed=val_seed` (42), 50 episodes
- headline evals: `seed=test_seed` (1234) + heavy `eval.num_eval` (200)

The headline is scored on an episode sample the selection loop never saw, so there is essentially no selection bias, with zero changes to the third-party le-wm repo. (Same-pool draws can overlap by ~1-2 episodes, which is negligible vs the 50-episode sampling noise. We deliberately chose this over a guaranteed-disjoint split in le-wm to avoid maintaining a fork we'd never upstream.)
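The overlap caveat can be sanity-checked with a small sketch. It assumes, as a simplification, that each seed yields a uniform draw without replacement from one shared pool; the real sampling in le-wm `eval.py` may differ. `POOL_SIZE` is a hypothetical value, since the actual pool size is not stated in this PR.

```python
import random


def sample_episodes(pool_size, num_eval, seed):
    """Draw num_eval distinct episode indices from the pool, determined by seed."""
    return set(random.Random(seed).sample(range(pool_size), num_eval))


def expected_overlap(pool_size, n_val, n_test):
    """Mean overlap of two independent uniform draws without replacement."""
    return n_val * n_test / pool_size


POOL_SIZE = 10_000  # hypothetical; the real pool size is not given in this PR

val = sample_episodes(POOL_SIZE, 50, seed=42)      # selection sample
test = sample_episodes(POOL_SIZE, 200, seed=1234)  # heavy headline sample
print("observed overlap:", len(val & test))
print("expected overlap:", expected_overlap(POOL_SIZE, 50, 200))
```

Under these assumptions the expected overlap scales as `n_val * n_test / pool_size`, so a larger pool or a smaller headline sample shrinks it further.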
Specialization-ceiling experiment (approach B)
Measures how much per-task recipe tuning buys vs the paper's transferable recipe.
`gain_record.py` computes the gain; the report skill leads with the layered headline + an honesty box (n=1 task, single seed, noisy 50-ep selection metric).

Files
- `flow.yaml`: paper-recipe headline steps, `val_seed`/`test_seed`/`test_num_eval`/`paper_test_score`/`specialization_gain` vars, gain_record step.
- `gain_record.py` (new): deterministic gain helper.
- `clean_ckpt.py`: allow `lewm_cube_confirm` (fixes a pre-existing latent bug) + `lewm_cube_paper`.
- `report/skill.md`: layered headline, recipe-diff, honesty box.
- `propose`/`proposal_critic` skills: forbid touching the held-out eval seed/size.
- `tests/`: `test_clean_ckpt_guided`, `test_gain_record`, `test_lewm_specialization_flow`.
- `docs/superpowers/{specs,plans}/2026-06-27-lewm-specialization*`: design spec + implementation plan (with the Option A banner).

Testing
Full suite 400 passed, 7 skipped (offline). The flow itself runs on GPU (spark-c0c0) and is not exercised in CI; the pilot run is the next step.
🤖 Generated with Claude Code