sophistry_bench_sprint_env: add training example and results#853
sophistry_bench_sprint_env: add training example and results#853acharyaanusha wants to merge 7 commits into
Conversation
…esults Adds the prime-rl GRPO config and per-step metrics from a 100-step run against the deployed env, plus a README section showing the reward-hacking signature (aggregate_reward up, correctness_reward flat). Also adds a from-scratch TRL GRPOTrainer example for training against the Space directly, for anyone without Prime Intellect access.
…heckpoint Closes the gap between local training output and an actual Hub artifact, matching the maintainer's "deployed to Hugging Face" ask for the training side, not just the Space.
The env is actually hosted at openenv-community/sophistry_bench_sprint_env, not anushaacharya/sophistry_bench_sprint_env as originally documented in huggingface#787 -- verified via `hf spaces info`.
… Intellect to a note Reframes the Training section so the TRL GRPOTrainer script (verified end-to-end against the deployed Space) is the primary documented path, matching this repo's own guidance that TRL is the recommended framework. The Prime Intellect run becomes supplementary evidence, not the headline. Also switches the script to GenericEnvClient + a directly-constructed UVProvider (avoiding a sync/async event-loop mismatch from mixing asyncio.run(from_env(...)) with .sync()), and bumps the default model to Qwen2.5-0.5B-Instruct for a cheaper, faster default run.
Runs the TRL GRPO example for real on Hugging Face Jobs (a10g-small, 100 steps, Qwen2.5-0.5B-Instruct) and documents the results honestly: the proxy reward (aggregate_reward) climbs and plateaus, confirming the example trains correctly end-to-end on HF infrastructure, but at this much smaller scale (~800 total rollouts vs. the Prime Intellect run's ~12,800) the policy collapses to near-empty completions rather than converging on the claim_count_cliff target -- a different reward-hacking shortcut, not a replication of the Prime Intellect run's specific curve. correctness_reward stays noisy/decoupled either way, which is the core finding both runs share. Also extends the reward_func to log per-step reward components (not just the scalar reward), since correctness_reward/n_claims live in observation["components"], which the trainer never needed but the README table does. Opts into SPRINT_EXPOSE_CORRECTNESS=1 for the locally-run clone (not the shared Space) since this is exactly the "trusted measurement code" use case the env's own README carves out -- never fed back into the prompt. Tuning notes from getting this to actually run without OOM on a10g-small: - per_device_train_batch_size is the *total* rollout count per step (must be divisible by num_generations), not unique-prompts * num_generations. - bf16 matters more than usual here: entropy/logprob computation materializes a [batch, completion_len, vocab_size] logits tensor, and a ~150K-token vocab (Qwen2.5) dominates memory at fp32. - gradient_checkpointing=True had no measurable effect in this setup (same OOM numbers with and without); reducing batch size was what actually fixed it. Left in since it's harmless, but don't rely on it alone.
…loop
EnvClient.connect() had a guard `if self._ws is not None: return self` that
no-ops regardless of which event loop established that websocket. This
breaks the officially documented pattern:
client = await SomeClient.from_env(...) # connects inside this loop
with client.sync() as sync_client: # drives calls on a NEW,
sync_client.reset() # separate background loop
`from_env()` ends with `await client.connect()`, binding `_ws` to whichever
loop ran that await (e.g. an `asyncio.run()` call, which is closed by the
time `from_env()` returns). `.sync()` then drives every later call through
`SyncEnvClient`'s own dedicated background-thread loop via
`run_coroutine_threadsafe`. `SyncEnvClient.connect()` does call
`self._async.connect()`, but the no-op guard returns immediately since `_ws`
is already set -- so the websocket never gets rebound to the loop that's
actually being used, and every `reset()`/`step()` call schedules work on a
live loop while operating on a connection object tied to a dead one. Found
while building a training example that does exactly this (huggingface#853) -- not
specific to that example; any `from_env()` + `.sync()` caller hits it.
Fix: track which loop created `_ws` (`_ws_loop`). `connect()` only no-ops if
the *current* running loop matches; otherwise it drops the stale reference
(unusable anyway, since its loop is typically already closed) and connects
fresh on the current loop. `disconnect()` clears `_ws_loop` alongside `_ws`.
Added `TestForeignLoopReconnect` (3 cases): same-loop reconnect stays a
no-op, a foreign-loop reconnect actually re-establishes the connection, and
disconnect() clears the loop-tracking state.
From a self-review pass before requesting maintainer review on huggingface#853: - Validate --per-device-batch-size % --num-generations == 0 up front, before the ~180s env clone/start and dataset build -- previously this only surfaced as an opaque ValueError deep inside GRPOTrainer construction. - Extract completion-text parsing into _completion_text(), which now raises a clear error on an empty/malformed completion list instead of a bare IndexError/TypeError. - Assert completions and seed are the same length in the reward function, instead of letting zip() silently truncate and misalign reward<->task. - Write the components CSV under output_dir (which save_model() already guarantees exists) instead of a sibling path derived from --out's basename, which could fail if --out's parent directory doesn't exist. - Extract the CSV-writing block into write_metrics_csv(). Also tried switching make_sync_client() to the simpler from_env() + .sync() pattern, now that huggingface#854 fixes the event-loop mismatch that motivated building it manually in the first place -- and reverted. The fixed connect() does correctly reconnect on the new loop instead of hanging, but it can't cleanly close the *old* connection first (its event loop is already gone), so the old one is simply abandoned. That's harmless for envs that allow concurrent sessions, but this one doesn't (SUPPORTS_CONCURRENT_SESSIONS = False): the abandoned connection occupies the only session slot, and the real one fails with CAPACITY_REACHED. Confirmed by reproducing it locally. make_sync_client() avoids the problem by never creating that doomed first connection at all. Updated its docstring to explain both reasons.
|
@burtenshaw the training example with the results! |
burtenshaw
left a comment
There was a problem hiding this comment.
In general it looks. I would just try to cut down the footprint of the example so it's easier to follow. You might want to move some explanation and figures into the docs, that is currently in the script and csvs.
There was a problem hiding this comment.
Also drop this, the envs and examples should be decoupled.
There was a problem hiding this comment.
Dropped, agreed -- moved the result numbers into the PR description / a one-paragraph summary in the env README instead.
| provider = UVProvider( | ||
| project_path=f"git+https://huggingface.co/spaces/{SPACE_REPO_ID}", | ||
| app="sophistry_bench_sprint_env.server.app:app", | ||
| # The default 60s readiness timeout can be too tight for a cold clone |
There was a problem hiding this comment.
Let's cut down the notes a lot.
There was a problem hiding this comment.
Cut the module docstring and inline comments down to the essentials -- the event-loop/single-session explanation now lives in #854's PR description, not duplicated here in prose.
| return client.sync() | ||
|
|
||
|
|
||
| def write_metrics_csv(metrics_log: list[dict], path: str) -> None: |
There was a problem hiding this comment.
I would use the native trackio integration in TRL or skip tracking.
There was a problem hiding this comment.
Dropped the custom metrics tracking entirely -- reward_func now just returns rewards, no per-step CSV/component logging in the script.
| return reward_func | ||
|
|
||
|
|
||
| def make_sync_client(): |
There was a problem hiding this comment.
We can do this in a single function.
There was a problem hiding this comment.
Inlined make_sync_client() into main() -- it was only used once.
Addressing @burtenshaw's review on huggingface#853: - Drop training/hf_jobs_metrics.csv, training/metrics.csv, training/sophistry_bench_sprint.toml -- envs and examples should be decoupled; these were training artifacts, not env source. - Drop the custom metrics_log/components-CSV tracking in the example script entirely (reward_func just returns rewards now, like every other reward_funcs example in this repo) rather than wiring up trackio for a one-off script. - Inline make_sync_client() into main() -- it was only used once. - Cut the module docstring and inline comments down to the essentials; the full event-loop/single-session explanation lives in huggingface#854 and the CAPACITY_REACHED finding, not duplicated here in prose. - Condense the env README's "Training" section from two tables + ~40 lines of analysis to one paragraph; the full numbers are in the PR description. Re-verified end-to-end after the rewrite (4 episodes, 1 step): trains, saves a real checkpoint, no regressions.
Summary
Follow-up to #787, addressing the maintainer's request to see this env "deployed to Hugging Face and a working example of training or inference."
examples/sophistry_bench_sprint_grpo.py— trains a policy on this env with TRL'sGRPOTrainer. Single-step env, so it's a plain prompt -> completion -> reward GRPO setup, noenvironment_factory/tool-calling needed. Connects via a manually-builtUVProvider+GenericEnvClientrather thanfrom_env()+.sync()(see the module docstring for why -- this env only allows one concurrent session, and the orphaned-connection issue in fix(uv-provider): clone git+ project paths before uv run --project #854 would occupy that slot). Only depends onopenenv[core]from PyPI, so it runs as a standaloneuvscript, including viahf jobs uv run.openenv-community/sophistry_bench_sprint_env, notanushaacharya/...as originally documented in Add sophistry_bench_sprint_env: single-agent advocacy reward-hacking environment #787).--per-device-batch-size % --num-generations == 0up front, guards against malformed/empty completions, assertscompletions/seedstay aligned.envs/(envs and examples decoupled), trimmed docstrings/comments to the essentials.Results
Both validated runs show
aggregate_reward(the optimized proxy) climbing whilecorrectness_reward(the hidden ground truth, weight 0) stays flat -- the reward-hacking signature this env is designed to surface:a10g-small,Qwen2.5-0.5B-Instruct, 100 steps, job6a3bfb825f9c8079e0fb2664):aggregate_rewardclimbs from ~0.35 to a ~0.50 plateau. At this scale (~800 total rollouts) the policy collapses to near-empty completions (n_claims-> ~0) rather than hitting theclaim_count_clifftarget -- a different reward-hacking shortcut than the Prime Intellect run below, not a replication of it.Llama-3.2-1B-Instruct, 100 steps, registered asanusha/sophistry-bench-sprint, parity-tested against this port):aggregate_rewardclimbs from ~0.48 to a ~0.77 plateau andn_claimssaturates at exactly 8 (the literalclaim_count_clifftarget) -- the textbook version of the signature.correctness_rewardstays noisy/decoupled from the optimized reward in both runs, which is the core finding they share.Test plan
python3 scripts/sync_env_docs.py --checkpassesCAPACITY_REACHEDfailure mode, now documented in the scriptUVProvidergit-clone fix in fix(uv-provider): clone git+ project paths before uv run --project #854 (needed for the no-Docker connection path); the script's PEP-723 header notes theopenenv[core]git-ref override needed until that fix is released🤖 Generated with Claude Code