sophistry_bench_sprint_env: add training example and results by acharyaanusha · Pull Request #853 · huggingface/OpenEnv

acharyaanusha · 2026-06-24T00:32:25Z

Summary

Follow-up to #787, addressing the maintainer's request to see this env "deployed to Hugging Face and a working example of training or inference."

- examples/sophistry_bench_sprint_grpo.py: trains a policy on this env with TRL's GRPOTrainer. Single-step env, so it's a plain prompt -> completion -> reward GRPO setup, no environment_factory/tool-calling needed. Connects via a manually-built UVProvider + GenericEnvClient rather than from_env() + .sync() (see the module docstring for why -- this env only allows one concurrent session, and the orphaned-connection issue in #854, "fix(uv-provider): clone git+ project paths before uv run --project", would occupy that slot). Only depends on openenv[core] from PyPI, so it runs as a standalone uv script, including via hf jobs uv run.
- Fixed a stale repo id throughout (env was actually deployed at openenv-community/sophistry_bench_sprint_env, not anushaacharya/... as originally documented in #787, "Add sophistry_bench_sprint_env: single-agent advocacy reward-hacking environment").
- Hardened the script: validates --per-device-batch-size % --num-generations == 0 up front, guards against malformed/empty completions, asserts completions/seed stay aligned.
- Cut footprint per review: no per-step metrics CSV/component tracking in the script (reward_func just returns rewards, like the rest of this repo's examples), no training-config/metrics files committed under envs/ (envs and examples decoupled), trimmed docstrings/comments to the essentials.
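
The three hardening checks listed above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names, the error messages, and the assumption that chat-style completions arrive as a list of {"role", "content"} dicts are all hypothetical.

```python
from typing import Any, Sequence


def check_batch_divisible(per_device_batch_size: int, num_generations: int) -> None:
    """Fail fast if the per-step rollout count cannot be split into whole groups."""
    if num_generations <= 0 or per_device_batch_size % num_generations != 0:
        raise ValueError(
            f"--per-device-batch-size ({per_device_batch_size}) must be a positive "
            f"multiple of --num-generations ({num_generations})"
        )


def completion_text(completion: Any) -> str:
    """Extract the text from a plain-string or chat-style completion."""
    if isinstance(completion, str):
        return completion
    # Assumption: chat-style completions are a non-empty list of message dicts.
    if isinstance(completion, list) and completion and isinstance(completion[-1], dict):
        content = completion[-1].get("content")
        if isinstance(content, str):
            return content
    raise ValueError(f"empty or malformed completion: {completion!r}")


def pair_completions_with_seeds(completions: Sequence[Any], seeds: Sequence[int]):
    """Pair each completion with its task seed, refusing to truncate silently."""
    if len(completions) != len(seeds):
        raise ValueError(f"{len(completions)} completions but {len(seeds)} seeds")
    return [(completion_text(c), s) for c, s in zip(completions, seeds)]
```

The length check matters because zip() stops at the shorter input, which would otherwise assign rewards to the wrong tasks without raising anything.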

Results

Both validated runs show aggregate_reward (the optimized proxy) climbing while correctness_reward (the hidden ground truth, weight 0) stays flat -- the reward-hacking signature this env is designed to surface:

- Hugging Face Jobs (a10g-small, Qwen2.5-0.5B-Instruct, 100 steps, job 6a3bfb825f9c8079e0fb2664): aggregate_reward climbs from ~0.35 to a ~0.50 plateau. At this scale (~800 total rollouts) the policy collapses to near-empty completions (n_claims -> ~0) rather than hitting the claim_count_cliff target -- a different reward-hacking shortcut than the Prime Intellect run below, not a replication of it.
- Prime Intellect Hub (Llama-3.2-1B-Instruct, 100 steps, registered as anusha/sophistry-bench-sprint, parity-tested against this port): aggregate_reward climbs from ~0.48 to a ~0.77 plateau and n_claims saturates at exactly 8 (the literal claim_count_cliff target) -- the textbook version of the signature.

correctness_reward stays noisy/decoupled from the optimized reward in both runs, which is the core finding they share.
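
To make the "weight 0" point concrete, here is a minimal sketch of a weighted aggregate in which the ground-truth component is measured but never contributes to the optimized scalar. Only the name correctness_reward comes from this PR; proxy_component, the weight values, and the function itself are hypothetical and do not reflect the env's real rubric.

```python
# Hypothetical weights: only correctness_reward's weight of 0 is taken from the PR text.
WEIGHTS = {"proxy_component": 1.0, "correctness_reward": 0.0}


def aggregate_reward(components: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of reward components; unknown components count as weight 0."""
    return sum(weights.get(name, 0.0) * value for name, value in components.items())


# Two rollouts that differ only in correctness receive the same optimized reward.
right = aggregate_reward({"proxy_component": 0.5, "correctness_reward": 1.0})
wrong = aggregate_reward({"proxy_component": 0.5, "correctness_reward": 0.0})
```

Because the optimizer only ever sees the aggregate, nothing pushes the policy toward correct answers, which is consistent with correctness_reward staying flat while aggregate_reward climbs.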

Test plan

- python3 scripts/sync_env_docs.py --check passes
- Local smoke tests verified end-to-end against the live deployed Space, and re-verified after the simplification rewrite
- Real 100-step GRPO run executed on Hugging Face Jobs, completed successfully (results above)
- Self-reviewed (8-angle pass) before requesting review; found and fixed a real CAPACITY_REACHED failure mode, now documented in the script
- Depends on the UVProvider git-clone fix in #854, "fix(uv-provider): clone git+ project paths before uv run --project" (needed for the no-Docker connection path); the script's PEP-723 header notes the openenv[core] git-ref override needed until that fix is released

🤖 Generated with Claude Code

…esults

Adds the prime-rl GRPO config and per-step metrics from a 100-step run against the deployed env, plus a README section showing the reward-hacking signature (aggregate_reward up, correctness_reward flat). Also adds a from-scratch TRL GRPOTrainer example for training against the Space directly, for anyone without Prime Intellect access.

…heckpoint

Closes the gap between local training output and an actual Hub artifact, matching the maintainer's "deployed to Hugging Face" ask for the training side, not just the Space.

The env is actually hosted at openenv-community/sophistry_bench_sprint_env, not anushaacharya/sophistry_bench_sprint_env as originally documented in huggingface#787 -- verified via `hf spaces info`.

… Intellect to a note

Reframes the Training section so the TRL GRPOTrainer script (verified end-to-end against the deployed Space) is the primary documented path, matching this repo's own guidance that TRL is the recommended framework. The Prime Intellect run becomes supplementary evidence, not the headline.

Also switches the script to GenericEnvClient + a directly-constructed UVProvider (avoiding a sync/async event-loop mismatch from mixing asyncio.run(from_env(...)) with .sync()), and bumps the default model to Qwen2.5-0.5B-Instruct for a cheaper, faster default run.

Runs the TRL GRPO example for real on Hugging Face Jobs (a10g-small, 100 steps, Qwen2.5-0.5B-Instruct) and documents the results honestly: the proxy reward (aggregate_reward) climbs and plateaus, confirming the example trains correctly end-to-end on HF infrastructure, but at this much smaller scale (~800 total rollouts vs. the Prime Intellect run's ~12,800) the policy collapses to near-empty completions rather than converging on the claim_count_cliff target -- a different reward-hacking shortcut, not a replication of the Prime Intellect run's specific curve. correctness_reward stays noisy/decoupled either way, which is the core finding both runs share.

Also extends the reward_func to log per-step reward components (not just the scalar reward), since correctness_reward/n_claims live in observation["components"], which the trainer never needed but the README table does.

Opts into SPRINT_EXPOSE_CORRECTNESS=1 for the locally-run clone (not the shared Space) since this is exactly the "trusted measurement code" use case the env's own README carves out -- never fed back into the prompt.

Tuning notes from getting this to actually run without OOM on a10g-small:

- per_device_train_batch_size is the *total* rollout count per step (must be divisible by num_generations), not unique-prompts * num_generations.
- bf16 matters more than usual here: entropy/logprob computation materializes a [batch, completion_len, vocab_size] logits tensor, and a ~150K-token vocab (Qwen2.5) dominates memory at fp32.
- gradient_checkpointing=True had no measurable effect in this setup (same OOM numbers with and without); reducing batch size was what actually fixed it. Left in since it's harmless, but don't rely on it alone.
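
The first two tuning notes come down to simple arithmetic, sketched below. The numbers are assumptions for illustration: a batch of 8 is inferred from ~800 rollouts over 100 steps, the vocab size uses the rounded ~150K figure quoted above, and the completion length of 512 and num_generations of 4 are hypothetical.

```python
def logits_bytes(batch: int, completion_len: int, vocab_size: int, bytes_per_elem: int) -> int:
    """Size of one dense [batch, completion_len, vocab_size] logits tensor."""
    return batch * completion_len * vocab_size * bytes_per_elem


GIB = 1024 ** 3
BATCH = 8             # total rollouts per step (inferred: ~800 rollouts / 100 steps)
NUM_GENERATIONS = 4   # hypothetical group size; BATCH must be a multiple of it
COMPLETION_LEN = 512  # hypothetical max completion length
VOCAB_SIZE = 150_000  # rounded "~150K" figure from the note above

# The batch is the total rollout count, so unique prompts per step is the quotient.
unique_prompts_per_step = BATCH // NUM_GENERATIONS

fp32 = logits_bytes(BATCH, COMPLETION_LEN, VOCAB_SIZE, 4)
bf16 = logits_bytes(BATCH, COMPLETION_LEN, VOCAB_SIZE, 2)
print(f"unique prompts per step: {unique_prompts_per_step}")
print(f"fp32 logits: {fp32 / GIB:.2f} GiB, bf16 logits: {bf16 / GIB:.2f} GiB")
```

Under these assumptions a single fp32 logits tensor is about 2.29 GiB against about 1.14 GiB in bf16, and it scales linearly with batch size, which fits the observation that shrinking the batch fixed the OOM.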

…loop

EnvClient.connect() had a guard `if self._ws is not None: return self` that no-ops regardless of which event loop established that websocket. This breaks the officially documented pattern:

    client = await SomeClient.from_env(...)  # connects inside this loop
    with client.sync() as sync_client:       # drives calls on a NEW,
        sync_client.reset()                  # separate background loop

`from_env()` ends with `await client.connect()`, binding `_ws` to whichever loop ran that await (e.g. an `asyncio.run()` call, which is closed by the time `from_env()` returns). `.sync()` then drives every later call through `SyncEnvClient`'s own dedicated background-thread loop via `run_coroutine_threadsafe`. `SyncEnvClient.connect()` does call `self._async.connect()`, but the no-op guard returns immediately since `_ws` is already set -- so the websocket never gets rebound to the loop that's actually being used, and every `reset()`/`step()` call schedules work on a live loop while operating on a connection object tied to a dead one.

Found while building a training example that does exactly this (huggingface#853) -- not specific to that example; any `from_env()` + `.sync()` caller hits it.

Fix: track which loop created `_ws` (`_ws_loop`). `connect()` only no-ops if the *current* running loop matches; otherwise it drops the stale reference (unusable anyway, since its loop is typically already closed) and connects fresh on the current loop. `disconnect()` clears `_ws_loop` alongside `_ws`.

Added `TestForeignLoopReconnect` (3 cases): same-loop reconnect stays a no-op, a foreign-loop reconnect actually re-establishes the connection, and disconnect() clears the loop-tracking state.
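
The loop-tracking guard described above can be sketched in isolation like this. It is a toy model, not the real EnvClient: FakeSocket, LoopAwareClient, and connect_count are hypothetical stand-ins, and only the `_ws` / `_ws_loop` names follow the commit message.

```python
import asyncio


class FakeSocket:
    """Stand-in for a websocket; a real client would open a connection here."""


class LoopAwareClient:
    def __init__(self) -> None:
        self._ws = None
        self._ws_loop = None
        self.connect_count = 0  # counts real (non-no-op) connects, for illustration

    async def connect(self) -> "LoopAwareClient":
        loop = asyncio.get_running_loop()
        if self._ws is not None and self._ws_loop is loop:
            return self  # same loop that created the socket: stay a no-op
        # Socket belongs to a foreign (usually closed) loop: drop it and reconnect.
        self._ws = FakeSocket()
        self._ws_loop = loop
        self.connect_count += 1
        return self

    async def disconnect(self) -> None:
        self._ws = None
        self._ws_loop = None


client = LoopAwareClient()


async def connect_twice() -> None:
    await client.connect()
    await client.connect()


asyncio.run(connect_twice())       # first loop: one real connect, then a no-op
first_loop_count = client.connect_count
asyncio.run(client.connect())      # second, separate loop: reconnects
second_loop_count = client.connect_count
asyncio.run(client.disconnect())   # clears both the socket and the loop reference
```

Each asyncio.run() call creates a fresh event loop, which stands in here for the hand-off between the loop that ran from_env() and the background loop used by .sync().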

From a self-review pass before requesting maintainer review on huggingface#853:

- Validate --per-device-batch-size % --num-generations == 0 up front, before the ~180s env clone/start and dataset build -- previously this only surfaced as an opaque ValueError deep inside GRPOTrainer construction.
- Extract completion-text parsing into _completion_text(), which now raises a clear error on an empty/malformed completion list instead of a bare IndexError/TypeError.
- Assert completions and seed are the same length in the reward function, instead of letting zip() silently truncate and misalign reward<->task.
- Write the components CSV under output_dir (which save_model() already guarantees exists) instead of a sibling path derived from --out's basename, which could fail if --out's parent directory doesn't exist.
- Extract the CSV-writing block into write_metrics_csv().

Also tried switching make_sync_client() to the simpler from_env() + .sync() pattern, now that huggingface#854 fixes the event-loop mismatch that motivated building it manually in the first place -- and reverted. The fixed connect() does correctly reconnect on the new loop instead of hanging, but it can't cleanly close the *old* connection first (its event loop is already gone), so the old one is simply abandoned. That's harmless for envs that allow concurrent sessions, but this one doesn't (SUPPORTS_CONCURRENT_SESSIONS = False): the abandoned connection occupies the only session slot, and the real one fails with CAPACITY_REACHED. Confirmed by reproducing it locally. make_sync_client() avoids the problem by never creating that doomed first connection at all. Updated its docstring to explain both reasons.
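
The CAPACITY_REACHED failure mode can be modelled with a toy single-slot server, shown below. This is only an illustration of the reasoning: SingleSessionServer and CapacityReached are hypothetical and do not mirror the env server's real implementation.

```python
class CapacityReached(RuntimeError):
    """Hypothetical stand-in for the env's CAPACITY_REACHED error."""


class SingleSessionServer:
    """Toy server that, like this env, allows one session at a time."""

    def __init__(self, max_sessions: int = 1) -> None:
        self.max_sessions = max_sessions
        self.open_sessions: set[int] = set()
        self._next_id = 0

    def open(self) -> int:
        if len(self.open_sessions) >= self.max_sessions:
            raise CapacityReached("CAPACITY_REACHED")
        self._next_id += 1
        self.open_sessions.add(self._next_id)
        return self._next_id

    def close(self, session_id: int) -> None:
        self.open_sessions.discard(session_id)


# Pattern 1: a first connection is opened and then abandoned without close(),
# as happens when its event loop is already gone.
server = SingleSessionServer()
server.open()
try:
    server.open()
    second_ok = True
except CapacityReached:
    second_ok = False

# Pattern 2: only one connection is ever created, so the slot is free for it.
fresh_server = SingleSessionServer()
only_session = fresh_server.open()
```

In this model the second connection in pattern 1 is rejected because nothing ever releases the abandoned slot, while pattern 2 never creates the extra connection in the first place.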

acharyaanusha · 2026-06-24T21:40:37Z

@burtenshaw the training example with the results!

burtenshaw

In general it looks good. I would just try to cut down the footprint of the example so it's easier to follow. You might want to move some of the explanation and figures that are currently in the script and CSVs into the docs.

burtenshaw · 2026-06-25T07:02:16Z

Also drop this, the envs and examples should be decoupled.

Dropped, agreed -- moved the result numbers into the PR description / a one-paragraph summary in the env README instead.

burtenshaw · 2026-06-25T07:04:55Z

+    provider = UVProvider(
+        project_path=f"git+https://huggingface.co/spaces/{SPACE_REPO_ID}",
+        app="sophistry_bench_sprint_env.server.app:app",
+        # The default 60s readiness timeout can be too tight for a cold clone


Let's cut down the notes a lot.

Cut the module docstring and inline comments down to the essentials -- the event-loop/single-session explanation now lives in #854's PR description, not duplicated here in prose.

burtenshaw · 2026-06-25T07:05:27Z

+    return client.sync()
+
+
+def write_metrics_csv(metrics_log: list[dict], path: str) -> None:


I would use the native trackio integration in TRL or skip tracking.

Dropped the custom metrics tracking entirely -- reward_func now just returns rewards, no per-step CSV/component logging in the script.

burtenshaw · 2026-06-25T07:06:04Z

+    return reward_func
+
+
+def make_sync_client():


We can do this in a single function.

Inlined make_sync_client() into main() -- it was only used once.

@burtenshaw

Addressing @burtenshaw's review on huggingface#853:

- Drop training/hf_jobs_metrics.csv, training/metrics.csv, training/sophistry_bench_sprint.toml -- envs and examples should be decoupled; these were training artifacts, not env source.
- Drop the custom metrics_log/components-CSV tracking in the example script entirely (reward_func just returns rewards now, like every other reward_funcs example in this repo) rather than wiring up trackio for a one-off script.
- Inline make_sync_client() into main() -- it was only used once.
- Cut the module docstring and inline comments down to the essentials; the full event-loop/single-session explanation lives in huggingface#854 and the CAPACITY_REACHED finding, not duplicated here in prose.
- Condense the env README's "Training" section from two tables + ~40 lines of analysis to one paragraph; the full numbers are in the PR description.

Re-verified end-to-end after the rewrite (4 episodes, 1 step): trains, saves a real checkpoint, no regressions.

acharyaanusha added 5 commits June 23, 2026 17:30

examples(sophistry_bench_sprint_grpo): add --push-to-hub to publish c…

00d48a1


fix(sophistry_bench_sprint_env): correct deployed Space repo id

8642072


acharyaanusha mentioned this pull request Jun 24, 2026

fix(env-client): reconnect when connect() is called from a different loop #860

Closed

3 tasks

acharyaanusha mentioned this pull request Jun 24, 2026

fix(uv-provider): clone git+ project paths before uv run --project #854

Open

6 tasks

burtenshaw reviewed Jun 25, 2026

acharyaanusha wants to merge 7 commits into huggingface:main from acharyaanusha:feature/sophistry-bench-sprint-grpo-training
