fix(uv-provider): clone git+ project paths before uv run --project#854
fix(uv-provider): clone git+ project paths before uv run --project#854acharyaanusha wants to merge 6 commits into
Conversation
EnvClient.from_env(repo_id, use_docker=False) builds a project_path of git+https://huggingface.co/spaces/{repo_id} and hands it straight to UVProvider, which passed it through os.path.abspath() (mangling the URL, e.g. collapsing "https://" to "https:/") and then to `uv run --project`, which only discovers a project in a local directory -- it has no support for remote git sources at all. The documented "clones the space and runs locally" behavior never actually happened; reset()/step() would hang until the 60s readiness timeout. Verified by reproducing against a real deployed Space (openenv-community/sophistry_bench_sprint_env) before and after. Fix: UVProvider now detects a git+ project_path, leaves it untouched until start(), clones it to a temp dir there, and passes the local clone path to uv run. The temp dir is removed in stop().
|
@burtenshaw Updated this PR — it now covers a second, related framework bug found in the same training example: |
Three follow-up fixes from review of the git-clone fix: 1. provider.start()/wait_for_ready() in EnvClient.from_env()'s UV branch run before the EnvClient exists, so a failure there left no object to call stop() on -- the cloned temp dir (and process) leaked permanently on every such failure. Now wrapped so any exception during start/wait_for_ready/ connect calls provider.stop() before re-raising. 2. _clone_git_project had no timeout, so a hung/slow git remote blocked start() (and the caller's readiness wait) indefinitely. Now bounded by context_timeout_s, cleaning up the partial clone dir on timeout. 3. start()'s "already running" guard only checks process liveness, not whether a previous run's clone dir was ever cleaned up. If the spawned process died on its own and start() was called again, the new clone silently overwrote self._clone_dir, leaking the old directory. start() now clears any stale clone dir before cloning again, and also cleans up on a Popen failure right after a successful clone. Also fixes a docstring convention nit (double backticks where the file/ CLAUDE.md convention is single backticks) in the lines this PR touches.
…aanusha/OpenEnv into fix/uv-provider-git-clone
…loop
EnvClient.connect() had a guard `if self._ws is not None: return self` that
no-ops regardless of which event loop established that websocket. This
breaks the officially documented pattern:
client = await SomeClient.from_env(...) # connects inside this loop
with client.sync() as sync_client: # drives calls on a NEW,
sync_client.reset() # separate background loop
`from_env()` ends with `await client.connect()`, binding `_ws` to whichever
loop ran that await (e.g. an `asyncio.run()` call, which is closed by the
time `from_env()` returns). `.sync()` then drives every later call through
`SyncEnvClient`'s own dedicated background-thread loop via
`run_coroutine_threadsafe`. `SyncEnvClient.connect()` does call
`self._async.connect()`, but the no-op guard returns immediately since `_ws`
is already set -- so the websocket never gets rebound to the loop that's
actually being used, and every `reset()`/`step()` call schedules work on a
live loop while operating on a connection object tied to a dead one. Found
while building a training example that does exactly this (huggingface#853) -- not
specific to that example; any `from_env()` + `.sync()` caller hits it.
Fix: track which loop created `_ws` (`_ws_loop`). `connect()` only no-ops if
the *current* running loop matches; otherwise it drops the stale reference
(unusable anyway, since its loop is typically already closed) and connects
fresh on the current loop. `disconnect()` clears `_ws_loop` alongside `_ws`.
Added `TestForeignLoopReconnect` (3 cases): same-loop reconnect stays a
no-op, a foreign-loop reconnect actually re-establishes the connection, and
disconnect() clears the loop-tracking state.
From a self-review pass before requesting maintainer review on huggingface#853: - Validate --per-device-batch-size % --num-generations == 0 up front, before the ~180s env clone/start and dataset build -- previously this only surfaced as an opaque ValueError deep inside GRPOTrainer construction. - Extract completion-text parsing into _completion_text(), which now raises a clear error on an empty/malformed completion list instead of a bare IndexError/TypeError. - Assert completions and seed are the same length in the reward function, instead of letting zip() silently truncate and misalign reward<->task. - Write the components CSV under output_dir (which save_model() already guarantees exists) instead of a sibling path derived from --out's basename, which could fail if --out's parent directory doesn't exist. - Extract the CSV-writing block into write_metrics_csv(). Also tried switching make_sync_client() to the simpler from_env() + .sync() pattern, now that huggingface#854 fixes the event-loop mismatch that motivated building it manually in the first place -- and reverted. The fixed connect() does correctly reconnect on the new loop instead of hanging, but it can't cleanly close the *old* connection first (its event loop is already gone), so the old one is simply abandoned. That's harmless for envs that allow concurrent sessions, but this one doesn't (SUPPORTS_CONCURRENT_SESSIONS = False): the abandoned connection occupies the only session slot, and the real one fails with CAPACITY_REACHED. Confirmed by reproducing it locally. make_sync_client() avoids the problem by never creating that doomed first connection at all. Updated its docstring to explain both reasons.
| self._ws = None | ||
| self._ws_loop = None | ||
|
|
||
| async def _ensure_connected(self) -> None: |
There was a problem hiding this comment.
Because thi reconnects when _ws is None. So client = await Client.from_env(...); client.sync().reset() without entering with client.sync() keeps using the websocket from the foreign loop, because sync_client.py (line 185) calls reset() directly and relies on _ensure_connected(). Move the loop-compatibility check into _ensure_connected() or always call connect() before send/receive; add a test for lazy client.sync().reset().
There was a problem hiding this comment.
Good catch, thanks. You're right -- _ensure_connected() had its own independent guard (if self._ws is None) that bypassed connect()'s loop-aware reconnect logic entirely whenever _ws was already set, which is exactly the foreign-loop case. Fixed in 78a1124: _ensure_connected() now always delegates to connect() instead of pre-checking _ws is None itself. Added test_lazy_send_reconnects_without_explicit_connect_call exercising the exact client.sync().reset()-without-prior-connect() pattern you described.
@burtenshaw caught a gap in the previous fix: _ensure_connected() (called by _send()/reset()/step()) had its own guard -- `if self._ws is None: await self.connect()` -- that's independent of connect()'s loop-aware reconnect logic. Since _ws is NOT None in the foreign-loop case (it's set, just bound to a dead loop), this guard skipped connect() entirely, so the reconnect fix never ran for callers that never explicitly call .connect() themselves -- e.g. `client = await Client.from_env(...); client.sync().reset()`, which goes straight to _send() -> _ensure_connected(). Fix: _ensure_connected() now always delegates to connect(), which already knows how to no-op cheaply on the same loop or reconnect on a different one. Added test_lazy_send_reconnects_without_explicit_connect_call, exercising _ensure_connected() directly in the exact lazy-call pattern that was broken.
Addressing @burtenshaw's review on huggingface#853: - Drop training/hf_jobs_metrics.csv, training/metrics.csv, training/sophistry_bench_sprint.toml -- envs and examples should be decoupled; these were training artifacts, not env source. - Drop the custom metrics_log/components-CSV tracking in the example script entirely (reward_func just returns rewards now, like every other reward_funcs example in this repo) rather than wiring up trackio for a one-off script. - Inline make_sync_client() into main() -- it was only used once. - Cut the module docstring and inline comments down to the essentials; the full event-loop/single-session explanation lives in huggingface#854 and the CAPACITY_REACHED finding, not duplicated here in prose. - Condense the env README's "Training" section from two tables + ~40 lines of analysis to one paragraph; the full numbers are in the PR description. Re-verified end-to-end after the rewrite (4 episodes, 1 step): trains, saves a real checkpoint, no regressions.
Summary
Two related framework bugs found while building a real training example against a deployed Space (#853).
1.
UVProvider/from_env(use_docker=False)never actually clones the git URLEnvClient.from_env(repo_id, use_docker=False)is documented to "clone the space and run it locally with uv," and builds aproject_pathofgit+https://huggingface.co/spaces/{repo_id}for that purpose. ButUVProvider:os.path.abspath(), which mangles it (e.g. collapseshttps://tohttps:/and prefixes the cwd).uv run --project <path>, which only discovers a project in a local directory — it has no support for remote git sources.So the "clones the space" behavior never happened:
reset()/step()would hang until the 60s readiness timeout.Fix:
UVProvidernow detects agit+-prefixedproject_path, leaves it untouched in__init__(noabspathmangling), and clones it to a temp directory instart()viagit clone --depth 1. The temp dir is removed instop(). Also closes three follow-up leaks found in review: a temp-dir/process leak ifstart()/wait_for_ready()fails before theEnvClientexists to clean up, no timeout on thegit clonesubprocess (a hung remote blockedstart()forever), and a leak if the spawned process died on its own andstart()was called again.2.
connect()no-ops onto a websocket bound to a dead event loopEnvClient.connect()has a guard —if self._ws is not None: return self— that ignores which event loop established that websocket. This breaks the officially documented pattern (docs/source/guides/async-sync.md):from_env()ends withawait client.connect(), binding_wsto whichever loop ran that await (e.g. anasyncio.run()call, closed by the timefrom_env()returns)..sync()then drives every later call throughSyncEnvClient's own dedicated background-thread loop. The no-op guard means the websocket never gets rebound — everyreset()/step()schedules work on a live loop while operating on a connection tied to a dead one, causing a hang or an asyncio "Future attached to a different loop" error. Not specific to any one example — anyfrom_env()+.sync()caller hits this.Fix:
EnvClientnow tracks which loop created_ws(self._ws_loop).connect()only treats an existing_wsas reusable if the current running loop matches; otherwise it drops the stale reference and connects fresh.Known limitation, found while testing this against #853: the fix reconnects correctly, but it can't cleanly close the old connection first — its event loop is already gone, so there's no way to send a proper close frame on it. For environments that allow concurrent sessions this is harmless (just one extra idle connection on the server until it times out). But an environment with
SUPPORTS_CONCURRENT_SESSIONS = False(single session, e.g.sophistry_bench_sprint_env) will reject the new (real) connection with a capacity error, because the abandoned old one still occupies the session slot from the server's point of view.#853's training script therefore still avoidsfrom_env()+.sync()for that specific env and connects viaUVProviderdirectly instead — this fix narrows the bug's blast radius (no more hang/crash for the common case) but doesn't make the pattern universally safe for single-session environments. Flagging this explicitly rather than overclaiming; a full fix would need a way to force-close a foreign-loop-bound websocket, which isn't straightforward with thewebsocketslibrary's public API.Test plan
tests/test_core/test_uv_provider.py:TestGitProjectPath(8 cases) covering the clone path, the timeout, and both leak scenarios.tests/test_core/test_generic_client.py:TestForeignLoopReconnect(3 cases) covering same-loop no-op, foreign-loop reconnect, anddisconnect()clearing loop state.pytest tests/test_core/ -q— 198 passed, 11 pre-existing skips.ruff format --check/ruff checkclean.openenv-community/sophistry_bench_sprint_env) — before the fix,from_env(..., use_docker=False)times out after 60s; after, it clones, starts the server, and a realreset()/step()round-trip succeeds.CAPACITY_REACHEDerror), confirming the caveat is accurate rather than theoretical.🤖 Generated with Claude Code