[Draft] core: harness-rollout contract + runnable example for online RL with TRL#864
Draft
sergiopaniego wants to merge 4 commits into
Draft
[Draft] core: harness-rollout contract + runnable example for online RL with TRL#864sergiopaniego wants to merge 4 commits into
sergiopaniego wants to merge 4 commits into
Conversation
…or online RL with TRL
| ) -> str: | ||
| """Generate one assistant turn. The trainer records token_ids + logprobs keyed by | ||
| (rollout_id, turn) and returns the completion text.""" | ||
| ... |
| class AgentSession(Protocol): | ||
| def next_request(self) -> dict | None: | ||
| """Intercepted agent LLM call ({messages, tools, request_id}), or None when the agent exits.""" | ||
| ... |
| """Intercepted agent LLM call ({messages, tools, request_id}), or None when the agent exits.""" | ||
| ... | ||
|
|
||
| def deliver(self, intercept: dict, completion_text: str) -> None: ... |
| def deliver(self, intercept: dict, completion_text: str) -> None: ... | ||
| def verify( | ||
| self, | ||
| ) -> Any: ... # returns an object with .env_reward (reward stays in the env) |
| def verify( | ||
| self, | ||
| ) -> Any: ... # returns an object with .env_reward (reward stays in the env) | ||
| def close(self) -> None: ... |
|
|
||
|
|
||
| class AgentSessionFactory(Protocol): | ||
| def create(self, *, task: Any, rollout_id: str) -> AgentSession: ... |
| ) -> str: | ||
| """Generate one assistant turn. The trainer records token_ids + logprobs keyed by | ||
| (rollout_id, turn) and returns the completion text.""" | ||
| ... |
| class AgentSession(Protocol): | ||
| def next_request(self) -> dict | None: | ||
| """Intercepted agent LLM call ({messages, tools, request_id}), or None when the agent exits.""" | ||
| ... |
| """Intercepted agent LLM call ({messages, tools, request_id}), or None when the agent exits.""" | ||
| ... | ||
|
|
||
| def deliver(self, intercept: dict, completion_text: str) -> None: ... |
| def deliver(self, intercept: dict, completion_text: str) -> None: ... | ||
| def verify( | ||
| self, | ||
| ) -> Any: ... # returns an object with .env_reward (reward stays in the env) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
[Draft] core: harness-rollout contract + runnable example for online RL with TRL
Summary
Two parts, in one PR so the whole dynamic is reviewable together:
1. The contract (OpenEnv core,
src/openenv/core/harness/):rollout.pyAgentSession,AgentSessionFactory,RolloutMessages,GenerateAPI.interception.pyInterceptionServer: a minimal OpenAI-compatible proxy that gates a harness's LLM calls.This is the minimal seam, deliberately small and readable (in contrast to the larger
interception/sandbox stack in #694 and the SWE worker in #695, which are hard to review as one unit).
2. The runnable example (
examples/trl_harness_rollout/):A
HarnessRolloutWorkerdriving a ReAct harness through the interception proxy, scoring with the env'sverify(), emitting message-level rollouts. Runs with no GPU (a scripted generator) and with real vLLM(real per-turn
token_ids+logprobscapture). Self-contained: it vendors the two contract files soit runs with just
aiohttp+requestsand is copy-pasteable into a trainer.Why
A harness owns its own loop. To train it on-policy you gate its LLM calls (interception), let the
trainer generate and capture exact tokens, score the episode with the env's
verify(), and turn thetranscript into a training sample. This names the small seam between the two sides so a trainer can
build a rollout worker against it without OpenEnv depending on any training-specific shapes.
The two seams
generate(rollout_id, turn, messages, tools, sampling) -> completion_text. Generate with the trainer's engine, record token_ids + logprobs keyed by(rollout_id, turn). This is where messages-to-tokens / TITO lives.RolloutMessages{rollout_id, messages, reward}.Reward comes from the env's
verify(). The trainer stitches the captured tokens into a sample.Where the worker really lives
The
HarnessRolloutWorkershown inexamples/is the trainer side. It conforms to TRL'sRolloutWorkerProtocol, so for a real integration it gets vendored into TRL (confirmed direction).It is included here, runnable, so the contract is not just interfaces but something you can read and
execute in one place. The worker is harness-agnostic and framework-neutral: TRL is the first consumer,
not the only possible one.
This is complementary to #694 (richer interception + sandbox backends + Pi/OpenCode adapters) and #695
(SWE env + a full AsyncGRPO worker). The minimal
interception.pyhere should be reconciled with#694's
InterceptionServer(drop in favor of it, or fold in the missing features). Flagged in thedocstring.
Type of Change
Alignment Checklist
verify); the interception tool surface does not expose control-planetools to the agent. Aligns with
.claude/docs/PRINCIPLES.md/INVARIANTS.md.usort+ruffclean.RFC Status