Skip to content

[Draft] core: harness-rollout contract + runnable example for online RL with TRL#864

Draft
sergiopaniego wants to merge 4 commits into
mainfrom
feat/trl-harness-rollout-core
Draft

[Draft] core: harness-rollout contract + runnable example for online RL with TRL#864
sergiopaniego wants to merge 4 commits into
mainfrom
feat/trl-harness-rollout-core

Conversation

@sergiopaniego

@sergiopaniego sergiopaniego commented Jun 25, 2026

Copy link
Copy Markdown
Member

[Draft] core: harness-rollout contract + runnable example for online RL with TRL

Draft, single self-contained artifact. A small, framework-neutral contract in OpenEnv core for
training an agentic harness (an agent that owns its loop) with an online RL trainer, plus a runnable
example that exercises it end-to-end (no GPU, and with real vLLM). Built to be read in one place, not
to be merged as-is.

Summary

Two parts, in one PR so the whole dynamic is reviewable together:

1. The contract (OpenEnv core, src/openenv/core/harness/):

File Role
rollout.py The contract (interfaces only): AgentSession, AgentSessionFactory, RolloutMessages, GenerateAPI.
interception.py InterceptionServer: a minimal OpenAI-compatible proxy that gates a harness's LLM calls.

This is the minimal seam, deliberately small and readable (in contrast to the larger
interception/sandbox stack in #694 and the SWE worker in #695, which are hard to review as one unit).

2. The runnable example (examples/trl_harness_rollout/):

A HarnessRolloutWorker driving a ReAct harness through the interception proxy, scoring with the env's
verify(), emitting message-level rollouts. Runs with no GPU (a scripted generator) and with real vLLM
(real per-turn token_ids + logprobs capture). Self-contained: it vendors the two contract files so
it runs with just aiohttp + requests and is copy-pasteable into a trainer.

Why

A harness owns its own loop. To train it on-policy you gate its LLM calls (interception), let the
trainer generate and capture exact tokens, score the episode with the env's verify(), and turn the
transcript into a training sample. This names the small seam between the two sides so a trainer can
build a rollout worker against it without OpenEnv depending on any training-specific shapes.

The two seams

  • Seam 1 (the trainer implements): generate(rollout_id, turn, messages, tools, sampling) -> completion_text. Generate with the trainer's engine, record token_ids + logprobs keyed by
    (rollout_id, turn). This is where messages-to-tokens / TITO lives.
  • Seam 2 (the worker emits, the trainer consumes): RolloutMessages{rollout_id, messages, reward}.
    Reward comes from the env's verify(). The trainer stitches the captured tokens into a sample.

Where the worker really lives

The HarnessRolloutWorker shown in examples/ is the trainer side. It conforms to TRL's
RolloutWorkerProtocol, so for a real integration it gets vendored into TRL (confirmed direction).
It is included here, runnable, so the contract is not just interfaces but something you can read and
execute in one place. The worker is harness-agnostic and framework-neutral: TRL is the first consumer,
not the only possible one.

This is complementary to #694 (richer interception + sandbox backends + Pi/OpenCode adapters) and #695
(SWE env + a full AsyncGRPO worker). The minimal interception.py here should be reconciled with
#694's InterceptionServer (drop in favor of it, or fold in the missing features). Flagged in the
docstring.

Weight sync (trainer side): TRL handles it on the default path (environment_factory + the built-in
AsyncRolloutWorker). On the injected-worker path it sets weight_transfer = None and
_sync_weight() is a no-op, so the trainer side wires it (worker-driven, the way #695 does, or by
extending _sync_weight). Not an OpenEnv concern, noted for completeness.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation
  • New environment
  • Refactoring

Alignment Checklist

  • Reward stays in the env (verify); the interception tool surface does not expose control-plane
    tools to the agent. Aligns with .claude/docs/PRINCIPLES.md / INVARIANTS.md.
  • usort + ruff clean.

RFC Status

  • Not required. Relates to RFC 005 (Agentic Harness Integration).

Comment thread src/openenv/core/harness/interception.py Fixed
Comment thread src/openenv/core/harness/interception.py Fixed
Comment thread src/openenv/core/harness/rollout_worker.py Fixed
Comment thread src/openenv/core/harness/rollout_worker.py Fixed
Comment thread src/openenv/core/harness/rollout_worker.py Fixed
Comment thread src/openenv/core/harness/rollout_worker.py Fixed
Comment thread src/openenv/core/harness/rollout_worker.py Fixed
Comment thread src/openenv/core/harness/rollout_worker.py Fixed
@sergiopaniego sergiopaniego changed the title [Draft] core, examples: harness rollout worker for online RL with TRL [Draft] core: harness-rollout contract + runnable example for online RL with TRL Jun 25, 2026
) -> str:
"""Generate one assistant turn. The trainer records token_ids + logprobs keyed by
(rollout_id, turn) and returns the completion text."""
...
class AgentSession(Protocol):
def next_request(self) -> dict | None:
"""Intercepted agent LLM call ({messages, tools, request_id}), or None when the agent exits."""
...
"""Intercepted agent LLM call ({messages, tools, request_id}), or None when the agent exits."""
...

def deliver(self, intercept: dict, completion_text: str) -> None: ...
def deliver(self, intercept: dict, completion_text: str) -> None: ...
def verify(
self,
) -> Any: ... # returns an object with .env_reward (reward stays in the env)
def verify(
self,
) -> Any: ... # returns an object with .env_reward (reward stays in the env)
def close(self) -> None: ...


class AgentSessionFactory(Protocol):
def create(self, *, task: Any, rollout_id: str) -> AgentSession: ...
) -> str:
"""Generate one assistant turn. The trainer records token_ids + logprobs keyed by
(rollout_id, turn) and returns the completion text."""
...
class AgentSession(Protocol):
def next_request(self) -> dict | None:
"""Intercepted agent LLM call ({messages, tools, request_id}), or None when the agent exits."""
...
"""Intercepted agent LLM call ({messages, tools, request_id}), or None when the agent exits."""
...

def deliver(self, intercept: dict, completion_text: str) -> None: ...
def deliver(self, intercept: dict, completion_text: str) -> None: ...
def verify(
self,
) -> Any: ... # returns an object with .env_reward (reward stays in the env)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant