[Draft] core: harness-rollout contract + runnable example for online RL with TRL by sergiopaniego · Pull Request #864 · huggingface/OpenEnv

sergiopaniego · 2026-06-25T13:43:56Z

[Draft] core: harness-rollout contract + runnable example for online RL with TRL

Draft, single self-contained artifact. A small, framework-neutral contract in OpenEnv core for
training an agentic harness (an agent that owns its loop) with an online RL trainer, plus a runnable
example that exercises it end-to-end (no GPU, and with real vLLM). Built to be read in one place, not
to be merged as-is.

Summary

Two parts, in one PR so the whole dynamic is reviewable together:

1. The contract (OpenEnv core, src/openenv/core/harness/):

File	Role
`rollout.py`	The contract (interfaces only): `AgentSession`, `AgentSessionFactory`, `RolloutMessages`, `GenerateAPI`.
`interception.py`	`InterceptionServer`: a minimal OpenAI-compatible proxy that gates a harness's LLM calls.

This is the minimal seam, deliberately small and readable (in contrast to the larger
interception/sandbox stack in #694 and the SWE worker in #695, which are hard to review as one unit).

2. The runnable example (examples/trl_harness_rollout/):

A HarnessRolloutWorker driving a ReAct harness through the interception proxy, scoring with the env's
verify(), emitting message-level rollouts. Runs with no GPU (a scripted generator) and with real vLLM
(real per-turn token_ids + logprobs capture). Self-contained: it vendors the two contract files so
it runs with just aiohttp + requests and is copy-pasteable into a trainer.

Why

A harness owns its own loop. To train it on-policy you gate its LLM calls (interception), let the
trainer generate and capture exact tokens, score the episode with the env's verify(), and turn the
transcript into a training sample. This names the small seam between the two sides so a trainer can
build a rollout worker against it without OpenEnv depending on any training-specific shapes.

The two seams

Seam 1 (the trainer implements): generate(rollout_id, turn, messages, tools, sampling) -> completion_text. Generate with the trainer's engine, record token_ids + logprobs keyed by
(rollout_id, turn). This is where messages-to-tokens / TITO lives.
Seam 2 (the worker emits, the trainer consumes): RolloutMessages{rollout_id, messages, reward}.
Reward comes from the env's verify(). The trainer stitches the captured tokens into a sample.

Where the worker really lives

The HarnessRolloutWorker shown in examples/ is the trainer side. It conforms to TRL's
RolloutWorkerProtocol, so for a real integration it gets vendored into TRL (confirmed direction).
It is included here, runnable, so the contract is not just interfaces but something you can read and
execute in one place. The worker is harness-agnostic and framework-neutral: TRL is the first consumer,
not the only possible one.

This is complementary to #694 (richer interception + sandbox backends + Pi/OpenCode adapters) and #695
(SWE env + a full AsyncGRPO worker). The minimal interception.py here should be reconciled with
#694's InterceptionServer (drop in favor of it, or fold in the missing features). Flagged in the
docstring.

Weight sync (trainer side): TRL handles it on the default path (environment_factory + the built-in
AsyncRolloutWorker). On the injected-worker path it sets weight_transfer = None and
_sync_weight() is a no-op, so the trainer side wires it (worker-driven, the way #695 does, or by
extending _sync_weight). Not an OpenEnv concern, noted for completeness.

Type of Change

Alignment Checklist

Reward stays in the env (verify); the interception tool surface does not expose control-plane
tools to the agent. Aligns with .claude/docs/PRINCIPLES.md / INVARIANTS.md.
usort + ruff clean.

RFC Status

Not required. Relates to RFC 005 (Agentic Harness Integration).

… TRL

…or online RL with TRL

+    ) -> str:
+        """Generate one assistant turn. The trainer records token_ids + logprobs keyed by
+        (rollout_id, turn) and returns the completion text."""
+        ...


+class AgentSession(Protocol):
+    def next_request(self) -> dict | None:
+        """Intercepted agent LLM call ({messages, tools, request_id}), or None when the agent exits."""
+        ...


+        """Intercepted agent LLM call ({messages, tools, request_id}), or None when the agent exits."""
+        ...
+
+    def deliver(self, intercept: dict, completion_text: str) -> None: ...


+    def deliver(self, intercept: dict, completion_text: str) -> None: ...
+    def verify(
+        self,
+    ) -> Any: ...  # returns an object with .env_reward (reward stays in the env)


+    def verify(
+        self,
+    ) -> Any: ...  # returns an object with .env_reward (reward stays in the env)
+    def close(self) -> None: ...


+
+
+class AgentSessionFactory(Protocol):
+    def create(self, *, task: Any, rollout_id: str) -> AgentSession: ...


+    ) -> str:
+        """Generate one assistant turn. The trainer records token_ids + logprobs keyed by
+        (rollout_id, turn) and returns the completion text."""
+        ...


+class AgentSession(Protocol):
+    def next_request(self) -> dict | None:
+        """Intercepted agent LLM call ({messages, tools, request_id}), or None when the agent exits."""
+        ...


+        """Intercepted agent LLM call ({messages, tools, request_id}), or None when the agent exits."""
+        ...
+
+    def deliver(self, intercept: dict, completion_text: str) -> None: ...


+    def deliver(self, intercept: dict, completion_text: str) -> None: ...
+    def verify(
+        self,
+    ) -> Any: ...  # returns an object with .env_reward (reward stays in the env)


sergiopaniego added 2 commits June 25, 2026 14:59

[draft] examples: add harness rollout worker draft for online RL with…

ba50fdc

… TRL

core, examples: harness rollout worker for online RL with TRL

c14145e

github-code-quality Bot found potential problems Jun 25, 2026

View reviewed changes

core, examples: minimal harness-rollout contract + runnable example f…

e2ed120

…or online RL with TRL

sergiopaniego changed the title ~~[Draft] core, examples: harness rollout worker for online RL with TRL~~ [Draft] core: harness-rollout contract + runnable example for online RL with TRL Jun 25, 2026

github-code-quality Bot found potential problems Jun 25, 2026

View reviewed changes

core: narrow interception start() error annotation to Exception

e1e4311

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Draft] core: harness-rollout contract + runnable example for online RL with TRL#864

[Draft] core: harness-rollout contract + runnable example for online RL with TRL#864
sergiopaniego wants to merge 4 commits into
mainfrom
feat/trl-harness-rollout-core

sergiopaniego commented Jun 25, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant



		class AgentSessionFactory(Protocol):
		def create(self, *, task: Any, rollout_id: str) -> AgentSession: ...

Uh oh!

Conversation

sergiopaniego commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

[Draft] core: harness-rollout contract + runnable example for online RL with TRL

Summary

Why

The two seams

Where the worker really lives

Type of Change

Alignment Checklist

RFC Status

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

sergiopaniego commented Jun 25, 2026 •

edited

Loading