Summary
This RFC proposes a generic recipe for integrating arbitrary third-party agents into the uni-agent training pipeline as black boxes. The training infrastructure treats the agent as an opaque entity that communicates with the policy model solely through a gateway. The gateway intercepts every LLM call, collects token-level trajectories, and feeds them to the RL trainer — all without any knowledge of the agent's internal control flow (tool orchestration, prompting strategy, state management).
Two reference implementations are provided: one built with uni-agent components, and one wrapping the third-party mini-swe-agent.
Motivation
Many mature agent frameworks (OpenHands, SWE-agent, mini-swe-agent, etc.) already have well-tuned interaction loops, tool integrations, and prompting strategies. Rewriting them to fit a specific training framework is costly and fragile.
A blackbox approach solves this: plug any agent into the training pipeline, and the gateway-transparent architecture ensures the RL trainer can observe and optimize every LLM call while the agent logic stays untouched.
This enables:
- Zero-cost agent migration: bring your existing agent, write a thin runner adapter, start training.
- Agent-agnostic training: swap between different agent implementations without changing training config.
- Decoupled iteration: improve agent logic and training hyperparameters independently.
Architecture
┌──────────────────────────────────────────────────────────────────────┐
│ Training Infrastructure │
│ │
│ ┌─────────┐ ┌──────────────────────────┐ ┌────────────────┐ │
│ │ GRPO / │────▶│ AgentFramework │───▶│ Reward Worker │ │
│ │ PPO │ │ (RolloutAdapter) │ │ compute_score()│ │
│ └─────────┘ └─────────┬──────────────────┘ └────────────────┘ │
│ │ ▲ │
│ _run_session() │ │
│ │ │ │
│ ┌───────────▼───────────┐ │ │
│ │ agent_runner() │ │ │
│ │ (you implement this) │ │ │
│ │ │ │ │
│ │ 1. Parse prompt │ │ │
│ │ 2. Start env (Docker) │ │ │
│ │ 3. Run your agent │ │ │
│ │ 4. Report completion │────────────────────┘ │
│ └────────────────────────┘ (optional: reward_info) │
│ │ │
│ agent calls LLM │
│ │ │
│ ┌─────────▼─────────┐ │
│ │ Gateway │ intercepts every LLM call, │
│ │ │ collects token trajectories, │
│ │ │ routes to vLLM / SGLang │
│ └───────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
The gateway is the key enabler: it sits between the agent and the policy model, making every LLM request observable to the trainer. The agent is unaware of this interception — it simply calls an OpenAI-compatible API endpoint. The trainer uses the collected token sequences (prompts, completions, logprobs) to compute policy gradients and update the model.
Reward Computation: Two Modes
The framework supports two patterns for computing reward:
Mode A — Standard (recommended): The runner only runs the agent and calls complete_session(). Reward evaluation happens in compute_score() on the reward worker.
agent_runner() → runs agent → complete_session()
│
reward worker
│
compute_score() ← evaluates reward
Mode B — In-process (optional): The runner evaluates reward inside itself (e.g., in the same Docker container the agent used), then passes reward_info via complete_session(reward_info={...}). The compute_score() function just reads extra_info["reward_score"]. This avoids spawning a second container for evaluation.
agent_runner() → runs agent → evaluates reward → complete_session(reward_info={...})
│
_score_trajectories()
merges reward_info → extra_info
│
compute_score()
reads extra_info["reward_score"]
Choose Mode B when reward evaluation reuses the same environment the agent already set up. Choose Mode A when reward can be computed independently or when you prefer simpler runner logic.
Key Contracts
| Interface |
Who implements |
Responsibility |
agent_runner() |
You |
Run your agent, optionally compute reward, call complete_session() |
compute_score() |
You |
Compute or extract reward score, return float |
AgentFramework subclass |
You (only for Mode B) |
Override _score_trajectories to merge reward_info into extra_info |
The framework handles everything else: LLM serving, gateway routing, rollout batching, RL advantages, checkpointing.
Integration Guide
Step 1: Write an agent runner
The runner is an async function with this signature:
async def my_agent_runner(
*,
raw_prompt, # str or list[dict] — the task
session: SessionHandle, # contains session_id, base_url (gateway endpoint)
sample_index: int, # sample index in the batch
session_runtime: SessionRuntime, # call complete_session() when done
tools_kwargs: dict | None = None, # per-sample config from dataset
**kwargs, # any extra runner_kwargs from training config
) -> None:
Your runner must:
- Parse the prompt —
raw_prompt is a string or chat message list.
- Create an environment — typically a Docker container with the task setup.
- Point your agent's LLM at the gateway — use
session.base_url as the OpenAI-compatible API endpoint. The agent treats it as a standard LLM API; the gateway handles interception transparently.
- Run your agent — call its existing loop, unmodified.
- Report completion — call
session_runtime.complete_session(session.session_id) (Mode A) or session_runtime.complete_session(session.session_id, reward_info={...}) (Mode B).
Error handling: If the runner fails, call complete_session(reward_info={"reward_score": 0.0}) before re-raising, so the framework doesn't hang.
Step 2: Write compute_score
Mode A (standard): Implement your full reward logic here.
def compute_score(data_source: str, solution_str: str, ground_truth: str, extra_info=None) -> float:
# Full reward evaluation logic
# e.g., run tests, compare outputs, check patches
score = evaluate_submission(solution_str, ground_truth)
return float(score)
Mode B (in-process reward): Just read the pre-computed score.
def compute_score(data_source: str, solution_str: str, ground_truth: str, extra_info=None) -> float:
if extra_info and "reward_score" in extra_info:
return float(extra_info["reward_score"])
return 0.0
Step 3: (Optional, Mode B only) Subclass AgentFramework
If your runner passes reward_info via complete_session(), inject it into extra_info:
from uni_agent.trainer.framework.framework import OpenAICompatibleAgentFramework
class MyFramework(OpenAICompatibleAgentFramework):
async def _score_trajectories(self, session_trajectories, sample_fields):
if session_trajectories and session_trajectories[-1].reward_info:
reward_info = session_trajectories[-1].reward_info
extra_info = dict(sample_fields.get("extra_info") or {})
sample_fields = {**sample_fields, "extra_info": {**extra_info, **reward_info}}
return await super()._score_trajectories(session_trajectories, sample_fields)
Step 4: Write training config
actor_rollout_ref:
rollout:
multi_turn:
enable: true
custom:
agent_framework:
agent_loop_manager_class: uni_agent.trainer.framework.entry.AgentFrameworkRolloutAdapter
framework_class_fqn: my_recipe.framework.MyFramework
agent_runner_fqn: my_recipe.runner.my_agent_runner
completion_timeout_seconds: 600
agent_runner_kwargs: {}
Step 5: Prepare dataset
Parquet format with columns:
| Column |
Type |
Description |
prompt |
str or list[dict] |
Task description / chat messages |
agent_name |
str |
Agent identifier (metadata) |
extra_info |
dict |
Must contain tools_kwargs with env and reward config |
Example extra_info.tools_kwargs:
{
"env": {
"image": "my-task-image:latest",
"post_setup_cmd": "cd /testbed && git checkout abc123"
},
"reward": {
"name": "swe_bench",
"metadata": {"instance_id": "repo__id-123", "patch": "diff --git ..."}
}
}
Step 6: Launch training
python3 -m verl.trainer.main_ppo \
--config-name=my_recipe \
--config-path=my_recipe/config \
actor_rollout_ref.model.path=/path/to/model \
data.train_files="['/path/to/train.parquet']" \
...
Reference Implementations
Runner built with uni-agent components
Uses uni-agent's built-in AgentInteraction loop with OpenAICompatibleChatModel pointing at the gateway. Demonstrates:
AgentEnv + AgentEnvConfig for Docker sandbox management
ToolsManager for tool call parsing
- In-process reward evaluation via
evaluate_in_env() (Mode B)
Files: examples/swe_agent_blackbox/agent_runner.py
Runner wrapping mini-swe-agent (third-party)
Wraps minisweagent's DefaultAgent + DockerEnvironment + LitellmModel. Demonstrates:
- Adapting a sync third-party agent to the async runner interface
DockerEnvForReward adapter bridging sync DockerEnvironment to async reward spec interface
- Running the agent in a thread executor (
loop.run_in_executor)
- In-process reward evaluation (Mode B)
Files: examples/swe_agent_blackbox/mini_swe_agent_runner.py
Shared reward infrastructure
reward.py — build_reward_context(), compute_score(), evaluate_in_env()
framework.py — SWEAgentFramework subclass injecting reward_info into extra_info
parallel_infer.py — Standalone inference runner (no training, just agent + scoring)
Runner Checklist
When integrating a new agent, verify:
| Item |
Check |
| Runner signature |
async def runner(*, raw_prompt, session, sample_index, session_runtime, tools_kwargs=None, **kwargs) -> None |
| LLM routing |
Agent's LLM client points at session.base_url (gateway) |
| Completion |
await session_runtime.complete_session(...) called on success and failure |
| Reward |
Either implement compute_score() (Mode A) or pass reward_info["reward_score"] (Mode B) |
| Cleanup |
Docker/env resources cleaned up in finally block |
| Env config |
DeployConfig discriminator type field present (default: "local") |
Limitations & Future Work
- Single-turn reward only: Current flow computes reward once after the agent finishes. Token-level or step-level reward is not yet supported.
- No trajectory visibility during training: The framework sees token sequences through the gateway but not the agent's tool calls or intermediate states. This is by design (black box) but limits some training approaches.
- Gateway coupling: The agent's LLM must use an OpenAI-compatible API and point at the gateway. Agents with custom LLM backends need adaptation.
- Sync agent overhead: Third-party sync agents (like mini-swe-agent) run via
run_in_executor, which blocks a thread per session. For high concurrency, async-native agents perform better.
Summary
This RFC proposes a generic recipe for integrating arbitrary third-party agents into the uni-agent training pipeline as black boxes. The training infrastructure treats the agent as an opaque entity that communicates with the policy model solely through a gateway. The gateway intercepts every LLM call, collects token-level trajectories, and feeds them to the RL trainer — all without any knowledge of the agent's internal control flow (tool orchestration, prompting strategy, state management).
Two reference implementations are provided: one built with uni-agent components, and one wrapping the third-party mini-swe-agent.
Motivation
Many mature agent frameworks (OpenHands, SWE-agent, mini-swe-agent, etc.) already have well-tuned interaction loops, tool integrations, and prompting strategies. Rewriting them to fit a specific training framework is costly and fragile.
A blackbox approach solves this: plug any agent into the training pipeline, and the gateway-transparent architecture ensures the RL trainer can observe and optimize every LLM call while the agent logic stays untouched.
This enables:
Architecture
The gateway is the key enabler: it sits between the agent and the policy model, making every LLM request observable to the trainer. The agent is unaware of this interception — it simply calls an OpenAI-compatible API endpoint. The trainer uses the collected token sequences (prompts, completions, logprobs) to compute policy gradients and update the model.
Reward Computation: Two Modes
The framework supports two patterns for computing reward:
Mode A — Standard (recommended): The runner only runs the agent and calls
complete_session(). Reward evaluation happens incompute_score()on the reward worker.Mode B — In-process (optional): The runner evaluates reward inside itself (e.g., in the same Docker container the agent used), then passes
reward_infoviacomplete_session(reward_info={...}). Thecompute_score()function just readsextra_info["reward_score"]. This avoids spawning a second container for evaluation.Choose Mode B when reward evaluation reuses the same environment the agent already set up. Choose Mode A when reward can be computed independently or when you prefer simpler runner logic.
Key Contracts
agent_runner()complete_session()compute_score()AgentFrameworksubclass_score_trajectoriesto mergereward_infointoextra_infoThe framework handles everything else: LLM serving, gateway routing, rollout batching, RL advantages, checkpointing.
Integration Guide
Step 1: Write an agent runner
The runner is an async function with this signature:
Your runner must:
raw_promptis a string or chat message list.session.base_urlas the OpenAI-compatible API endpoint. The agent treats it as a standard LLM API; the gateway handles interception transparently.session_runtime.complete_session(session.session_id)(Mode A) orsession_runtime.complete_session(session.session_id, reward_info={...})(Mode B).Error handling: If the runner fails, call
complete_session(reward_info={"reward_score": 0.0})before re-raising, so the framework doesn't hang.Step 2: Write compute_score
Mode A (standard): Implement your full reward logic here.
Mode B (in-process reward): Just read the pre-computed score.
Step 3: (Optional, Mode B only) Subclass AgentFramework
If your runner passes
reward_infoviacomplete_session(), inject it intoextra_info:Step 4: Write training config
Step 5: Prepare dataset
Parquet format with columns:
promptstrorlist[dict]agent_namestrextra_infodicttools_kwargswith env and reward configExample
extra_info.tools_kwargs:{ "env": { "image": "my-task-image:latest", "post_setup_cmd": "cd /testbed && git checkout abc123" }, "reward": { "name": "swe_bench", "metadata": {"instance_id": "repo__id-123", "patch": "diff --git ..."} } }Step 6: Launch training
python3 -m verl.trainer.main_ppo \ --config-name=my_recipe \ --config-path=my_recipe/config \ actor_rollout_ref.model.path=/path/to/model \ data.train_files="['/path/to/train.parquet']" \ ...Reference Implementations
Runner built with uni-agent components
Uses uni-agent's built-in
AgentInteractionloop withOpenAICompatibleChatModelpointing at the gateway. Demonstrates:AgentEnv+AgentEnvConfigfor Docker sandbox managementToolsManagerfor tool call parsingevaluate_in_env()(Mode B)Files:
examples/swe_agent_blackbox/agent_runner.pyRunner wrapping mini-swe-agent (third-party)
Wraps
minisweagent'sDefaultAgent+DockerEnvironment+LitellmModel. Demonstrates:DockerEnvForRewardadapter bridging sync DockerEnvironment to async reward spec interfaceloop.run_in_executor)Files:
examples/swe_agent_blackbox/mini_swe_agent_runner.pyShared reward infrastructure
reward.py—build_reward_context(),compute_score(),evaluate_in_env()framework.py—SWEAgentFrameworksubclass injectingreward_infointoextra_infoparallel_infer.py— Standalone inference runner (no training, just agent + scoring)Runner Checklist
When integrating a new agent, verify:
async def runner(*, raw_prompt, session, sample_index, session_runtime, tools_kwargs=None, **kwargs) -> Nonesession.base_url(gateway)await session_runtime.complete_session(...)called on success and failurecompute_score()(Mode A) or passreward_info["reward_score"](Mode B)finallyblockDeployConfigdiscriminatortypefield present (default:"local")Limitations & Future Work
run_in_executor, which blocks a thread per session. For high concurrency, async-native agents perform better.