Skip to content

[RFC]blackbox agent integration #34

@zhaizhiqiangA

Description

@zhaizhiqiangA

Summary

This RFC proposes a generic recipe for integrating arbitrary third-party agents into the uni-agent training pipeline as black boxes. The training infrastructure treats the agent as an opaque entity that communicates with the policy model solely through a gateway. The gateway intercepts every LLM call, collects token-level trajectories, and feeds them to the RL trainer — all without any knowledge of the agent's internal control flow (tool orchestration, prompting strategy, state management).

Two reference implementations are provided: one built with uni-agent components, and one wrapping the third-party mini-swe-agent.

Motivation

Many mature agent frameworks (OpenHands, SWE-agent, mini-swe-agent, etc.) already have well-tuned interaction loops, tool integrations, and prompting strategies. Rewriting them to fit a specific training framework is costly and fragile.

A blackbox approach solves this: plug any agent into the training pipeline, and the gateway-transparent architecture ensures the RL trainer can observe and optimize every LLM call while the agent logic stays untouched.

This enables:

  • Zero-cost agent migration: bring your existing agent, write a thin runner adapter, start training.
  • Agent-agnostic training: swap between different agent implementations without changing training config.
  • Decoupled iteration: improve agent logic and training hyperparameters independently.

Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                        Training Infrastructure                        │
│                                                                       │
│  ┌─────────┐     ┌──────────────────────────┐    ┌────────────────┐  │
│  │  GRPO /  │────▶│    AgentFramework         │───▶│  Reward Worker │  │
│  │  PPO     │     │  (RolloutAdapter)          │    │ compute_score()│  │
│  └─────────┘     └─────────┬──────────────────┘    └────────────────┘  │
│                             │                               ▲           │
│                      _run_session()                          │           │
│                             │                                │           │
│                 ┌───────────▼───────────┐                    │           │
│                 │   agent_runner()      │                    │           │
│                 │ (you implement this)   │                    │           │
│                 │                        │                    │           │
│                 │  1. Parse prompt       │                    │           │
│                 │  2. Start env (Docker)  │                    │           │
│                 │  3. Run your agent     │                    │           │
│                 │  4. Report completion  │────────────────────┘           │
│                 └────────────────────────┘  (optional: reward_info)      │
│                             │                                           │
│                     agent calls LLM                                     │
│                             │                                           │
│                   ┌─────────▼─────────┐                                 │
│                   │     Gateway       │  intercepts every LLM call,      │
│                   │                   │  collects token trajectories,    │
│                   │                   │  routes to vLLM / SGLang         │
│                   └───────────────────┘                                 │
└──────────────────────────────────────────────────────────────────────┘

The gateway is the key enabler: it sits between the agent and the policy model, making every LLM request observable to the trainer. The agent is unaware of this interception — it simply calls an OpenAI-compatible API endpoint. The trainer uses the collected token sequences (prompts, completions, logprobs) to compute policy gradients and update the model.

Reward Computation: Two Modes

The framework supports two patterns for computing reward:

Mode A — Standard (recommended): The runner only runs the agent and calls complete_session(). Reward evaluation happens in compute_score() on the reward worker.

agent_runner() → runs agent → complete_session()
                                          │
                                     reward worker
                                          │
                               compute_score() ← evaluates reward

Mode B — In-process (optional): The runner evaluates reward inside itself (e.g., in the same Docker container the agent used), then passes reward_info via complete_session(reward_info={...}). The compute_score() function just reads extra_info["reward_score"]. This avoids spawning a second container for evaluation.

agent_runner() → runs agent → evaluates reward → complete_session(reward_info={...})
                                                                    │
                                                          _score_trajectories()
                                                          merges reward_info → extra_info
                                                                    │
                                                          compute_score()
                                                          reads extra_info["reward_score"]

Choose Mode B when reward evaluation reuses the same environment the agent already set up. Choose Mode A when reward can be computed independently or when you prefer simpler runner logic.

Key Contracts

Interface Who implements Responsibility
agent_runner() You Run your agent, optionally compute reward, call complete_session()
compute_score() You Compute or extract reward score, return float
AgentFramework subclass You (only for Mode B) Override _score_trajectories to merge reward_info into extra_info

The framework handles everything else: LLM serving, gateway routing, rollout batching, RL advantages, checkpointing.

Integration Guide

Step 1: Write an agent runner

The runner is an async function with this signature:

async def my_agent_runner(
    *,
    raw_prompt,                          # str or list[dict] — the task
    session: SessionHandle,              # contains session_id, base_url (gateway endpoint)
    sample_index: int,                   # sample index in the batch
    session_runtime: SessionRuntime,     # call complete_session() when done
    tools_kwargs: dict | None = None,    # per-sample config from dataset
    **kwargs,                            # any extra runner_kwargs from training config
) -> None:

Your runner must:

  1. Parse the promptraw_prompt is a string or chat message list.
  2. Create an environment — typically a Docker container with the task setup.
  3. Point your agent's LLM at the gateway — use session.base_url as the OpenAI-compatible API endpoint. The agent treats it as a standard LLM API; the gateway handles interception transparently.
  4. Run your agent — call its existing loop, unmodified.
  5. Report completion — call session_runtime.complete_session(session.session_id) (Mode A) or session_runtime.complete_session(session.session_id, reward_info={...}) (Mode B).

Error handling: If the runner fails, call complete_session(reward_info={"reward_score": 0.0}) before re-raising, so the framework doesn't hang.

Step 2: Write compute_score

Mode A (standard): Implement your full reward logic here.

def compute_score(data_source: str, solution_str: str, ground_truth: str, extra_info=None) -> float:
    # Full reward evaluation logic
    # e.g., run tests, compare outputs, check patches
    score = evaluate_submission(solution_str, ground_truth)
    return float(score)

Mode B (in-process reward): Just read the pre-computed score.

def compute_score(data_source: str, solution_str: str, ground_truth: str, extra_info=None) -> float:
    if extra_info and "reward_score" in extra_info:
        return float(extra_info["reward_score"])
    return 0.0

Step 3: (Optional, Mode B only) Subclass AgentFramework

If your runner passes reward_info via complete_session(), inject it into extra_info:

from uni_agent.trainer.framework.framework import OpenAICompatibleAgentFramework

class MyFramework(OpenAICompatibleAgentFramework):
    async def _score_trajectories(self, session_trajectories, sample_fields):
        if session_trajectories and session_trajectories[-1].reward_info:
            reward_info = session_trajectories[-1].reward_info
            extra_info = dict(sample_fields.get("extra_info") or {})
            sample_fields = {**sample_fields, "extra_info": {**extra_info, **reward_info}}
        return await super()._score_trajectories(session_trajectories, sample_fields)

Step 4: Write training config

actor_rollout_ref:
  rollout:
    multi_turn:
      enable: true
    custom:
      agent_framework:
        agent_loop_manager_class: uni_agent.trainer.framework.entry.AgentFrameworkRolloutAdapter
        framework_class_fqn: my_recipe.framework.MyFramework
        agent_runner_fqn: my_recipe.runner.my_agent_runner
        completion_timeout_seconds: 600
        agent_runner_kwargs: {}

Step 5: Prepare dataset

Parquet format with columns:

Column Type Description
prompt str or list[dict] Task description / chat messages
agent_name str Agent identifier (metadata)
extra_info dict Must contain tools_kwargs with env and reward config

Example extra_info.tools_kwargs:

{
  "env": {
    "image": "my-task-image:latest",
    "post_setup_cmd": "cd /testbed && git checkout abc123"
  },
  "reward": {
    "name": "swe_bench",
    "metadata": {"instance_id": "repo__id-123", "patch": "diff --git ..."}
  }
}

Step 6: Launch training

python3 -m verl.trainer.main_ppo \
    --config-name=my_recipe \
    --config-path=my_recipe/config \
    actor_rollout_ref.model.path=/path/to/model \
    data.train_files="['/path/to/train.parquet']" \
    ...

Reference Implementations

Runner built with uni-agent components

Uses uni-agent's built-in AgentInteraction loop with OpenAICompatibleChatModel pointing at the gateway. Demonstrates:

  • AgentEnv + AgentEnvConfig for Docker sandbox management
  • ToolsManager for tool call parsing
  • In-process reward evaluation via evaluate_in_env() (Mode B)

Files: examples/swe_agent_blackbox/agent_runner.py

Runner wrapping mini-swe-agent (third-party)

Wraps minisweagent's DefaultAgent + DockerEnvironment + LitellmModel. Demonstrates:

  • Adapting a sync third-party agent to the async runner interface
  • DockerEnvForReward adapter bridging sync DockerEnvironment to async reward spec interface
  • Running the agent in a thread executor (loop.run_in_executor)
  • In-process reward evaluation (Mode B)

Files: examples/swe_agent_blackbox/mini_swe_agent_runner.py

Shared reward infrastructure

  • reward.pybuild_reward_context(), compute_score(), evaluate_in_env()
  • framework.pySWEAgentFramework subclass injecting reward_info into extra_info
  • parallel_infer.py — Standalone inference runner (no training, just agent + scoring)

Runner Checklist

When integrating a new agent, verify:

Item Check
Runner signature async def runner(*, raw_prompt, session, sample_index, session_runtime, tools_kwargs=None, **kwargs) -> None
LLM routing Agent's LLM client points at session.base_url (gateway)
Completion await session_runtime.complete_session(...) called on success and failure
Reward Either implement compute_score() (Mode A) or pass reward_info["reward_score"] (Mode B)
Cleanup Docker/env resources cleaned up in finally block
Env config DeployConfig discriminator type field present (default: "local")

Limitations & Future Work

  • Single-turn reward only: Current flow computes reward once after the agent finishes. Token-level or step-level reward is not yet supported.
  • No trajectory visibility during training: The framework sees token sequences through the gateway but not the agent's tool calls or intermediate states. This is by design (black box) but limits some training approaches.
  • Gateway coupling: The agent's LLM must use an OpenAI-compatible API and point at the gateway. Agents with custom LLM backends need adaptation.
  • Sync agent overhead: Third-party sync agents (like mini-swe-agent) run via run_in_executor, which blocks a thread per session. For high concurrency, async-native agents perform better.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions