Tau2 Green Agent evaluates purple agents on the tau2-bench customer service tasks and returns a structured artifact compatible with AgentBeats.
ghcr.io/shikibuton10x/tau2-green-agent:latest
ghcr.io/shikibuton10x/tau2-green-agent:v0.1.0
docker pull ghcr.io/shikibuton10x/tau2-green-agent:latest
From repo root (where compose.yaml is):
echo 'OPENAI_API_KEY=your-key' > .env
git clone --depth 1 https://github.com/sierra-research/tau2-bench.git tau2-bench
docker compose up --build -d
curl -sf http://localhost:9009/.well-known/agent-card.json > /dev/null
python3 - <<'PY' | curl -s http://localhost:9009/ -H 'Content-Type: application/json' -d @-
import json
rpc = {
    "jsonrpc": "2.0",
    "id": "1",
    "method": "message/send",
    "params": {
        "message": {
            "kind": "message",
            "role": "user",
            "messageId": "m1",
            "contextId": "c1",
            "parts": [
                {"kind": "text", "text": json.dumps({"participants": {"agent": "http://purple:9019"}, "config": {"domain": "mock", "num_tasks": 1}})}
            ],
        }
    }
}
print(json.dumps(rpc))
PY
docker compose down
The Result artifact should include keys like pass_rate, time_used, summary, and tasks (pass_rate may be 0; this is an infra/E2E wiring check):
{
  "pass_rate": 0.0,
  "time_used": 12.34,
  "summary": {
    "pass_rate": 0.0,
    "passed": 0,
    "total": 1,
    "time_used_sec": 12.34
  },
  "tasks": [
    {
      "task_id": "mock-...",
      "passed": false,
      "reward": 0,
      "duration_sec": 12.34
    }
  ]
}
- Green: 9009 (host)
- Purple: 9019 (host)
Host check:
curl -s http://localhost:9009/.well-known/agent-card.json
- Python: >=3.13 (see pyproject.toml)
- Package manager: uv
- Pinning strategy:
  - This project uses uv.lock as the source of truth for exact versions.
  - Local installs should prefer frozen sync: uv sync --frozen (and --extra test if needed).
  - Docker builds copy pyproject.toml + uv.lock and run uv sync --frozen (see Dockerfile).
- tau2-bench dependency:
  - The tau2 package is installed from the sierra-research/tau2-bench repo, pinned to a specific git commit via pyproject.toml: 337326e62d8e0ca74c353b004a9c5d748e0ba914
  - The same pin is recorded in uv.lock.
If you change dependencies, regenerate the lockfile (uv lock) before building/testing.
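As a quick sanity check after regenerating the lockfile, the following sketch compares the tau2-bench commit pinned in pyproject.toml with the one in uv.lock. It is not part of this repository: the helper names are hypothetical, and it assumes the commit appears as a 40-character hex string on a line that also mentions "tau2-bench" in both files.

```python
import re

# Assumption: the pin is a full 40-character hex commit hash that sits on a
# line mentioning "tau2-bench" in both pyproject.toml and uv.lock.
_COMMIT_RE = re.compile(r"\b[0-9a-f]{40}\b")


def find_tau2_pins(text: str) -> set[str]:
    """Collect commit hashes from lines that mention tau2-bench."""
    pins: set[str] = set()
    for line in text.splitlines():
        if "tau2-bench" in line:
            pins.update(_COMMIT_RE.findall(line))
    return pins


def pins_agree(pyproject_text: str, lock_text: str) -> bool:
    """True when every commit declared in pyproject.toml also appears in uv.lock."""
    declared = find_tau2_pins(pyproject_text)
    locked = find_tau2_pins(lock_text)
    return bool(declared) and declared <= locked
```

To use it, read both files (for example with `pathlib.Path("pyproject.toml").read_text()`) and pass the contents to `pins_agree`.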
- mock
- airline
- retail
- telecom
Send a JSON message to the agent with participants and config:
{
  "participants": {"agent": "http://localhost:9019"},
  "config": {
    "domain": "mock",
    "num_tasks": 1,
    "seed": 0,
    "timeout_seconds": 300,
    "max_steps": 50,
    "retries": 2
  }
}
- domain: "mock"
- num_tasks: 1 (range 1..50)
- seed: 0
- timeout_seconds: 300 (per task)
- max_steps: 50
- retries: 2 (range 0..5)
Optional:
- task_ids: list of task ids
- user_llm: default "openai/gpt-4.1"
- user_llm_args: default { "temperature": 0.0 }
Invalid values return an A2A rejection with a clear error message.
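To catch bad values before sending a request, a client-side check along these lines may help. It is only a sketch of the rules listed above, not the agent's own validation code: `validate_config` is a hypothetical helper, it assumes the listed values are the defaults, and it checks only the ranges this README states (domain, num_tasks, retries).

```python
ALLOWED_DOMAINS = {"mock", "airline", "retail", "telecom"}

# Assumption: the values listed in this README are the defaults.
DEFAULTS = {
    "domain": "mock",
    "num_tasks": 1,
    "seed": 0,
    "timeout_seconds": 300,
    "max_steps": 50,
    "retries": 2,
}


def _is_int(value: object) -> bool:
    """True for real integers; bool is excluded even though it subclasses int."""
    return isinstance(value, int) and not isinstance(value, bool)


def validate_config(config: dict) -> dict:
    """Merge config over the defaults and raise ValueError on documented violations."""
    merged = {**DEFAULTS, **config}
    errors = []
    if merged["domain"] not in ALLOWED_DOMAINS:
        errors.append(f"domain must be one of {sorted(ALLOWED_DOMAINS)}")
    if not _is_int(merged["num_tasks"]) or not 1 <= merged["num_tasks"] <= 50:
        errors.append("num_tasks must be an integer in 1..50")
    if not _is_int(merged["retries"]) or not 0 <= merged["retries"] <= 5:
        errors.append("retries must be an integer in 0..5")
    if errors:
        raise ValueError("; ".join(errors))
    return merged
```

For example, `validate_config({"domain": "retail", "num_tasks": 5})` returns the merged config, while `validate_config({"retries": 9})` raises `ValueError`.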
The artifact keeps backward-compatible keys and adds structured diagnostics:
- pass_rate (float)
- time_used (float, seconds)
- task_rewards (dict task_id -> reward)
- summary: { pass_rate, passed, total, time_used_sec }
- config: { domain, num_tasks, seed, timeout_seconds, max_steps, retries }
- tasks: list of { task_id, passed, reward, duration_sec, turns, tool_calls, failure_reason, error }
- system: { green_agent_version, tau2_bench_version }
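A consumer of the artifact could check its shape with a small helper like the one below. This is a sketch with a hypothetical function name. The key names come from the lists in this README; the two count checks (summary.total equals the number of task entries, summary.passed equals the number of tasks with passed set to true) are assumptions about how the fields relate, not documented guarantees.

```python
REQUIRED_TOP = ("pass_rate", "time_used", "summary", "tasks")
REQUIRED_SUMMARY = ("pass_rate", "passed", "total", "time_used_sec")
REQUIRED_TASK = ("task_id", "passed", "reward", "duration_sec")


def check_artifact(artifact: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the shape looks right."""
    problems = [f"missing key: {k}" for k in REQUIRED_TOP if k not in artifact]

    summary = artifact.get("summary") or {}
    problems += [f"missing summary key: {k}" for k in REQUIRED_SUMMARY if k not in summary]

    tasks = artifact.get("tasks") or []
    for i, task in enumerate(tasks):
        problems += [f"tasks[{i}] missing key: {k}" for k in REQUIRED_TASK if k not in task]

    # Assumption: summary counts mirror the per-task entries.
    if "total" in summary and summary["total"] != len(tasks):
        problems.append("summary.total does not match number of tasks")
    passed = sum(1 for t in tasks if t.get("passed") is True)
    if "passed" in summary and summary["passed"] != passed:
        problems.append("summary.passed does not match passed tasks")
    return problems
```

An empty return value only says the wiring looks right; it says nothing about whether the purple agent performed well.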
Prerequisites:
- uv
- OPENAI_API_KEY
- tau2-bench data directory (set TAU2_DATA_DIR)
# install deps
uv sync --frozen --extra test
# run the server
export TAU2_DATA_DIR="$(pwd)/tau2-bench/data"
export OPENAI_API_KEY=your-key
uv run src/server.py --host 127.0.0.1 --port 9009 --card-url http://localhost:9009/
# agent card
curl -s http://localhost:9009/.well-known/agent-card.json
Start the purple agent (baseline from agentbeats-tutorial):
export OPENAI_API_KEY=your-key
uv run agentbeats-tutorial/scenarios/tau2/tau2_agent.py --host 127.0.0.1 --port 9019
Send an EvalRequest:
python3 - <<'PY' | curl -s http://localhost:9009/ -H 'Content-Type: application/json' -d @-
import json
rpc = {
    "jsonrpc": "2.0",
    "id": "1",
    "method": "message/send",
    "params": {
        "message": {
            "kind": "message",
            "role": "user",
            "messageId": "m1",
            "contextId": "c1",
            "parts": [
                {"kind": "text", "text": json.dumps({"participants": {"agent": "http://localhost:9019"}, "config": {"domain": "mock", "num_tasks": 1}})}
            ],
        }
    }
}
print(json.dumps(rpc))
PY
Build and run the green agent with data mounted:
docker build -t tau2-green-agent:local -f Dockerfile .
# mount tau2-bench data and provide OPENAI_API_KEY
docker run --rm -p 9009:9009 \
  -e OPENAI_API_KEY="$OPENAI_API_KEY" \
  -e TAU2_DATA_DIR=/data \
  -v "$(pwd)/tau2-bench/data:/data:ro" \
  tau2-green-agent:local --host 0.0.0.0 --port 9009 --card-url http://localhost:9009/
When Green runs in Docker and Purple runs on the host, use host.docker.internal.
Also, Purple must advertise a container-reachable agent-card.url via --card-url.
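The URL rewrite this implies can be sketched as follows. `container_reachable` is a hypothetical helper, not part of either agent: it swaps a host-only address for host.docker.internal so a containerized Green can reach a Purple agent running on the host, and leaves other hosts (such as a Compose service name) untouched. Userinfo in the URL, if any, is dropped.

```python
from urllib.parse import urlsplit, urlunsplit

# Addresses that only make sense from the host's own point of view.
HOST_ONLY = {"localhost", "127.0.0.1", "0.0.0.0"}


def container_reachable(url: str, host_alias: str = "host.docker.internal") -> str:
    """Rewrite a host-only URL so it resolves from inside a container."""
    parts = urlsplit(url)
    if parts.hostname not in HOST_ONLY:
        return url  # e.g. http://purple:9019 on a compose network
    netloc = host_alias if parts.port is None else f"{host_alias}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))
```

The same value should be used in both places: the participants.agent entry of the EvalRequest and the --card-url passed to Purple.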
# 1) Purple (host)
git clone --depth 1 https://github.com/RDI-Foundation/agentbeats-tutorial.git agentbeats-tutorial
cd agentbeats-tutorial
export OPENAI_API_KEY=your-key
uv run python scenarios/tau2/tau2_agent.py \
--host 0.0.0.0 --port 9019 \
--card-url http://host.docker.internal:9019/
# 2) Green (docker)
cd ..
git clone --depth 1 https://github.com/sierra-research/tau2-bench.git tau2-bench
docker build -t tau2-green-agent:local -f Dockerfile .
docker run -d --rm --name tau2-green-dockerhost \
-p 9009:9009 \
-e OPENAI_API_KEY=$OPENAI_API_KEY \
-e TAU2_DATA_DIR=/data \
-v "$(pwd)/tau2-bench/data:/data:ro" \
tau2-green-agent:local \
--host 0.0.0.0 --port 9009 --card-url http://localhost:9009/
# 3) E2E EvalRequest (host -> green)
python3 - <<'PY'
import asyncio, json
import httpx
rpc = {
    "jsonrpc": "2.0",
    "id": "1",
    "method": "message/send",
    "params": {
        "message": {
            "kind": "message",
            "role": "user",
            "messageId": "m1",
            "contextId": "c1",
            "parts": [
                {"kind": "text", "text": json.dumps({"participants": {"agent": "http://host.docker.internal:9019"}, "config": {"domain": "mock", "num_tasks": 1}})}
            ],
        }
    }
}

async def main():
    async with httpx.AsyncClient(timeout=300) as c:
        r = await c.post("http://localhost:9009/", json=rpc)
        r.raise_for_status()
        print(r.text[:2000])

asyncio.run(main())
PY
docker stop tau2-green-dockerhost
From repo root:
export OPENAI_API_KEY=your-key
export TAU2_DATA_DIR="$(pwd)/tau2-bench/data"
docker compose up --build
uv sync --frozen --extra test
uv run pytest
- Missing API key: set OPENAI_API_KEY (Docker Compose uses .env in repo root).
- Port conflict: stop other processes or change ports (9009/9019).
- 503 / cannot reach Purple: when running via Docker Compose, the EvalRequest must use "participants":{"agent":"http://purple:9019"} (service name), and the agent-card URLs must be reachable from the caller (host vs compose network).