| title | ComtradeBench: An OpenEnv Benchmark for Reliable LLM Tool-Use Under Adversarial API Conditions |
|---|---|
| emoji | 📊 |
| colorFrom | indigo |
| colorTo | gray |
| sdk | docker |
| pinned | false |
| app_port | 8000 |
| base_path | /web |
| tags | |

What happens when Kimi, Claude, GPT-5, Llama, and open-source Qwen2.5-7B all run the same 10 API tasks against the same seeded environment and the same deterministic 6-dimensional judge?
- Kimi and Claude produce numerically identical scores on every single task — not close, identical (97.5 avg each).
- A mid-size instruction-tuned open model (Qwen2.5-7B-Instruct), zero-shot (no training), scores 97.2 — within 0.3 points of closed-source frontier. Note: Llama 3.3 70B (also open-source) scores only 89.3, so the relevant axis is instruction quality + size at the 7B class, not "open vs closed" as a crude dichotomy.
- GPT-5 scores 21.8 points lower than frontier on one specific task — the one with mid-episode fault escalation. Not because it's less capable, but because it reasons for 223 seconds across 2 tool calls instead of executing 7 tool calls in 8 seconds.
- Llama is bimodal on that same task, spanning 18.7 to 97.5 across seeds — the discriminative signal is reliability, not capability.
And when we try to GRPO-train a Qwen2.5-3B on the benchmark? It enters the learning window at iter 3, trains cleanly through iter 14, then policy-collapses at iter 15 into a degenerate output region.
ComtradeBench surfaces these failure modes because it measures execution reliability — not correctness, not reasoning, not fluency. The benchmark is adversarial by design: fault injection, non-stationary dynamics, and multi-dimensional scoring that reward agents who do the job right, not agents who return something plausible.
AgentBeats Phase 2 — OpenEnv Challenge | Author: MateFin GitHub · Env Space · Blog
For judges — 30-second summary:
- Ten-task OpenEnv benchmark for LLM agent reliability under adversarial API conditions (429/500, pagination drift, duplicates, totals traps, within-episode fault escalation, constrained budgets).
- Five LLMs evaluated cross-model: Kimi Moonshot V1-128k, Claude Sonnet 4.6, open-source Qwen2.5-7B-Instruct (zero-shot), GPT-5, Llama 3.3 70B. Plus three Qwen2.5 sizes trained with GRPO (1.5B full-param, 3B + LoRA, 7B + LoRA).
- Four independent findings: (1) T9 separates execution-oriented from reasoning-oriented frontier (Kimi/Claude 97.5 vs GPT-5 75.7), (2) Kimi = Claude numerically identical → ceiling saturation, (3) Llama T9 bimodal → sub-frontier is about reliability not capability, (4) ⭐ A mid-size instruction-tuned open model (Qwen2.5-7B, zero-shot) matches closed-source frontier (97.2 vs 97.5) — benchmark solvable by the 7B instruction-tuned class without any training. Note: this is not a blanket "open vs closed" claim — Llama 3.3 70B (also open) scores only 89.3.
- GRPO operating envelope mapped at three points (under-capacity / learn-then-collapse / saturation) — an actionable finding, not a "we trained something" claim.
- Results live and reproducible in the HF Docker Space.
Most API-task benchmarks (ToolBench, τ-bench, BFCL, API-Bank) evaluate whether an agent retrieves the correct answer from a clean API. Production APIs are rarely clean. Rate-limiters fire, pages reshuffle, duplicates appear, totals rows contaminate aggregates, request budgets bite. A pretrained LLM that nails the clean benchmark can still break in production because the execution behaviour was never tested — it was optimised for right answers, not for handling adversity.
We built ComtradeBench to close that gap. The adversarial bits are in the environment, not in the prompts or the labels, so an agent cannot route around them by rephrasing. The scoring is six-dimensional (correctness, completeness, robustness, efficiency, data quality, observability) so fluent-looking output from broken execution gets penalised on the five dimensions it fails.
The ten tasks cover pagination, deduplication, 429 / 500 retries, non-deterministic page ordering, totals-row filtering, mixed-fault combinations, within-episode fault escalation (T9) where the environment gets harder as the agent makes progress, and constrained request budgets (T10) where the agent has half the normal quota. The rule-based baseline scores 96.8 / 100: a bar that a competent rule-following agent should clear, but one we found to be a non-trivial target for LLMs without careful prompting.
| Agent | Avg (T1-T10) | T9 | Notes |
|---|---|---|---|
| Rule-based baseline | 96.8 | 96.9 | deterministic, no LLM |
| Kimi Moonshot V1-128k | 97.5 | 97.5 | closed-source frontier, multi-seed std = 0.0 |
| Claude Sonnet 4.6 | 97.5 | 97.5 | closed-source frontier, identical to Kimi |
| Qwen2.5-7B-Instruct ⭐ | 97.2 | 97.5 | open-source, zero-shot (no fine-tuning, no training) |
| GPT-5 | 93.2 | 75.7 | reasoning-oriented: 2 steps in 223 s vs Kimi's 7 steps in 8 s |
| Llama 3.3 70B (Groq) | 89.3 | 18.7–97.5† | bimodal across seeds |
† Llama T9 is bimodal: the published seed-42 run hit 18.7, multi-seed re-run produced {97.5, 94.5, …} — the low number and the near-frontier numbers both reproduce. Raw per-seed data in multiseed_llama_t9_summary.json.
⭐ Mid-size instruction-tuned model parity: Qwen2.5-7B-Instruct, run zero-shot (no training, no fine-tuning) via Together AI, averages 97.2 / 100 and scores 97.5 on T9. That is within 0.3 points of the closed-source frontier (Kimi, Claude), above the baseline, and 4.0 points above GPT-5. Llama 3.3 70B (also open-source) scores only 89.3, so the finding is not "open-source matches closed-source" as a blanket claim; it is "a strongly instruction-tuned 7B is enough for this benchmark". The relevant axis is instruction quality + size class, not licensing. Raw data in llm_results_qwen7b_zeroshot.json.
- T9 separates execution-oriented from reasoning-oriented frontier. Kimi and Claude execute T9 in ~8 s across 7 tool calls and score 97.5. GPT-5 "thinks" for 223 s across 2 tool calls and scores 75.7 — a 21.8-point gap between frontier models that a pass/fail benchmark would miss entirely. The breakdown tells the story: GPT-5's Efficiency drops to 6/15 (budget burned in reasoning-time) and Observability to ~4/10 (2 steps leave no audit trail).
- Frontier saturates at the top. Kimi and Claude produce numerically identical per-task scores on all 10 tasks. Same seeded environment, same deterministic judge, same solve-path → same score.
- Sub-frontier is high-variance, not uniformly weak. Kimi T9 std = 0.0 across 5 seeds. Llama T9 spans 18.7 – 97.5. The discriminative signal is reliability, not capability: Llama can sometimes match frontier, just not consistently.
- ⭐ Mid-size instruction-tuned 7B closes the gap to frontier without training. Qwen2.5-7B-Instruct, zero-shot (no fine-tuning, no GRPO), scores 97.2 — within 0.3 points of Kimi/Claude, above GPT-5 and Llama 3.3 70B. This is not a blanket "open-source matches closed-source" claim: Llama 3.3 70B (open) scores only 89.3, and GPT-5 (closed) scores 93.2. The axis that matters is instruction-tuning quality at the 7B size class, not licensing. This finding reframes what the benchmark measures: not "can a closed frontier LLM solve this" but "can an execution-oriented agent do the job reliably", and a strongly instruction-tuned 7B clears that bar. It also validates the GRPO saturation finding — 7B genuinely is at ceiling for this benchmark, which is why GRPO fine-tuning provides no gradient signal.
Three training configurations, three distinct failure modes. 1.5B full-param: under-capacity, reward oscillates 0.22–0.94 with no trend. 3B + LoRA: learns cleanly through iter 14 (KL grows monotonically 8e-6 → 5.6e-4), then policy-collapses at iter 15. 7B + LoRA: mean reward 0.987 at iter 1, already above baseline, so the GRPO advantage signal is near zero and no gradient propagates.
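To see why a saturated policy yields no learning signal, here is a minimal sketch of group-relative advantage normalisation, the general idea behind GRPO. It is not the code in agent/train_grpo.py; the function name `group_advantages` and the `eps` value are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalise each rollout's reward against its own group (hypothetical helper)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Saturated group: every rollout scores the same, so every advantage is zero.
saturated = group_advantages([0.987, 0.987, 0.987, 0.987])

# Mixed group: reward variance produces non-zero advantages to learn from.
mixed = group_advantages([0.22, 0.94, 0.50, 0.73])
```

When all rollouts in a group earn the same reward, as in the 7B case, the normalised advantages vanish and the policy-gradient term has nothing to push on.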
Reading the envelope: the useful GRPO training band exists (iters 3-14 of the 3B run are empirical proof — real reward variance, monotonically growing KL), but it is narrow and fragile. Stable training on the 3B point requires adaptive KL penalty, tighter trust-region clipping, or early-stop on reward-variance collapse — engineering work we did not perform in this release. This is a more actionable finding than "training converged on some model": it names concrete failure modes a practitioner would hit.
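One of the mitigations named above, early-stop on reward-variance collapse, could look roughly like the sketch below. This was not part of the release; the class name, the `min_std` threshold, and the `patience` value are all hypothetical and would need tuning.

```python
from statistics import pstdev


class VarianceCollapseStopper:
    """Stop training after `patience` consecutive low-variance iterations."""

    def __init__(self, min_std=0.01, patience=2):
        self.min_std = min_std    # hypothetical threshold
        self.patience = patience  # hypothetical streak length
        self.low_streak = 0

    def should_stop(self, rollout_rewards):
        # An iteration with no valid rollouts is treated like zero variance.
        if not rollout_rewards or pstdev(rollout_rewards) < self.min_std:
            self.low_streak += 1
        else:
            self.low_streak = 0
        return self.low_streak >= self.patience


stopper = VarianceCollapseStopper()
history = [[0.1, 0.7, 0.4, 0.9], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
decisions = [stopper.should_stop(rewards) for rewards in history]
# decisions == [False, False, True]
```

Note that the same check also fires on a saturated run, where the variance is low because every rollout already succeeds; in both cases there is little left for GRPO to learn from.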
All training data is committed as artifacts: grpo_gradient_training.jsonl (1.5B per-iter metrics), grpo_gradient_training_3b.jsonl (3B per-iter, 15 entries), grpo_3b_lora_collapse.json (3B interpretation), grpo_7b_lora_5iter_saturation.json (7B interpretation). The same environment code runs in-process during GRPO rollouts and as the deployed Docker service during eval — zero divergence.
Most API-task benchmarks evaluate whether an agent retrieves the correct answer from a clean API. ComtradeBench evaluates whether the agent executes correctly when the API actively resists correct execution:
- T3, T8: cross-page duplicate records can overcount rows and inflate trade totals.
- T4, T8: HTTP 429 rate limits can create missing pages if the agent advances too early.
- T5: HTTP 500 transient failures can leave silent data gaps when retry is skipped.
- T6: non-deterministic page ordering breaks agents that assume stable row position.
- T7: synthetic totals rows (is_total=true) contaminate aggregates unless filtered.
- T9: adaptive fault escalation tests whether policy still holds under mid-episode shift.
- T10: a halved request budget exposes redundant fetches and incomplete retrieval plans.
The agent has three MCP tools and 100 requests. The six-dimensional judge scores correctness, completeness, robustness, efficiency, data quality, and observability. There is no partial credit for correct-sounding output from an incorrect execution.
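As an illustration of the duplicate and totals-row traps, the sketch below drops is_total rows and deduplicates across pages before aggregating. The row schema is an assumption: the `id` and `value` fields and the `clean_rows` helper are hypothetical, and only the is_total flag comes from the task description.

```python
def clean_rows(pages, key_fields=("id",)):
    """Drop totals rows and cross-page duplicates (hypothetical helper)."""
    seen = set()
    cleaned = []
    for page in pages:
        for row in page:
            if row.get("is_total"):
                continue  # synthetic summary row, not a real record
            key = tuple(row[f] for f in key_fields)
            if key in seen:
                continue  # same record already seen on an earlier page
            seen.add(key)
            cleaned.append(row)
    return cleaned


pages = [
    [{"id": 1, "value": 10}, {"id": 2, "value": 20}],
    [{"id": 2, "value": 20}, {"id": 3, "value": 30}],      # id 2 repeats
    [{"id": "TOTAL", "value": 60, "is_total": True}],      # totals trap
]
naive_total = sum(row["value"] for page in pages for row in page)
clean_total = sum(row["value"] for row in clean_rows(pages))
```

Here the naive sum counts the duplicate and the totals row, while the cleaned sum covers each real record once.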
comtrade_env/
├── README.md # This file
├── blog_post.md # Submission blog post
├── openenv.yaml # OpenEnv manifest
├── pyproject.toml # Environment dependencies
├── Dockerfile # Container image
├── __init__.py # Module exports
├── client.py # ComtradeEnv HTTP/WebSocket client
├── models.py # ComtradeAction / ComtradeObservation
├── server/ # Environment + mock service
│ ├── app.py # FastAPI app (HTTP + WebSocket)
│ ├── comtrade_env_environment.py # Core MCP environment logic
│ ├── tasks.py # Task definitions (T1–T10)
│ ├── judge.py # Scoring engine (6 dimensions)
│ ├── mock_service/ # Embedded mock Comtrade API
│ │ ├── app.py # FastAPI mock with fault injection
│ │ └── fixtures/ # Ground-truth data (seeded RNG)
│ ├── Dockerfile # Server container image
│ └── requirements.txt
├── green/ # Green Agent (A2A evaluator for AgentBeats)
│ ├── agent_a2a.py # A2A server (JSON-RPC 2.0)
│ ├── judge_green.py # Scoring engine
│ ├── tasks_green.py # Task definitions
│ └── Dockerfile # Green agent container
└── agent/ # LLM training agent
├── agent.py # LLM-powered agentic loop
├── env_client.py # InProcessEnvClient (no HTTP needed)
├── train_grpo.py # GRPO training pipeline
├── smoke_test.py # Rule-based smoke test (no LLM)
├── direct_test.py # Direct environment test
├── inference.py # Inference script
├── plot_training.py # Training curve visualisation
└── tests/
└── test_comtrade.py # Unit + integration tests
| ID | Name | Challenge |
|---|---|---|
| T1 | Single page | Fetch one page, submit. Baseline correctness. |
| T2 | Multi-page pagination | Iterate pages until has_more=False. |
| T3 | Deduplication | Pages overlap; agent must dedup by primary key. |
| T4 | HTTP 429 retry | Rate-limit fault injection; retry without data loss. |
| T5 | HTTP 500 retry | Server error fault; retry transient failures. |
| T6 | Page drift | Non-deterministic page ordering; handle instability. |
| T7 | Totals trap | Summary rows mixed in; drop is_total=true rows. |
| T8 | Mixed faults | 429 rate-limit + cross-page duplicates simultaneously. |
| T9 | Adaptive adversary | Faults escalate mid-episode based on agent progress. |
| T10 | Constrained budget | Single agent runs under halved request budget. |
get_task_info() → task description, query params, request budget
fetch_page(page, page_size) → {rows, page, total_pages, has_more}
submit_results(data_jsonl, metadata_json, run_log) → {reward, score, breakdown}
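A minimal sketch of how an agent loop might drive fetch_page with retries is shown below. It assumes pages are numbered from 1 and that 429 / 500 responses surface as an exception; `TransientAPIError`, `fetch_all`, and `make_flaky_fetch` are hypothetical names, and the real tool may report faults differently.

```python
import time


class TransientAPIError(Exception):
    """Hypothetical stand-in for a 429 / 500 response surfaced by the tool."""


def fetch_all(fetch_page, page_size=100, max_retries=3, base_delay=1.0):
    """Page until has_more is False, retrying transient faults with backoff."""
    rows, log, page = [], [], 1
    while True:
        for attempt in range(max_retries + 1):
            try:
                result = fetch_page(page, page_size)
                break
            except TransientAPIError as exc:
                if attempt == max_retries:
                    raise  # give up rather than silently skip the page
                delay = base_delay * (2 ** attempt)
                log.append(f"page={page} retry={attempt + 1} backoff={delay}s error={exc}")
                time.sleep(delay)
        rows.extend(result["rows"])
        if not result["has_more"]:
            return rows, log
        page += 1


def make_flaky_fetch(total_pages=3):
    """Fake fetch_page that fails once on page 2, for demonstration only."""
    failed = set()

    def fetch_page(page, page_size):
        if page == 2 and page not in failed:
            failed.add(page)
            raise TransientAPIError("429")
        return {"rows": [{"id": f"p{page}"}], "page": page,
                "total_pages": total_pages, "has_more": page < total_pages}

    return fetch_page


rows, log = fetch_all(make_flaky_fetch(), base_delay=0.0)
```

The key design point is that the page counter only advances after a successful fetch, so a rate-limited page is retried rather than skipped.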
| Dimension | Weight | What it measures |
|---|---|---|
| Correctness | 30 | All expected rows present and correct |
| Completeness | 15 | No missing records |
| Robustness | 15 | Correct handling of 429/500 faults |
| Efficiency | 15 | Request count relative to minimum needed |
| Data Quality | 15 | No duplicates, no totals rows leaked |
| Observability | 10 | run.log contains required fields |
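The sketch below shows one way the weights could combine into a score and reward. It assumes the total is a plain sum of per-dimension points and that reward is score / 100, which matches the score and reward columns reported in this README; the actual aggregation in server/judge.py may differ, and `total_score` is a hypothetical helper.

```python
WEIGHTS = {
    "correctness": 30,
    "completeness": 15,
    "robustness": 15,
    "efficiency": 15,
    "data_quality": 15,
    "observability": 10,
}


def total_score(breakdown):
    """Sum per-dimension points into (score, reward); assumed aggregation."""
    for name, points in breakdown.items():
        if not 0 <= points <= WEIGHTS[name]:
            raise ValueError(f"{name} must be between 0 and {WEIGHTS[name]}")
    raw = sum(breakdown.values())
    return round(raw, 1), round(raw / 100, 3)


# Full marks everywhere except Robustness 12/15 and Observability 8.67/10.
breakdown = {
    "correctness": 30,
    "completeness": 15,
    "robustness": 12,
    "efficiency": 15,
    "data_quality": 15,
    "observability": 8.67,
}
score, reward = total_score(breakdown)
```

With every other dimension assumed at full marks, this breakdown sums to 95.7 (reward 0.957), which is consistent with the T4 score reported for Kimi.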
Note. All commands below assume you cd comtrade_env first: several scripts import models / server by relative path, so the current working directory must be the repo root (or you must export PYTHONPATH=$(pwd)).
cd comtrade_env
cp .env.example .env
# Edit .env and paste whichever provider key you want (Kimi / Anthropic / Groq / Nebius).
# Smoke tests and the rule-based baseline do NOT need any API keys.

cd comtrade_env
# Install OpenEnv framework (if not already)
pip install "openenv-core[core]"
# Run rule-based agent on one task
python agent/smoke_test.py --task T1_single_page
# Run all tasks
for t in T1_single_page T2_multi_page T3_duplicates \
T4_rate_limit_429 T5_server_error_500 T6_page_drift T7_totals_trap \
T8_mixed_faults T9_adaptive_adversary T10_constrained_budget; do
python agent/smoke_test.py --task $t
done

cd comtrade_env
pip install pytest
python -m pytest agent/tests/ -v

cd comtrade_env
# Install agent dependencies
pip install torch transformers accelerate peft trl openai requests fastmcp fastapi uvicorn
# Using a local Ollama/vLLM endpoint (rollout-only, no gradient updates)
python agent/train_grpo.py \
--api-url http://localhost:11434/v1 \
--api-model qwen2.5:7b \
--num-iterations 200 \
--batch-size 4 \
--group-size 4
# Using a HuggingFace model (full GRPO training with gradients)
python agent/train_grpo.py \
--hf-model Qwen/Qwen2.5-7B-Instruct \
--num-iterations 200

No external OpenEnv server is needed: InProcessEnvClient runs the environment in-process.
The three canonical LLM result files (llm_results_kimi.json, llm_results_claude.json,
llm_results_llama.json) were produced by agent/run_eval.py against the same 10-task suite,
temperature=0.0, seed=42. To regenerate them on your own keys:
cd comtrade_env
cp .env.example .env # fill in the relevant key (see §0)
# Kimi Moonshot V1-128k (international endpoint shown; swap to .cn for China)
python agent/run_eval.py \
--api-url https://api.moonshot.ai/v1 \
--api-model moonshot-v1-128k \
--env-key KIMI_API_KEY \
--label kimi_128k_apples --all
# Claude Sonnet 4.6
python agent/run_eval.py \
--api-url https://api.anthropic.com/v1 \
--api-model claude-sonnet-4-6 \
--env-key ANTHROPIC_API_KEY \
--label claude_sonnet_4_6 --all
# Llama 3.3 70B via Groq
python agent/run_eval.py \
--api-url https://api.groq.com/openai/v1 \
--api-model llama-3.3-70b-versatile \
--env-key GROQ_API_KEY \
--label llama3_3_70b --all
# Ablation condition C (context=128k + EVENTS scratchpad prompt)
python agent/run_eval.py \
--api-url https://api.moonshot.ai/v1 \
--api-model moonshot-v1-128k \
--env-key KIMI_API_KEY \
--label kimi_ablation_events_enhanced \
--prompt-file agent/prompts/enhanced_events.txt \
--tasks T4_rate_limit_429 T5_server_error_500

Each run writes a timestamped eval_<label>_<timestamp>.json in the repo root. The committed
llm_results_*.json files are frozen snapshots of the runs used for the submission; exact
bit-level reproduction requires the same provider endpoints and model versions available on
2026-04-19. The ablation JSON is fully reproducible from the commands above.
To regenerate benchmark_results.png after a new run:
python agent/plot_benchmark.py

cd comtrade_env
docker build -t comtrade-env:latest -f server/Dockerfile .
docker run -p 8000:8000 comtrade-env:latest

# Auto-uploads README, Dockerfile, server/, green/, blog, images, results JSONs.
# Uses `hf upload` so LFS is handled without a local git-lfs install.
bash deploy_hf.sh

Or, from scratch with the OpenEnv CLI:
openenv push --repo-id <your-hf-org>/comtrade-env

- Same env code in training and eval. Rollouts use InProcessEnvClient, eval uses the Docker Space. Both construct the identical ComtradeEnvironment instance, so training conditions and judged conditions never diverge.
- Episode isolation across concurrent rollouts. The embedded mock service keys state by (task_id, episode_id), so parallel GRPO workers never corrupt each other's data even though they share one service.
- Procedural fixtures, not recorded data. All 10 tasks are generated from a seeded PRNG. No external API dependency, no fixture drift, full reproducibility from a task ID plus seed.
- Scoring aligned to training signal. The six-dimensional judge emits a scalar reward that matches the same breakdown used for eval, so GRPO optimises directly against the evaluation metric rather than a proxy.
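Two of the design points above, seeded procedural fixtures and per-episode state isolation, can be sketched as follows. This is an illustration under assumptions rather than the code in server/mock_service/: `make_fixture`, `EpisodeStore`, and the row fields are hypothetical.

```python
import random


def make_fixture(task_id, seed, n_rows=5):
    """Generate rows deterministically from a task ID plus seed."""
    rng = random.Random(f"{task_id}:{seed}")  # private RNG, no global state
    return [{"id": f"{task_id}-{i}", "value": rng.randint(1, 1000)}
            for i in range(n_rows)]


class EpisodeStore:
    """Keep mutable state separate per (task_id, episode_id)."""

    def __init__(self):
        self._state = {}

    def state_for(self, task_id, episode_id):
        key = (task_id, episode_id)
        return self._state.setdefault(key, {"requests_used": 0})


store = EpisodeStore()
store.state_for("T9", "ep-a")["requests_used"] += 3
store.state_for("T9", "ep-b")["requests_used"] += 1
```

Because each fixture uses its own `random.Random` instance, two workers generating the same task with the same seed get identical rows, and the tuple key keeps their request counters from interfering.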
| Task | Score | Reward |
|---|---|---|
| T1 Single page | 98.0 | 0.980 |
| T2 Multi-page | 98.0 | 0.980 |
| T3 Duplicates | 98.0 | 0.980 |
| T4 Rate limit | 95.0 | 0.950 |
| T5 Server error | 95.7 | 0.957 |
| T6 Page drift | 94.0 | 0.940 |
| T7 Totals trap | 98.0 | 0.980 |
| T8 Mixed faults | 96.4 | 0.964 |
| T9 Adaptive adversary | 96.9 | 0.969 |
| T10 Constrained budget | 98.0 | 0.980 |
| Average | 96.8 | 0.968 |
Rule-based baseline vs. Kimi LLM agent across the 10-task suite.
Three matplotlib panels for the Qwen2.5-3B + LoRA Lambda A100 run — the fine-grained per-iter view that complements the aggregated envelope figure above:
- Left (reward vs iteration): clear learning signal from iter 3 to iter 14 (mean reward oscillates 0.0 – 0.73 as different task subsets are sampled). The dip to 0.0 at iter 12 is a one-off (an unlucky T9 sample plus an infra failure), not the collapse. The real collapse starts at iter 15; the next recorded point, iter 18, has mean reward 0.027, the LoRA adapter having drifted into a degenerate output region.
- Middle (per-task reward): shows which specific tasks contributed to each iter's mean. T7 (totals trap) and T8 (mixed faults) are frequently the hardest for 3B to solve; T1 and T3 reliably hit near-max rewards during the learning phase.
- Right (loss and KL divergence): KL grows monotonically (adapter drifts from base) up through iter 14, then is higher still at iter 18 (1e-3) even after mean reward has collapsed, confirming the adapter kept moving through the collapse into a worse region of policy space. Loss oscillates positive/negative as expected (GRPO loss sign tracks policy-improvement direction, not "goodness").
Raw per-iter JSON in grpo_gradient_training_3b.jsonl; interpretation + collapse diagnosis in grpo_3b_lora_collapse.json. Note iters 15-17 are missing from the JSON because they produced zero valid rollouts (no gradient step, nothing to record).
All 10 tasks run under the same moonshot-v1-128k variant, temperature=0.0, seed=42. See
llm_results_kimi.json for the full breakdown including per-dimension sub-scores.
| Task | Score | Reward | Delta vs baseline (pts) |
|---|---|---|---|
| T1 Single page | 98.7 | 0.987 | +0.7 |
| T2 Multi-page | 98.7 | 0.987 | +0.7 |
| T3 Duplicates | 98.7 | 0.987 | +0.7 |
| T4 Rate limit (429) | 95.7 | 0.957 | +0.7 |
| T5 Server error (500) | 96.3 | 0.963 | +0.6 |
| T6 Page drift | 94.7 | 0.947 | +0.7 |
| T7 Totals trap | 98.7 | 0.987 | +0.7 |
| T8 Mixed faults | 97.3 | 0.973 | +0.9 |
| T9 Adaptive adversary | 97.5 | 0.975 | +0.6 |
| T10 Constrained budget | 98.7 | 0.987 | +0.7 |
| Average (T1-T10) | 97.5 | 0.975 | +0.7 |
Kimi-128k exceeds the rule-based baseline by 0.6–0.9 points on all 10 tasks. The remaining gap on T4/T5 Robustness (12/15, not 15/15) is a scoring sub-criterion explored in the ablation below, not a silent-retry failure.
| Model | Avg (T1-T10) | T1-T8 avg | T9 score | T10 score |
|---|---|---|---|---|
| Rule-based baseline | 96.8 | 96.6 | 96.9 | 98.0 |
| Kimi Moonshot V1-128k | 97.5 | 97.4 | 97.5 (std 0.0 across 5 seeds) | 98.7 |
| Claude Sonnet 4.6 | 97.5 | 97.4 | 97.5 | 98.7 |
| Qwen2.5-7B-Instruct (open, zero-shot) ⭐ | 97.2 | 97.2 | 97.5 | 98.7 |
| GPT-5 | 93.2 | 95.0 | 75.7 | 95.7 |
| Llama 3.3 70B (Groq) | 89.3 | 97.4 | 18.7 – 97.5 (bimodal†) | 95.7 |
† Llama T9 is bimodal across seeds: the published seed-42 run hit 18.7, but the multi-seed re-run produced {97.5, 94.5, 429, 429, 429}, where the three 429s are Groq daily token-limit rate limits, not model failures. Raw per-seed data in multiseed_llama_t9_summary.json.
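When summarising runs like these, it helps to keep infrastructure failures separate from model scores. The sketch below does that for the Llama T9 numbers quoted in the footnote; the record layout (`status`, `score`) and the `summarise` helper are hypothetical and do not reflect the schema of multiseed_llama_t9_summary.json.

```python
from statistics import mean, pstdev

# Hand-typed from the footnote: one published run plus the five-seed re-run.
llama_t9_runs = [
    {"run": "published", "status": "ok", "score": 18.7},
    {"run": "rerun-1", "status": "ok", "score": 97.5},
    {"run": "rerun-2", "status": "ok", "score": 94.5},
    {"run": "rerun-3", "status": "rate_limited_429", "score": None},
    {"run": "rerun-4", "status": "rate_limited_429", "score": None},
    {"run": "rerun-5", "status": "rate_limited_429", "score": None},
]


def summarise(runs):
    """Report spread over valid runs and count infra failures separately."""
    scores = [r["score"] for r in runs if r["status"] == "ok"]
    return {
        "n_valid": len(scores),
        "n_infra_failures": len(runs) - len(scores),
        "min": min(scores),
        "max": max(scores),
        "mean": mean(scores),
        "std": pstdev(scores),
    }


summary = summarise(llama_t9_runs)
```

Counting a 429 as a score of zero would mix provider quota with model behaviour, so the rate-limited runs are excluded from the spread and only counted.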
Three independent discriminative signals:
- T9 separates execution-oriented from reasoning-oriented frontier. Kimi and Claude execute T9 in ~8 s with 7 tool calls and score 97.5. GPT-5 "thinks" for ~223 s across only 2 tool calls and scores 75.7: a 21.8-point gap between frontier models that a pass/fail benchmark would completely miss. GPT-5's Efficiency drops to 6/15 (uses almost the whole budget in reasoning-time) and Observability to ~4/10 (2 steps leave almost no audit trail). The benchmark measures execution behaviour under adversity, not raw reasoning capability, and the two diverge at the frontier.
- Frontier saturates at the top. Kimi-128k and Claude Sonnet 4.6 produce numerically identical per-task scores across all 10 tasks (98.7 / 98.7 / 98.7 / 95.7 / 96.3 / 94.7 / 98.7 / 97.3 / 97.5 / 98.7). Not close: identical. The environment is seeded, the judge is deterministic, and both frontier models solve each task the same way → same score. The residual 2.5-pts-per-task gap below perfect is a rubric ceiling (Robustness 12/15 on T4/T5 is a keyword-match artifact, Observability ~8.67/10 by design), not a model capability gap. ComtradeBench today cannot fine-rank two execution-optimised frontier models.
- Sub-frontier is high-variance, not uniformly weak. Multi-seed Kimi T9 = 97.5 with std 0.0 across 5 seeds. Multi-seed Llama T9 spans 18.7 – 97.5. The discriminative signal is reliability, not capability: Llama can sometimes match frontier, just not consistently. Production agent deployment needs the consistent half.
Full per-task breakdowns in llm_results_kimi.json, llm_results_claude.json, llm_results_gpt5.json, llm_results_llama.json, multiseed_kimi_t9_summary.json, multiseed_llama_t9_summary.json.
We originally claimed the T4/T5 (HTTP 429 / 500) Robustness gap could be closed with an EVENTS scratchpad prompt pattern. The data says otherwise. Three conditions on Kimi (same model family, same agent loop, same seed):
| Condition | Context | Prompt | T4 Robustness | T5 Robustness |
|---|---|---|---|---|
| A | 8k | default | 0 / 15 | 0 / 15 |
| B | 128k | default | 12 / 15 | 12 / 15 |
| C | 128k | EVENTS scratchpad (enhanced) | 12 / 15 | 12 / 15 |
A → B (context effect): +12 Robustness on both tasks just from enlarging the context window. B → C (prompt effect): zero additional gain from explicit EVENTS instructions.
The original T4/T5 = 0 Robustness result was not a narration failure — it was a context-truncation
failure. At 8k, the retry narration fell off the back of the buffer before it could land in
run_log. At 128k, the same prompt captures everything. Adding explicit EVENTS scaffolding on top
changes nothing, because the model already logs adequately when it has room to.
Takeaway for agent builders: on tool-use benchmarks with long trajectories, size the context
to the episode length before reaching for prompt engineering. A prompt cannot recover narration
that was never written because the buffer filled up. Full data in ablation_context_vs_prompt.json.
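The truncation effect can be illustrated with a toy sketch. It assumes a buffer that keeps the most recent messages that fit the budget and counts whitespace-separated words as tokens; both are simplifications, and `truncate_to_budget` is a hypothetical helper rather than the agent's actual context handling.

```python
def truncate_to_budget(messages, max_tokens):
    """Keep the most recent messages that fit within max_tokens."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = len(msg.split())  # crude token count, for illustration only
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))


transcript = [
    "retry: page 2 returned 429, backing off",
    "fetched page 2",
    "fetched page 3",
    "fetched page 4",
]
small_context = truncate_to_budget(transcript, max_tokens=9)
large_context = truncate_to_budget(transcript, max_tokens=100)
```

With the small budget the early retry note is gone by the time the run log is written, while the large budget keeps it; no prompt wording can restore text that is no longer in the buffer.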
The six-dimensional rubric is weighted 30/15/15/15/15/10. The design principle is that correctness is necessary but not sufficient — so Correctness gets the largest single weight (30), but the combined weight of "execution quality under adversity" dimensions (Completeness + Robustness + Efficiency + Data Quality = 60) exceeds Correctness. This forces scoring to reward agents that do the job right, not just return something plausible. Observability at 10 is intentionally lower than the execution dimensions: it's an audit requirement rather than a core task, but it's not zero because an un-auditable pipeline is not a production-ready pipeline.
| Benchmark | Adversarial faults in env | Within-episode non-stationarity | Multi-dim execution scoring | Budget constraints |
|---|---|---|---|---|
| ToolBench (Qin et al., 2023) | — | — | — | — |
| τ-bench (Sierra / Anthropic) | partial (policy violations) | — | ✓ (pass@k on policies) | — |
| BFCL (Berkeley) | — | — | — | — |
| API-Bank | — | — | — | — |
| ComtradeBench | ✓ (429/500/drift/dupes/totals) | ✓ (T9) | ✓ (6 dimensions) | ✓ (T10) |
Closest relative is τ-bench — it also scores beyond "did the final answer match" and injects policy-level adversarial conditions. ComtradeBench's unique combination is environment-level fault injection + within-episode escalation (T9) + budget-aware rollouts (T10). The adversarial bits are not in the prompts or the labels — they are in the environment, so an agent cannot route around them by rephrasing.
These are the specific things this release does not yet do:
- Frontier saturation at the ceiling. Kimi-128k and Claude Sonnet 4.6 produce numerically identical per-task scores across all 10 tasks (97.5 avg each). ComtradeBench today measures execution reliability well but does not fine-rank two execution-optimised frontier models against each other. A harder T9 variant with steeper mid-episode escalation, plus additional tasks T11+ targeting frontier-model behaviours, would reopen cross-frontier discrimination.
- Sub-frontier reliability is noisy, not uniform. Llama 3.3 70B on T9 is bimodal: the same seed produced 18.7 on the original run and 97.5 on the multi-seed re-run, and three of five seeds hit Groq daily token-limit 429s rather than model failures. The correct statement is "Llama is high-variance on T9", not "Llama uniformly collapses". Multi-seed evidence in multiseed_llama_t9_summary.json.
- T4/T5 Robustness ceiling at 12/15 is a rubric string-matching artifact. Reading server/judge.py L293-336, the +3 bonus on rate-limit tasks requires the literal keyword "exponential" or "backoff" in run.log; on server-error tasks it requires "max" or "limit". The retry logic itself is correct; the ceiling is a rubric artifact, not a model capability gap. Future work is to broaden the keyword set or move to a semantic check.
- Five LLMs evaluated. Kimi Moonshot V1-128k, Claude Sonnet 4.6, GPT-5, Llama 3.3 70B, and Qwen2.5-7B-Instruct (open-source, zero-shot). Adding Gemini, Qwen2.5-72B, and DeepSeek would broaden the cross-model story further, though the current data already exposes four independent discriminative findings (execution-vs-reasoning at the frontier, saturation at the ceiling, reliability at the sub-frontier, instruction-tuned 7B parity with the closed frontier).
- Single-seed evaluation for most LLMs. Kimi and Llama have multi-seed data on T9 (multiseed_*_summary.json). Claude, GPT-5, Qwen2.5-7B, and all other tasks use seed=42 only. Expanding multi-seed coverage is future work.
- GRPO training stability engineering is future work. The 3B + LoRA collapse at iter 15 is a diagnosable instability: adaptive KL penalty, stricter trust-region clipping, or early-stop on reward-variance collapse would likely stabilise the learning window past iter 14. We did not perform this hyperparameter engineering in the submission window.
- Benchmark comparison is qualitative. The feature matrix vs τ-bench / BFCL / ToolBench is qualitative. We have not yet run the same Kimi agent across all four benchmarks to produce a quantitative cross-benchmark anchor.
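The keyword-match limitation noted above for T4/T5 can be illustrated with a small sketch. It is not the code in server/judge.py: the function name, the case-insensitive matching, and the broadened keyword set are assumptions, and only the literal keywords "exponential" and "backoff" and the +3 bonus come from the description above.

```python
RATE_LIMIT_KEYWORDS = ("exponential", "backoff")


def literal_bonus(run_log, keywords=RATE_LIMIT_KEYWORDS, bonus=3):
    """Award the bonus only if a literal keyword appears in the log."""
    text = run_log.lower()  # case-insensitive matching is an assumption
    return bonus if any(keyword in text for keyword in keywords) else 0


# Correct behaviour, described without the expected words: no bonus.
plain_log = "429 on page 2, waited 1s then 2s then 4s, retried OK"
# Same behaviour, described with a keyword: bonus awarded.
keyword_log = "429 on page 2, exponential backoff, retried OK"

# Hypothetical broadened keyword set, one of the future-work options.
BROADER_KEYWORDS = RATE_LIMIT_KEYWORDS + ("waited", "retry after")

plain_points = literal_bonus(plain_log)
keyword_points = literal_bonus(keyword_log)
broadened_points = literal_bonus(plain_log, keywords=BROADER_KEYWORDS)
```

Both logs describe the same retry behaviour, yet only the one using an expected word earns the bonus, which is why the 12/15 ceiling reflects wording rather than capability.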
Environment code follows the OpenEnv BSD-style license. Agent training code is provided as-is for the AgentBeats competition.


