A fake-but-realistic Python SDK (acme-sdk-python) used as a test bed for an AI agent evaluation comparing two GitHub tool integration patterns:
- MCP arm: GitHub MCP server (@modelcontextprotocol/server-github)
- Skill arms: gh CLI reference skill files (two variants: LobeHub and Vault)
For a talk: "MCP vs. Command Line: A Head-to-Head Evaluation of Agent Tool Integration Patterns" (AI Engineer Miami).
GitHub repo: seldo/acme-sdk-python
All deliverables from the original spec are complete. Two full --runs 1 passes (all 3 arms × 25 tasks) have been logged to Arize AX successfully. The reconciler cut full-eval wall time from ~15 hr to ~55 min — scaling to --runs 5 is now feasible (~4-5 hr).
- build_repo.sh: initial scaffolding (not needed post-setup)
- setup_github.sh: cold-path full rebuild of GitHub metadata
- eval/repo_state.py + repo_state.json: fast incremental reconciler for warm resets
- eval/tasks.json: 25 tasks across 4 tiers (5/6/6/8)
- eval/run_eval.py, arms.py, evaluators.py, resolve_numbers.py, rate_limit.py: eval harness
- eval/skills/: both skill files (LobeHub, Vault)
The spec's eval/reset_repo.sh was dropped as redundant with setup_github.sh.
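The `--runs 5` estimate in the status note above can be reproduced with a back-of-envelope calculation. This is a rough sketch that assumes wall time scales linearly with the number of runs, which the measurements so far do not confirm; the function name is illustrative.

```python
# Observed: one --runs 1 pass (3 arms x 25 tasks) takes roughly 55 minutes.
MINUTES_PER_SINGLE_RUN_PASS = 55


def estimated_hours(runs: int, minutes_per_pass: float = MINUTES_PER_SINGLE_RUN_PASS) -> float:
    """Estimate full-eval wall time in hours, assuming linear scaling in --runs."""
    return runs * minutes_per_pass / 60


print(f"{estimated_hours(5):.1f} hr")  # 5 * 55 / 60 = 4.58, prints "4.6 hr"
```

The result sits inside the ~4-5 hr range quoted above; rate limiting or fallback rebuilds could push a real run higher.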
Cold setup (first time, or if the repo is in an unknown state):
cd /Users/laurievoss/projects/arize/demos/acme-sdk-python
./setup_github.sh seldo/acme-sdk-python
python eval/repo_state.py snapshot seldo/acme-sdk-python
Takes ~5 min. Creates 12 open issues, 3 open PRs, 3 milestones, and captures a snapshot for fast resets.
Between eval runs, the harness calls python eval/repo_state.py reconcile (not setup_github.sh) — this diffs live state against the snapshot and applies only the mutations needed to restore it (~5-10s instead of ~3 min). If structural drift can't be reconciled (deleted issue, merged PR, deleted branch), the harness automatically falls back to setup_github.sh and re-snapshots.
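The reconcile-then-fall-back flow described above can be sketched roughly as follows. This is not the code in `repo_state.py` or `rate_limit.py`: the snapshot shape (issue number mapped to a dict of fields), the `StructuralDrift` exception, and the function names are all assumptions made for illustration, and only the "deleted issue" kind of drift is modeled.

```python
class StructuralDrift(Exception):
    """Live state differs in a way that field-level edits cannot repair."""


def plan_reconcile(snapshot, live):
    """Return the (number, field, value) edits that restore `live` to `snapshot`."""
    edits = []
    for number, expected in snapshot.items():
        actual = live.get(number)
        if actual is None:
            # A deleted issue cannot be recreated with the same number.
            raise StructuralDrift(f"issue #{number} is missing")
        for field, want in expected.items():
            if actual.get(field) != want:
                edits.append((number, field, want))
    return edits


def reset(snapshot, live, full_rebuild):
    """Try the warm path first; fall back to the cold path on structural drift."""
    try:
        return "reconciled", plan_reconcile(snapshot, live)
    except StructuralDrift:
        full_rebuild()  # stands in for setup_github.sh plus a fresh snapshot
        return "rebuilt", []


# Hypothetical state: issue 1 was closed and issue 2 gained a label during a run.
snapshot = {1: {"state": "open", "labels": ["bug"]}, 2: {"state": "open", "labels": []}}
live = {1: {"state": "closed", "labels": ["bug"]}, 2: {"state": "open", "labels": ["wontfix"]}}
print(reset(snapshot, live, full_rebuild=lambda: None))
# ('reconciled', [(1, 'state', 'open'), (2, 'labels', [])])
```

Computing the edit list before applying anything keeps the warm path cheap: an unchanged repo yields an empty list and no API writes.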
Issue/PR numbers are NOT baked into tasks.json — resolve_numbers.py substitutes them at runtime from {{PLACEHOLDER}} tokens.
cp eval/.env.example eval/.env
# Fill in: ARIZE_API_KEY, ARIZE_SPACE_ID, ANTHROPIC_API_KEY,
# GITHUB_PERSONAL_ACCESS_TOKEN, EVAL_REPO=seldo/acme-sdk-python

cd eval
pip install -r requirements.txt

# Tier 1 only, dry run (no Arize logging): smoke test
python eval/run_eval.py --tier 1 --dry-run --repo seldo/acme-sdk-python
# Full 3-arm run, 5x per task
python eval/run_eval.py --repo seldo/acme-sdk-python
# Single arm, single task, 1 run
python eval/run_eval.py --arm mcp --task T01 --runs 1 --repo seldo/acme-sdk-python

Running from inside a Claude Code session: the harness spawns claude as a subprocess via claude-agent-sdk, which refuses to recurse. Strip the env vars first:
env -u CLAUDECODE -u CLAUDE_CODE_ENTRYPOINT python eval/run_eval.py ...

acme-sdk-python/
├── build_repo.sh # Initial scaffolding script (not needed post-setup)
├── setup_github.sh # Cold-path: full rebuild of GitHub metadata (~3 min)
├── src/acme_sdk/ # Fake SDK source
├── tests/ # Test suite
├── docs/ # Documentation
├── examples/ # Usage examples
└── eval/
    ├── run_eval.py # Main eval harness
    ├── arms.py # 3-arm configurations (mcp, lobehub, vault)
    ├── evaluators.py # 5 scoring evaluators
    ├── rate_limit.py # Rate-limit checks + reset orchestration (reconcile → fallback)
    ├── repo_state.py # Fast incremental reconciler (snapshot / reconcile / diff)
    ├── repo_state.json # Expected repo state snapshot for reconciler
    ├── resolve_numbers.py # Dynamic issue/PR number resolution
    ├── tasks.json # 25 task definitions (uses {{PLACEHOLDER}} tokens)
    ├── README.md # Eval framework overview
    ├── .env.example # Template for required credentials
    └── skills/
        ├── gh-cli-lobehub.md # LobeHub gh CLI skill
        └── github-cli-vault.md # Vault gh CLI skill
- 3 arms: MCP (GitHub MCP server), Skill-LobeHub, Skill-Vault
- 25 tasks across 4 tiers (5 / 6 / 6 / 8)
- 5 runs per task per arm for variance measurement
- 5 evaluators: correctness, output_quality (LLM-as-judge, Tier 4 only), efficiency (tool calls vs. baseline), latency, tool_fidelity (did agent use the right integration pattern?)
- Results logged to Arize AX via their experiments API
- Write tasks reset repo state between runs via repo_state.py reconcile, falling back to setup_github.sh when drift can't be reconciled (orchestrated and rate-limited by rate_limit.py)
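To make the size of a full evaluation concrete, the run matrix implied by the list above can be enumerated. This is a small sketch: the arm names come from the `arms.py` description, while the consecutive `T01`..`T25` task IDs are an assumption extrapolated from the single `T01` example.

```python
from itertools import product

ARMS = ["mcp", "lobehub", "vault"]       # arm names as listed for arms.py
TIER_SIZES = {1: 5, 2: 6, 3: 6, 4: 8}    # tasks per tier
RUNS_PER_TASK = 5

# Hypothetical IDs: T01..T25, numbered consecutively through the tiers.
task_ids = [f"T{n:02d}" for n in range(1, sum(TIER_SIZES.values()) + 1)]

# One entry per (arm, task, run index) that the harness would execute.
matrix = list(product(ARMS, task_ids, range(1, RUNS_PER_TASK + 1)))

print(len(task_ids), len(matrix))  # 25 375
```

So a default run executes 375 agent sessions, and a `--runs 1` pass executes 75.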
Issue and PR numbers in tasks.json use {{PLACEHOLDER}} tokens (e.g., {{ISSUE_BATCH_SHUTDOWN}}, {{PR_CI_MATRIX}}). resolve_numbers.py queries GitHub at eval startup and substitutes real numbers into the raw JSON before parsing. All 16 placeholders (13 issues + 3 PRs) are defined in ISSUE_KEYS and PR_KEYS.
Never hand-edit numbers back into tasks.json, and do not re-add the jq substitution block to setup_github.sh — placeholder resolution makes both unnecessary and conflicting.
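The substitute-before-parse step can be illustrated with a minimal sketch. It is not the code in `resolve_numbers.py`: the function name, the task fields, and the numbers 7 and 15 are made up for the example, and the real module looks the numbers up on GitHub rather than taking them from a dict.

```python
import json
import re

# Matches tokens such as {{ISSUE_BATCH_SHUTDOWN}} or {{PR_CI_MATRIX}}.
PLACEHOLDER = re.compile(r"\{\{([A-Z0-9_]+)\}\}")


def resolve_placeholders(raw_json: str, numbers: dict) -> list:
    """Substitute {{KEY}} tokens in the raw text, then parse the result as JSON."""

    def substitute(match):
        key = match.group(1)
        if key not in numbers:
            raise KeyError(f"unresolved placeholder: {key}")
        return str(numbers[key])

    return json.loads(PLACEHOLDER.sub(substitute, raw_json))


# Illustrative task entry; the bare {{PR_CI_MATRIX}} is not valid JSON until resolved.
raw = '[{"id": "T01", "prompt": "Summarize issue #{{ISSUE_BATCH_SHUTDOWN}}", "pr": {{PR_CI_MATRIX}}}]'
tasks = resolve_placeholders(raw, {"ISSUE_BATCH_SHUTDOWN": 7, "PR_CI_MATRIX": 15})
print(tasks[0]["prompt"], tasks[0]["pr"])  # Summarize issue #7 15
```

Working on the raw text means a placeholder can stand in for a bare JSON number as well as appear inside a string, which parsing first would not allow.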
run_eval.py uses ArizeClient from the arize package:
- client.datasets.create(space_id, name, examples=df)
- client.experiments.run(name, dataset_id, task, evaluators, ...)
Verify these match the current arize SDK before running — the SDK evolves.
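One lightweight way to do that check is to compare the parameter names the harness passes against what the installed method actually accepts. The helper below is a generic sketch using only the standard library; `missing_params` and the stand-in function are hypothetical, and nothing here asserts what the arize SDK's signatures really are.

```python
import inspect


def missing_params(func, expected):
    """Return the names in `expected` that `func` does not accept as parameters."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter accepts any keyword, so nothing can be reported missing.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return [name for name in expected if name not in params]


# Stand-in with the shape the harness expects; NOT the real SDK method.
def fake_create(space_id, name, examples=None):
    return None


print(missing_params(fake_create, ["space_id", "name", "examples"]))  # []
print(missing_params(fake_create, ["space_id", "dataset_name"]))      # ['dataset_name']
```

Against a real client, the same helper could be pointed at `client.datasets.create` and `client.experiments.run` before a long run; a non-empty result suggests the SDK has moved on from the call shapes above.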