AgentProbe is a Python-based CLI testing harness that launches Claude Code non-interactively to test how well AI agents interact with command-line tools. It records execution traces, analyzes patterns, and generates actionable recommendations for improving CLI usability. Distributed via Astral's uv/uvx for instant, install-free execution.
- Tool-Agnostic: Test any CLI tool without modifying AgentProbe's core
- Simple Scenarios: Plain text prompts, no complex configuration
- Reproducible Testing: Consistent test execution across environments
- Actionable Insights: Identify where agents struggle with your CLI
```bash
# Run directly with uvx
uvx agentprobe test vercel --scenario deploy

# Or install uv first if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or install with pip
pip install agentprobe
```

Requirements:

- Python ≥ 3.10
- Claude Code CLI: `npm install -g @anthropic-ai/claude-code`
- Target CLI tool (e.g., vercel, gh, docker)
AgentProbe uses a simple command structure:

```bash
agentprobe test <tool> --scenario <name>
```

Examples:
```bash
# Test Vercel deployment
agentprobe test vercel --scenario deploy

# Test GitHub CLI PR creation
agentprobe test gh --scenario create-pr

# Test Docker container management
agentprobe test docker --scenario run-nginx
```

Scenarios are plain text files containing prompts for Claude. No YAML, no configuration - just the task description.
`scenarios/vercel/deploy.txt`:

```
Deploy this Next.js application to production using Vercel CLI.
Make sure the deployment is successful and return the deployment URL.
```

`scenarios/gh/create-pr.txt`:

```
Create a pull request for the current branch with a descriptive title
and summary of the changes.
```
Create your own scenarios by adding text files:

```bash
mkdir -p scenarios/mycli
echo "Run mycli init and configure it for production use" > scenarios/mycli/setup.txt
agentprobe test mycli --scenario setup
```

Under the hood, each test run loads the scenario prompt, executes it via the Claude Code SDK, and analyzes the resulting trace:

```python
async def run_test(tool: str, scenario_name: str):
    # 1. Load scenario prompt
    prompt = read_file(f"scenarios/{tool}/{scenario_name}.txt")

    # 2. Execute with Claude Code SDK
    trace = []
    async for message in query(prompt=prompt, options=default_options):
        trace.append(message)

    # 3. Analyze and report
    analysis = analyze_trace(trace)
    print_report(analysis)
```

- CLI (`cli.py`) - Simple command-line interface using Typer
- Runner (`runner.py`) - Executes scenarios via the Claude Code SDK
- Analyzer (`analyzer.py`) - Generic analysis of execution traces
- Reporter (`reporter.py`) - Formats results for terminal output
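The analyzer's internals aren't shown above, so here is a minimal sketch of what trace analysis could look like. It assumes each trace entry is a plain dict with a `text` field; the real Claude Code SDK message objects differ, and the `Analysis` fields are illustrative, not AgentProbe's actual API.

```python
from dataclasses import dataclass

@dataclass
class Analysis:
    """Summary of one execution trace (illustrative fields only)."""
    total_turns: int = 0
    help_invocations: int = 0
    errors_seen: int = 0
    success: bool = False

def analyze_trace(trace: list[dict]) -> Analysis:
    """Derive tool-agnostic metrics from a list of agent messages."""
    analysis = Analysis(total_turns=len(trace))
    for message in trace:
        text = message.get("text", "")
        if "--help" in text:
            analysis.help_invocations += 1
        if "error" in text.lower():
            analysis.errors_seen += 1
    # Treat the run as successful if the final message reports no error.
    if trace and "error" not in trace[-1].get("text", "").lower():
        analysis.success = True
    return analysis
```

Keyword matching like this is deliberately crude; it works for any CLI precisely because it knows nothing about the tool under test.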
```
agentprobe/
├── __init__.py
├── cli.py          # Command-line interface
├── runner.py       # Claude Code SDK integration
├── analyzer.py     # Trace analysis
├── reporter.py     # Output formatting
└── scenarios/      # Example scenarios
    ├── vercel/
    │   ├── deploy.txt
    │   ├── dev.txt
    │   └── rollback.txt
    ├── gh/
    │   ├── create-pr.txt
    │   └── clone.txt
    └── docker/
        ├── build.txt
        └── run.txt
```
AgentProbe performs generic analysis applicable to any CLI:

- Command Success/Failure - Did the CLI commands execute successfully?
- Error Recovery - How did the agent handle errors?
- Help Usage - Did the agent use `--help` when stuck?
- Flag Discovery - Were the correct flags identified?
- Interactive Handling - How were interactive prompts handled?
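The first two categories above can be computed with simple heuristics over the trace. A sketch, assuming each entry records a command and its exit code (a hypothetical shape, not the SDK's):

```python
def classify_commands(trace: list[dict]) -> dict:
    """Tally command outcomes and spot error recovery.

    Assumes entries look like {"cmd": "...", "exit_code": int};
    "recovered" means some failed command was later followed by a success.
    """
    failures = sum(1 for t in trace if t["exit_code"] != 0)
    successes = len(trace) - failures
    recovered = any(
        t["exit_code"] != 0 and any(u["exit_code"] == 0 for u in trace[i + 1:])
        for i, t in enumerate(trace)
    )
    return {"successes": successes, "failures": failures, "recovered": recovered}
```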
Example output:

```
╭─ AgentProbe Results ─────────────────────────────────────╮
│ Tool: vercel | Scenario: deploy                          │
│ Status: ✓ SUCCESS | Duration: 23.4s | Cost: $0.012       │
├──────────────────────────────────────────────────────────┤
│ Summary:                                                 │
│ • Successfully deployed to https://app-xi.vercel.app     │
│ • Required 5 turns to complete                           │
│ • No authentication errors encountered                   │
├──────────────────────────────────────────────────────────┤
│ Observations:                                            │
│ • Agent needed 2 attempts to find correct deploy flag    │
│ • Help command was used effectively                      │
│ • Output parsing was accurate                            │
╰──────────────────────────────────────────────────────────╯
```
| Option | Description | Default |
|---|---|---|
| `--scenario` | Scenario name | Required |
| `--work-dir` | Working directory | Current dir |
| `--max-turns` | Max agent interactions | 20 |
| `--verbose` | Show detailed trace | False |
| `--output` | Output format (text/json) | text |
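The table maps directly onto an argument parser. `cli.py` uses Typer, but to keep this sketch dependency-free it mirrors the same surface with stdlib `argparse`; names and defaults come from the table above:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Rebuild the option table as a parser (illustrative only)."""
    parser = argparse.ArgumentParser(prog="agentprobe test")
    parser.add_argument("tool", help="Target CLI tool, e.g. vercel")
    parser.add_argument("--scenario", required=True, help="Scenario name")
    parser.add_argument("--work-dir", default=".", help="Working directory")
    parser.add_argument("--max-turns", type=int, default=20,
                        help="Max agent interactions")
    parser.add_argument("--verbose", action="store_true",
                        help="Show detailed trace")
    parser.add_argument("--output", choices=["text", "json"], default="text",
                        help="Output format")
    return parser
```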
AgentProbe can also be used programmatically:

```python
import asyncio
from agentprobe import test_cli

async def main():
    result = await test_cli(
        tool="vercel",
        scenario="deploy",
        work_dir="/path/to/project"
    )
    print(f"Success: {result.success}")
    print(f"Duration: {result.duration_seconds}s")
    print(f"Cost: ${result.cost_usd}")

asyncio.run(main())
```

To add support for a new CLI:

- Create a directory under `scenarios/`
- Add text files with prompts
- Run tests - no code changes needed
While AgentProbe's analysis is generic, you can extend it:

```python
from agentprobe import run_test, analyze_trace

# Run test and get raw trace (inside an async function)
trace = await run_test("mycli", "scenario")

# Custom analysis
my_analysis = my_custom_analyzer(trace)
```

AgentProbe uses minimal configuration via environment variables:
```bash
# Claude Code settings
export ANTHROPIC_API_KEY=sk-...
export AGENTPROBE_MAX_TURNS=30

# Optional: Custom scenarios directory
export AGENTPROBE_SCENARIOS_DIR=/path/to/scenarios
```

Roadmap:

- Core execution engine
- Simple scenario format
- Basic analysis
- Package release
- Parallel test execution
- Comparison reports (multiple runs)
- Cost optimization features
- CI/CD integration examples
- Web dashboard for results
- Scenario sharing platform
- Multi-model support (GPT-4, local models)
- Simplicity First - Plain text scenarios, minimal configuration
- Tool Agnostic - No tool-specific code in core
- Extensible - Easy to add new CLIs without code changes
- Practical - Focus on real-world CLI testing needs
- Never include credentials in scenarios
- Run tests in isolated environments
- AgentProbe doesn't provide sandboxing
- Review Claude's actions with the `--verbose` flag
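Since AgentProbe doesn't provide sandboxing, one low-tech way to isolate a run is to copy the project into a throwaway directory and point the test there. A sketch (`isolated_copy` is a hypothetical helper, not part of AgentProbe):

```python
import shutil
import tempfile
from pathlib import Path

def isolated_copy(project_dir: str) -> Path:
    """Copy a project into a scratch directory so agent actions
    can't touch the original checkout (not a real sandbox)."""
    scratch = Path(tempfile.mkdtemp(prefix="agentprobe-"))
    target = scratch / Path(project_dir).name
    shutil.copytree(project_dir, target)
    return target
```

Pass the returned path as `work_dir` and delete the scratch directory afterwards; this contains file changes but not network access or credentials.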
We welcome contributions:
- New example scenarios for popular CLIs
- Bug reports and feature requests
- Documentation improvements
MIT License - See LICENSE file for details.