AgentProbe

Test how well AI agents interact with your CLI tools. AgentProbe runs Claude Code against any command-line tool and tells you where it struggles.

Quick Start

# No installation needed - run directly with uvx
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy

# Or install locally for development
uv sync
uv run agentprobe test vercel --scenario deploy

What It Does

AgentProbe launches Claude Code to test CLI tools and provides insights on:

Where agents get confused by your CLI
Which commands fail and why
How to improve your CLI's AI-friendliness

Community Benchmark

Help us build a comprehensive benchmark of CLI tools! The table below shows how well Claude Code handles various CLIs.

Tool	Scenarios	Passing	Failing	Success Rate	Last Updated
vercel	9	7	2	77.8%	2025-01-20
gh	1	1	0	100%	2025-01-20
docker	1	1	0	100%	2025-01-20
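When you update the table after a benchmark run, the Success Rate column is passing scenarios divided by total scenarios. The helper below is a small sketch for computing that value; `success_rate` is a hypothetical name and is not part of AgentProbe.

```python
def success_rate(passing: int, total: int) -> str:
    """Format passing/total as a percentage, dropping a trailing '.0'."""
    if total <= 0:
        raise ValueError("total must be positive")
    rate = 100 * passing / total
    # One decimal place, then trim "100.0" down to "100" to match the table.
    return f"{rate:.1f}".rstrip("0").rstrip(".") + "%"


# Rows copied from the table above: (tool, scenarios, passing)
rows = [("vercel", 9, 7), ("gh", 1, 1), ("docker", 1, 1)]
for tool, total, passing in rows:
    print(tool, success_rate(passing, total))
```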

View detailed results →

Commands

Test Individual Scenarios

# Test a specific scenario (with uvx)
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr

# With custom working directory
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test docker --scenario run-nginx --work-dir /path/to/project

# Show detailed trace with message debugging
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --verbose
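If you want to drive these commands from a script, one option is to assemble the argument list in Python and hand it to `subprocess.run`. The sketch below only builds and prints the command; `build_test_command` is a hypothetical helper, and it uses only the flags shown above (`--scenario`, `--work-dir`, `--verbose`).

```python
import shlex
from typing import Optional

REPO = "git+https://github.com/nibzard/agentprobe.git"


def build_test_command(tool: str, scenario: str,
                       work_dir: Optional[str] = None,
                       verbose: bool = False) -> list[str]:
    """Return the argv list for one `agentprobe test` run via uvx."""
    cmd = ["uvx", "--from", REPO, "agentprobe", "test", tool,
           "--scenario", scenario]
    if work_dir is not None:
        cmd += ["--work-dir", work_dir]
    if verbose:
        cmd.append("--verbose")
    return cmd


cmd = build_test_command("docker", "run-nginx", work_dir="/path/to/project")
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

To execute the command, pass the list to `subprocess.run(cmd)`; passing a list rather than a single string avoids shell-quoting problems.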

Benchmark Tools

# Test all scenarios for one tool
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark vercel

# Test all available tools and scenarios
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark --all

Reports

# Generate reports (future feature)
uv run agentprobe report --format markdown --output results.md

Debugging and Verbose Output

The --verbose flag provides detailed insights into how Claude Code interacts with your CLI:

# Show full message trace with object types and attributes
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --verbose

Verbose output includes:

Message object types (SystemMessage, AssistantMessage, UserMessage, ResultMessage)
Message content and tool usage
SDK object attributes and debugging information
Full conversation trace between Claude and your CLI

Example Output

╭─ AgentProbe Results ─────────────────────────────────────╮
│ Tool: vercel | Scenario: deploy                         │
│ Status: ✓ SUCCESS | Duration: 23.4s | Cost: $0.012     │
│                                                          │
│ Summary:                                                 │
│ • Task completed successfully                            │
│ • Required 3 turns to complete                          │
│                                                          │
│ Observations:                                            │
│ • Agent used help flag to understand the CLI            │
│                                                          │
│ Recommendations:                                         │
│ • Consider improving error messages to be more actionable│
╰──────────────────────────────────────────────────────────╯

Contributing Scenarios

We welcome scenario contributions! Help us test more CLI tools:

Fork this repository
Add your scenarios under scenarios/<tool-name>/
Run the tests and update the benchmark table
Submit a PR with your results

Scenario Format

Create simple text files with clear prompts:

# scenarios/stripe/create-customer.txt
Create a new Stripe customer with email test@example.com and
add a test credit card. Return the customer ID.
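Scenario files can also be generated from a script. The sketch below assumes the layout described above (`scenarios/<tool-name>/<scenario>.txt`) and assumes the first line of the example is a path label rather than file content, so only the prompt text is written. `write_scenario` is a hypothetical helper, not part of AgentProbe, and the demo writes into a temporary directory.

```python
import tempfile
from pathlib import Path


def write_scenario(root: Path, tool: str, name: str, prompt: str) -> Path:
    """Write a prompt to <root>/<tool>/<name>.txt and return the path."""
    path = root / tool / f"{name}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(prompt.strip() + "\n", encoding="utf-8")
    return path


prompt = (
    "Create a new Stripe customer with email test@example.com and\n"
    "add a test credit card. Return the customer ID."
)

# Demo in a throwaway directory; point `root` at your checkout's
# scenarios/ folder to create a real scenario file.
with tempfile.TemporaryDirectory() as tmp:
    created = write_scenario(Path(tmp) / "scenarios", "stripe",
                             "create-customer", prompt)
    print(created.relative_to(tmp).as_posix())
    print(created.read_text(encoding="utf-8"))
```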

Running Benchmark Tests

# Test all scenarios for a tool
uv run agentprobe benchmark vercel

# Test all tools
uv run agentprobe benchmark --all

# Generate a report (future feature; the command is currently a placeholder)
uv run agentprobe report --format markdown

Architecture

AgentProbe follows a simple four-component architecture:

CLI Layer (cli.py) - Typer-based command interface
Runner (runner.py) - Executes scenarios via Claude Code SDK
Analyzer (analyzer.py) - Generic pattern analysis on execution traces
Reporter (reporter.py) - Rich terminal formatting for results
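To make the data flow concrete, here is a purely illustrative sketch of how four such stages could be wired together. Every name and data shape in it (`Trace`, `run`, `analyze`, `report`, `main`) is hypothetical and is not taken from the AgentProbe source; the runner stand-in returns a canned trace instead of calling Claude Code, and the reporter stand-in emits plain text rather than Rich output.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Hypothetical record of one scenario run."""
    tool: str
    scenario: str
    success: bool
    turns: int
    messages: list[str] = field(default_factory=list)


def run(tool: str, scenario: str) -> Trace:
    """Runner stand-in: returns a canned trace, no agent is launched."""
    return Trace(tool, scenario, success=True, turns=3,
                 messages=[f"{tool} --help", f"{tool} {scenario}"])


def analyze(trace: Trace) -> list[str]:
    """Analyzer stand-in: derive observations from the recorded messages."""
    notes = []
    if any("--help" in m for m in trace.messages):
        notes.append("Agent used help flag to understand the CLI")
    return notes


def report(trace: Trace, notes: list[str]) -> str:
    """Reporter stand-in: build a plain-text summary."""
    status = "SUCCESS" if trace.success else "FAILED"
    lines = [f"Tool: {trace.tool} | Scenario: {trace.scenario}",
             f"Status: {status} | Turns: {trace.turns}"]
    lines += [f"- {n}" for n in notes]
    return "\n".join(lines)


def main(tool: str, scenario: str) -> str:
    """CLI-layer stand-in: runner -> analyzer -> reporter."""
    trace = run(tool, scenario)
    return report(trace, analyze(trace))


print(main("vercel", "deploy"))
```

Keeping each stage a function of the previous stage's output means the analyzer and reporter can be exercised with recorded traces, without launching an agent.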

Requirements

Python 3.10+
uv package manager
Claude Code SDK (automatically installed)

Available Scenarios

Current test scenarios included:

GitHub CLI (gh/)
- create-pr.txt - Create pull requests
Vercel (vercel/)
- deploy.txt - Deploy applications to production
- preview-deploy.txt - Deploy to preview environment
- init-project.txt - Initialize new project with template
- env-setup.txt - Configure environment variables
- list-deployments.txt - List recent deployments
- domain-setup.txt - Add custom domain configuration
- rollback.txt - Roll back to a previous deployment
- logs.txt - View deployment logs
- build-local.txt - Build project locally
Docker (docker/)
- run-nginx.txt - Run nginx containers

Browse all scenarios →
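Given the directory layout above (one folder per tool, one `.txt` file per scenario), the available scenarios can be enumerated with a few lines of Python. This is a sketch under that layout assumption; `discover_scenarios` is a hypothetical helper and not an AgentProbe API.

```python
import tempfile
from pathlib import Path


def discover_scenarios(root: Path) -> dict[str, list[str]]:
    """Map each tool directory under root to its sorted scenario names."""
    found: dict[str, list[str]] = {}
    for tool_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        names = sorted(f.stem for f in tool_dir.glob("*.txt"))
        if names:  # skip directories that hold no scenario files
            found[tool_dir.name] = names
    return found


# Demo against a throwaway copy of part of the layout listed above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("gh/create-pr.txt", "vercel/deploy.txt",
                "vercel/logs.txt", "docker/run-nginx.txt"):
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("placeholder prompt\n", encoding="utf-8")
    for tool, names in discover_scenarios(root).items():
        print(f"{tool}: {', '.join(names)}")
```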

Development

# Install with dev dependencies
uv sync --extra dev

# Format code
uv run black src/

# Lint code
uv run ruff check src/

# Run tests (when implemented)
uv run pytest

See TASKS.md for the development roadmap and task tracking.

Programmatic Usage

import asyncio
from agentprobe import test_cli

async def main():
    result = await test_cli("gh", "create-pr")
    print(f"Success: {result['success']}")
    print(f"Duration: {result['duration_seconds']}s")
    print(f"Cost: ${result['cost_usd']:.3f}")

asyncio.run(main())
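Building on the example above, several scenarios can be run in sequence and their results totalled. The sketch below relies only on the result keys shown above (`success`, `duration_seconds`, `cost_usd`). `run_scenarios` is a hypothetical helper, and `fake_test_cli` is a stand-in so the example runs without AgentProbe or network access; pass `agentprobe.test_cli` in its place for real runs.

```python
import asyncio
from typing import Awaitable, Callable


async def run_scenarios(test_fn: Callable[[str, str], Awaitable[dict]],
                        tool: str, scenarios: list[str]) -> dict:
    """Run scenarios one at a time and total up the result fields."""
    summary = {"passed": 0, "failed": 0,
               "duration_seconds": 0.0, "cost_usd": 0.0}
    for name in scenarios:
        result = await test_fn(tool, name)
        summary["passed" if result["success"] else "failed"] += 1
        summary["duration_seconds"] += result["duration_seconds"]
        summary["cost_usd"] += result["cost_usd"]
    return summary


async def fake_test_cli(tool: str, scenario: str) -> dict:
    """Stand-in for test_cli: canned values, 'rollback' always fails."""
    return {"success": scenario != "rollback",
            "duration_seconds": 1.5, "cost_usd": 0.01}


summary = asyncio.run(
    run_scenarios(fake_test_cli, "vercel", ["deploy", "rollback"])
)
print(summary)
```

Running scenarios sequentially keeps them from interfering with each other when they share a working directory or account state.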

License

MIT
