AgentProbe

Test how well AI agents interact with your CLI tools. AgentProbe runs Claude Code against any command-line tool and reports where the agent struggles.

Quick Start

# No installation needed - run directly with uvx
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test vercel --scenario deploy

# Or install locally for development
uv sync
uv run agentprobe test vercel --scenario deploy

What It Does

AgentProbe launches Claude Code to test CLI tools and provides insights on:

  • Where agents get confused by your CLI
  • Which commands fail and why
  • How to improve your CLI's AI-friendliness

Community Benchmark

Help us build a comprehensive benchmark of CLI tools! The table below shows how well Claude Code handles various CLIs.

Tool     Scenarios  Passing  Failing  Success Rate  Last Updated
vercel   9          7        2        77.8%         2025-01-20
gh       1          1        0        100%          2025-01-20
docker   1          1        0        100%          2025-01-20

View detailed results →
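
The Success Rate column can be reproduced from the Passing and Failing counts. A minimal sketch of that arithmetic, assuming the rate is passing scenarios divided by total scenarios, rounded to one decimal place (the helper name `success_rate` is illustrative and not part of AgentProbe):

```python
def success_rate(passing: int, failing: int) -> float:
    """Return the percentage of passing scenarios, rounded to one decimal."""
    total = passing + failing
    if total == 0:
        return 0.0  # no scenarios recorded yet
    return round(100 * passing / total, 1)


# Rows from the benchmark table: (tool, passing, failing)
rows = [("vercel", 7, 2), ("gh", 1, 0), ("docker", 1, 0)]
for tool, passing, failing in rows:
    print(f"{tool}: {success_rate(passing, failing)}%")
```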

Commands

Test Individual Scenarios

# Test a specific scenario (with uvx)
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr

# With custom working directory
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test docker --scenario run-nginx --work-dir /path/to/project

# Show detailed trace with message debugging
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --verbose

Benchmark Tools

# Test all scenarios for one tool
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark vercel

# Test all available tools and scenarios
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe benchmark --all

Reports

# Generate reports (future feature)
uv run agentprobe report --format markdown --output results.md

Debugging and Verbose Output

The --verbose flag provides detailed insights into how Claude Code interacts with your CLI:

# Show full message trace with object types and attributes
uvx --from git+https://github.com/nibzard/agentprobe.git agentprobe test gh --scenario create-pr --verbose

Verbose output includes:

  • Message object types (SystemMessage, AssistantMessage, UserMessage, ResultMessage)
  • Message content and tool usage
  • SDK object attributes and debugging information
  • Full conversation trace between Claude and your CLI

Example Output

╭─ AgentProbe Results ─────────────────────────────────────╮
│ Tool: vercel | Scenario: deploy                          │
│ Status: ✓ SUCCESS | Duration: 23.4s | Cost: $0.012      │
│                                                          │
│ Summary:                                                 │
│ • Task completed successfully                            │
│ • Required 3 turns to complete                           │
│                                                          │
│ Observations:                                            │
│ • Agent used help flag to understand the CLI             │
│                                                          │
│ Recommendations:                                         │
│ • Consider improving error messages to be more actionable│
╰──────────────────────────────────────────────────────────╯

Contributing Scenarios

We welcome scenario contributions! Help us test more CLI tools:

  1. Fork this repository
  2. Add your scenarios under scenarios/<tool-name>/
  3. Run the tests and update the benchmark table
  4. Submit a PR with your results

Scenario Format

Create simple text files with clear prompts:

# scenarios/stripe/create-customer.txt
Create a new Stripe customer with email test@example.com and
add a test credit card. Return the customer ID.
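
Because a scenario is just a text file under scenarios/<tool-name>/, adding one can be scripted. The sketch below assumes the layout described above (one `.txt` file per scenario, named after the scenario). The `write_scenario` helper is hypothetical and not part of AgentProbe.

```python
from pathlib import Path


def write_scenario(root: Path, tool: str, name: str, prompt: str) -> Path:
    """Write a scenario prompt to <root>/<tool>/<name>.txt and return the path."""
    scenario_path = root / tool / f"{name}.txt"
    scenario_path.parent.mkdir(parents=True, exist_ok=True)
    # Store the prompt as plain text with a single trailing newline.
    scenario_path.write_text(prompt.strip() + "\n", encoding="utf-8")
    return scenario_path


if __name__ == "__main__":
    path = write_scenario(
        Path("scenarios"),
        "stripe",
        "create-customer",
        "Create a new Stripe customer with email test@example.com and\n"
        "add a test credit card. Return the customer ID.",
    )
    print(f"Wrote {path}")
```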

Running Benchmark Tests

# Test all scenarios for a tool
uv run agentprobe benchmark vercel

# Test all tools
uv run agentprobe benchmark --all

# Generate a report (future feature; currently a placeholder)
uv run agentprobe report --format markdown

Architecture

AgentProbe follows a simple four-component architecture:

  1. CLI Layer (cli.py) - Typer-based command interface
  2. Runner (runner.py) - Executes scenarios via Claude Code SDK
  3. Analyzer (analyzer.py) - Generic pattern analysis on execution traces
  4. Reporter (reporter.py) - Rich terminal formatting for results
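
To illustrate how the four components might hand data to one another, here is a minimal, self-contained sketch. It is not AgentProbe's actual code: the `Trace` dataclass and the `run_scenario`, `analyze`, `render`, and `main` functions are hypothetical stand-ins, and the runner is stubbed so that nothing calls the Claude Code SDK.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Hypothetical record of one scenario run."""
    tool: str
    scenario: str
    success: bool
    turns: int
    commands: list[str] = field(default_factory=list)


def run_scenario(tool: str, scenario: str) -> Trace:
    """Runner stand-in: returns a canned trace instead of invoking an agent."""
    return Trace(tool, scenario, success=True, turns=3,
                 commands=[f"{tool} --help", f"{tool} {scenario}"])


def analyze(trace: Trace) -> dict:
    """Analyzer stand-in: derive generic observations from the trace."""
    observations = []
    if any(cmd.endswith("--help") for cmd in trace.commands):
        observations.append("Agent used help flag to understand the CLI")
    return {"success": trace.success, "turns": trace.turns,
            "observations": observations}


def render(trace: Trace, analysis: dict) -> str:
    """Reporter stand-in: format the analysis as plain text."""
    status = "SUCCESS" if analysis["success"] else "FAILED"
    lines = [f"Tool: {trace.tool} | Scenario: {trace.scenario}",
             f"Status: {status} | Turns: {analysis['turns']}"]
    lines += [f"- {obs}" for obs in analysis["observations"]]
    return "\n".join(lines)


def main(tool: str, scenario: str) -> str:
    """CLI-layer stand-in: wire runner, analyzer, and reporter together."""
    trace = run_scenario(tool, scenario)
    return render(trace, analyze(trace))


print(main("vercel", "deploy"))
```

Keeping each stage a plain function over a shared record makes it straightforward to swap the stubbed runner for a real one without touching analysis or formatting.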

Requirements

  • Python 3.10+
  • uv package manager
  • Claude Code SDK (automatically installed)

Available Scenarios

The following test scenarios are currently included:

  • GitHub CLI (gh/)
    • create-pr.txt - Create pull requests
  • Vercel (vercel/)
    • deploy.txt - Deploy applications to production
    • preview-deploy.txt - Deploy to preview environment
    • init-project.txt - Initialize new project with template
    • env-setup.txt - Configure environment variables
    • list-deployments.txt - List recent deployments
    • domain-setup.txt - Add custom domain configuration
    • rollback.txt - Roll back to a previous deployment
    • logs.txt - View deployment logs
    • build-local.txt - Build project locally
  • Docker (docker/)
    • run-nginx.txt - Run nginx containers

Browse all scenarios →
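
A listing like the one above can be generated from the directory tree. The sketch below assumes scenarios live at scenarios/<tool-name>/<scenario>.txt, as described under Contributing Scenarios. The `discover_scenarios` helper is hypothetical and not necessarily how AgentProbe locates scenarios internally.

```python
from collections import defaultdict
from pathlib import Path


def discover_scenarios(root: Path) -> dict[str, list[str]]:
    """Map each tool directory under root to its sorted scenario names."""
    found: dict[str, list[str]] = defaultdict(list)
    # Only .txt files exactly one directory below root count as scenarios.
    for path in sorted(root.glob("*/*.txt")):
        found[path.parent.name].append(path.stem)
    return dict(found)


if __name__ == "__main__":
    for tool, names in discover_scenarios(Path("scenarios")).items():
        print(f"{tool}: {', '.join(names)}")
```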

Development

# Install with dev dependencies
uv sync --extra dev

# Format code
uv run black src/

# Lint code
uv run ruff check src/

# Run tests (when implemented)
uv run pytest

See TASKS.md for the development roadmap and task tracking.

Programmatic Usage

import asyncio

from agentprobe import test_cli


async def main():
    # Run the "create-pr" scenario against the GitHub CLI.
    result = await test_cli("gh", "create-pr")
    # The result is a dict describing the run's outcome, duration, and cost.
    print(f"Success: {result['success']}")
    print(f"Duration: {result['duration_seconds']}s")
    print(f"Cost: ${result['cost_usd']:.3f}")


asyncio.run(main())
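
Building on the example above, several scenarios can be run in sequence and summarized. The sketch below assumes `test_cli` returns a dict with the `success`, `duration_seconds`, and `cost_usd` keys used above. To stay runnable without Claude Code, it uses a hypothetical `fake_test_cli` stand-in with canned values. Both `fake_test_cli` and `run_batch` are illustrative names, not part of AgentProbe.

```python
import asyncio


async def fake_test_cli(tool: str, scenario: str) -> dict:
    """Hypothetical stand-in for agentprobe.test_cli; returns canned data."""
    await asyncio.sleep(0)  # yield control, as a real async call would
    return {"success": scenario != "rollback",
            "duration_seconds": 1.5,
            "cost_usd": 0.01}


async def run_batch(tool: str, scenarios: list[str], run=fake_test_cli) -> dict:
    """Run scenarios one at a time and aggregate the results."""
    results = [await run(tool, name) for name in scenarios]
    passed = sum(1 for r in results if r["success"])
    return {
        "total": len(results),
        "passed": passed,
        "failed": len(results) - passed,
        "cost_usd": sum(r["cost_usd"] for r in results),
    }


summary = asyncio.run(run_batch("vercel", ["deploy", "logs", "rollback"]))
print(f"{summary['passed']}/{summary['total']} passed, "
      f"cost ${summary['cost_usd']:.3f}")
```

To use it against a real tool, pass the imported function instead of the stub, for example `run_batch("vercel", names, run=test_cli)`.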

License

MIT