# OpsPilot

OpsPilot is an AI-powered production incident troubleshooting agent that uses the Model Context Protocol (MCP) for tool integration and context provision. It provides automated incident investigation with structured diagnostic reports and full traceability.
- MCP Architecture: All tool calls go through MCP, making the agent highly extensible
- FSM Workflow: structured investigation flow of Intake → Clarify → Investigate → Hypothesis → Verify → Recommend → Report
- Schema Validation: All LLM outputs validated with Pydantic, with automatic retry and fallback
- Observability: Full trace of all LLM and MCP tool calls in JSONL format
- Dual LLM Support: OpenAI for production, DummyLLM for offline testing/demo
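The FSM workflow above can be sketched as a simple transition table. This is an illustration only: the state names follow the list above, but the loop-back from a failed verification to investigation is an assumption, and the real logic in `fsm.py` may differ.

```python
from enum import Enum

class State(Enum):
    INTAKE = "intake"
    CLARIFY = "clarify"
    INVESTIGATE = "investigate"
    HYPOTHESIS = "hypothesis"
    VERIFY = "verify"
    RECOMMEND = "recommend"
    REPORT = "report"

# Happy-path transitions through the investigation flow.
TRANSITIONS = {
    State.INTAKE: State.CLARIFY,
    State.CLARIFY: State.INVESTIGATE,
    State.INVESTIGATE: State.HYPOTHESIS,
    State.HYPOTHESIS: State.VERIFY,
    State.VERIFY: State.RECOMMEND,
    State.RECOMMEND: State.REPORT,
}

def next_state(state: State, verified: bool = True) -> State:
    # Assumed behavior: a rejected hypothesis returns to investigation.
    if state is State.VERIFY and not verified:
        return State.INVESTIGATE
    return TRANSITIONS.get(state, State.REPORT)
```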
```
┌─────────────────────────────────────────────────────────────────┐
│                      Agent Host (opspilot)                      │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐         │
│   │    FSM      │    │    LLM      │    │   Tracer    │         │
│   │  Workflow   │    │  Provider   │    │   (JSONL)   │         │
│   └─────────────┘    └─────────────┘    └─────────────┘         │
│                                                                 │
│                         MCP Client                              │
└───────────────────────────┬─────────────────────────────────────┘
                            │ stdio
┌───────────────────────────┴─────────────────────────────────────┐
│                    MCP Server (opspilot_mcp)                    │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐         │
│   │   Tools     │    │  Resources  │    │   Storage   │         │
│   │ log_search  │    │  runbooks   │    │   data/*.   │         │
│   │ metric_query│    │             │    │             │         │
│   │ runbook_*   │    │             │    │             │         │
│   │ change_*    │    │             │    │             │         │
│   │ ticket_*    │    │             │    │             │         │
│   └─────────────┘    └─────────────┘    └─────────────┘         │
└─────────────────────────────────────────────────────────────────┘
```
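The stdio link between client and server carries newline-delimited JSON-RPC 2.0 messages, the framing MCP specifies for this transport. A minimal framing helper, purely as an illustration (the real client uses the MCP SDK rather than hand-rolled framing):

```python
import json

def frame_request(req_id: int, method: str, params: dict) -> str:
    """Serialize one JSON-RPC 2.0 request as a single newline-terminated line."""
    msg = {"jsonrpc": "2.0", "id": req_id, "method": method, "params": params}
    return json.dumps(msg) + "\n"

def parse_message(line: str) -> dict:
    """Parse one line read from the peer's stdout back into a message dict."""
    return json.loads(line)

# e.g. roughly what a tool invocation request looks like on the wire
line = frame_request(1, "tools/call", {"name": "log_search",
                                       "arguments": {"service": "api-gateway"}})
```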
```bash
# Clone the repository
git clone https://github.com/diverpet/opspilot.git
cd opspilot

# Install dependencies
pip install -e .
# Or
pip install -r requirements.txt
```

Run the MCP server standalone:

```bash
python -m opspilot_mcp.server --stdio
```

Run the agent:

```bash
# Using DummyLLM (default, no API key needed)
python -m opspilot.main \
    --service api-gateway \
    --alert "High latency alert: P99 > 2000ms" \
    --time-range 30m \
    --mcp-stdio "python -m opspilot_mcp.server --stdio"

# Using OpenAI
export LLM_PROVIDER=openai
export OPENAI_API_KEY=your-api-key
export OPENAI_MODEL=gpt-4o-mini  # optional
python -m opspilot.main \
    --service api-gateway \
    --alert "High latency alert: P99 > 2000ms" \
    --time-range 30m \
    --mcp-stdio "python -m opspilot_mcp.server --stdio"
```

Run the test suite:

```bash
python -m tests.run
```

The MCP server provides the following tools:
| Tool | Description |
|---|---|
| `log_search` | Search logs for a service with query filtering |
| `metric_query` | Query time-series metrics (latency, error_rate, cpu, etc.) |
| `runbook_lookup` | Find relevant runbooks based on symptoms |
| `change_history` | Get recent deployment/change history |
| `ticket_create` | Create incident tickets |
| Resource | Description |
|---|---|
| `runbooks://index` | List all available runbooks |
| `runbooks://{name}` | Get the content of a specific runbook |
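The `runbooks://` URI scheme above can be served by a small resolver. Below is a sketch with an in-memory store; the real server reads from the `runbooks/` directory, and the runbook names here are purely illustrative.

```python
# Stand-in for files under runbooks/; names and contents are made up.
RUNBOOKS = {
    "high-latency": "1. Check connection pool\n2. Check recent deploys",
    "error-spike": "1. Inspect error logs\n2. Roll back if correlated",
}

def read_resource(uri: str) -> str:
    # runbooks://index lists all runbooks, one name per line.
    if uri == "runbooks://index":
        return "\n".join(sorted(RUNBOOKS))
    # runbooks://{name} returns that runbook's content.
    if uri.startswith("runbooks://"):
        name = uri[len("runbooks://"):]
        if name in RUNBOOKS:
            return RUNBOOKS[name]
    raise KeyError(f"unknown resource: {uri}")
```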
To add a new tool, you only need to modify the MCP server:
- Create the tool implementation in `src/opspilot_mcp/tools/`:

```python
# src/opspilot_mcp/tools/my_new_tool.py
from ..schemas import MyToolInput, MyToolOutput

def my_new_tool(param1: str, param2: int) -> dict:
    # Validate input
    input_data = MyToolInput(param1=param1, param2=param2)
    # Execute logic
    result = {...}
    # Validate output
    output = MyToolOutput(**result)
    return output.model_dump()
```

- Register the tool in `src/opspilot_mcp/server.py`:

```python
# Add to list_tools()
Tool(
    name="my_new_tool",
    description="Description of the tool",
    inputSchema={...},
)

# Add to call_tool()
elif name == "my_new_tool":
    result = my_new_tool(arguments["param1"], arguments["param2"])
```

No changes needed in the agent code! The agent discovers tools via MCP.
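In the real code `MyToolInput`/`MyToolOutput` are Pydantic models; the stdlib stand-in below shows the same validate-in/validate-out pattern. The field names and checks are hypothetical.

```python
from dataclasses import dataclass, asdict

@dataclass
class MyToolInput:
    param1: str
    param2: int

    def __post_init__(self):
        # Reject inputs the tool cannot handle, mirroring Pydantic validation.
        if not self.param1:
            raise ValueError("param1 must be non-empty")
        if self.param2 < 0:
            raise ValueError("param2 must be >= 0")

@dataclass
class MyToolOutput:
    status: str
    matches: int

def my_new_tool(param1: str, param2: int) -> dict:
    MyToolInput(param1=param1, param2=param2)     # validate input
    result = {"status": "ok", "matches": param2}  # placeholder logic
    return asdict(MyToolOutput(**result))         # shape + serialize output
```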
# Incident Report: API Gateway High Latency
**Service:** api-gateway
**Generated:** 2024-01-15T10:35:00
**Confidence:** 85%
## Root Cause Analysis
Database connection pool exhaustion due to traffic surge post-deployment.
## Evidence Chain
1. **log_search**: Connection timeout errors detected starting 30 minutes ago
2. **metric_query**: Latency increased 5x, error rate elevated to 5%
3. **change_history**: Deployment with connection pool configuration change
## Immediate Actions
- [ ] Rollback to previous stable version
- [ ] Increase connection pool size
- [ ] Monitor error rates

```
{"ts": "2024-01-15T10:30:00", "event_type": "state", "name": "intake -> investigate", "input": "...", "output_summary": "", "ok": true, "latency_ms": 0}
{"ts": "2024-01-15T10:30:01", "event_type": "mcp_tool", "name": "log_search", "input": "{\"service\": \"api-gateway\"...}", "output_summary": "Found 15 log entries...", "ok": true, "latency_ms": 45}
{"ts": "2024-01-15T10:30:02", "event_type": "mcp_tool", "name": "metric_query", "input": "{\"service\": \"api-gateway\"...}", "output_summary": "High error rate: 5%...", "ok": true, "latency_ms": 32}
{"ts": "2024-01-15T10:30:05", "event_type": "llm", "name": "investigate", "input": "Analyze the investigation...", "output_summary": "{\"log_findings\":...", "ok": true, "latency_ms": 1500}
```

```
opspilot/
├── src/
│   ├── opspilot/                  # Agent Host
│   │   ├── agent/
│   │   │   ├── fsm.py             # State machine
│   │   │   ├── schemas.py         # Pydantic schemas
│   │   │   ├── runtime.py         # Agent runtime
│   │   │   └── prompts.py         # LLM prompts
│   │   ├── mcp_client/
│   │   │   └── client.py          # MCP client wrapper
│   │   ├── observability/
│   │   │   └── tracer.py          # Trace recording
│   │   ├── llm/
│   │   │   ├── base.py            # Base LLM interface
│   │   │   ├── openai_llm.py      # OpenAI provider
│   │   │   └── dummy_llm.py       # Offline provider
│   │   └── main.py                # CLI entry point
│   │
│   └── opspilot_mcp/              # MCP Server
│       ├── server.py              # Server entry point
│       ├── tools/                 # Tool implementations
│       ├── resources/             # Resource handlers
│       ├── storage/               # Data access layer
│       └── schemas.py             # Server schemas
│
├── data/                          # Mock data
│   ├── logs/
│   ├── metrics/
│   └── changes/
├── runbooks/                      # Runbook documents
├── artifacts/                     # Generated reports/traces
└── tests/
    ├── cases.yaml                 # Test scenarios
    └── run.py                     # Test runner
```
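The JSONL trace lines shown earlier could be produced by a tracer along these lines. This is a sketch of the pattern only; the actual `observability/tracer.py` may differ.

```python
import json
import time

class Tracer:
    def __init__(self, stream):
        self.stream = stream  # any writable text stream (file, StringIO, ...)

    def record(self, event_type: str, name: str, input: str,
               output_summary: str, ok: bool, latency_ms: int) -> None:
        # Emit one event per line, matching the JSONL fields above.
        event = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
            "event_type": event_type,
            "name": name,
            "input": input,
            "output_summary": output_summary,
            "ok": ok,
            "latency_ms": latency_ms,
        }
        self.stream.write(json.dumps(event) + "\n")
```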
| Variable | Description | Default |
|---|---|---|
| `LLM_PROVIDER` | LLM provider (`openai` or `dummy`) | `dummy` |
| `OPENAI_API_KEY` | OpenAI API key (required if using `openai`) | - |
| `OPENAI_MODEL` | OpenAI model name | `gpt-4o-mini` |
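Provider selection from these variables might look like the following sketch. The tuples are stand-ins for constructing the real provider objects, whose constructor signatures are not shown in this README.

```python
import os

def make_llm():
    # Defaults match the table: provider "dummy", model "gpt-4o-mini".
    provider = os.environ.get("LLM_PROVIDER", "dummy")
    if provider == "openai":
        api_key = os.environ["OPENAI_API_KEY"]  # required for openai
        model = os.environ.get("OPENAI_MODEL", "gpt-4o-mini")
        return ("openai", model, api_key)  # stand-in for an OpenAI provider
    return ("dummy", None, None)           # stand-in for the offline provider
```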
## License

MIT