
SLO Agent - Service Level Objective Recommendations

The SLO Agent analyzes code and recommends appropriate Service Level Objective (SLO) targets based on service type, code patterns, and deployment context.

Overview

SLOs set clear expectations for a service's reliability and latency, and they define the error budget the service is allowed to spend. The SLO Agent helps you choose appropriate targets by analyzing:

  • Code patterns: Database calls, HTTP requests, caching, retry logic, error handling, monitoring
  • Repository structure: Tests, Docker, Kubernetes configuration, monitoring setup
  • Service type: Explicit classification or auto-detected from code patterns
  • Context: User count, criticality, business impact

Service Types

SaaS (User-Facing Services)

  • User-visible availability and latency matter directly
  • Typical targets: 99.9% availability, p99 latency < 200ms, < 0.1% error rate
  • Example: REST API, web dashboard, mobile backend

Batch Jobs

  • Time-flexible, correctness-sensitive, chained dependencies common
  • Typical targets: 95% availability, < 1 hour duration, < 5% error rate
  • Example: Data pipeline, nightly ETL, report generation

Critical Infrastructure

  • Gateway, auth, platform services that other services depend on
  • Typical targets: 99.95% availability, p99 latency < 50ms, < 0.01% error rate
  • Example: Auth service, API gateway, database proxy

Background Jobs

  • Async workers, eventually-consistent, invisible to direct users
  • Typical targets: 98% availability, task latency < 5 min, < 1% error rate
  • Example: Email worker, notification queue, async task processor
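When no explicit service type is supplied, classification falls back to code patterns. The idea can be pictured roughly like this (an illustrative sketch only; the keyword lists, bucket order, and default are assumptions, not greybeard's actual heuristics):

```python
# Illustrative sketch of service-type auto-detection. The keyword
# lists and matching order are assumptions, not greybeard's real rules.
TYPE_HINTS = {
    "critical-infra": ["gateway", "auth", "proxy", "rate limit"],
    "batch": ["cron", "etl", "pipeline", "nightly"],
    "background": ["worker", "queue", "celery", "consumer"],
}

def detect_service_type(code: str) -> str:
    lowered = code.lower()
    for service_type, keywords in TYPE_HINTS.items():
        # First matching bucket wins; dict order encodes priority.
        if any(keyword in lowered for keyword in keywords):
            return service_type
    return "saas"  # default: assume a user-facing service

detect_service_type("celery worker consuming a queue")  # → "background"
detect_service_type("@app.get('/api/users')")           # → "saas"
```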

CLI Usage

Basic SLO Check

# Analyze code from stdin
git diff main | greybeard slo-check

# Analyze a file
greybeard slo-check --file service.py

# Explicit service type
cat api.py | greybeard slo-check --context "service-type:saas"

With Context Flags

# Specify service type
greybeard slo-check --context "service-type:critical-infra"

# Multiple context flags
greybeard slo-check \
  --context "service-type:saas" \
  --context "service-name:user-api" \
  --context "users:50000"

# With repository context
greybeard slo-check \
  --repo /path/to/api \
  --context "service-type:saas"

Output Formats

# Table (default)
greybeard slo-check --output table

# JSON (for parsing/integration)
greybeard slo-check --output json

# Markdown (for documentation)
greybeard slo-check --output markdown > slo-targets.md

Code Pattern Detection

The agent analyzes code for reliability-relevant patterns:

| Pattern | Detected By | Impact |
| --- | --- | --- |
| Database calls | SELECT, INSERT, query(), execute() | Latency, availability |
| HTTP calls | requests, urllib, httpx, fetch | Timeouts, retries needed |
| Caching | redis, memcached, @cache, lru_cache | Latency optimization |
| Retry logic | retry, backoff, exponential | Failure recovery |
| Error handling | try, except, error handlers | Reliability |
| Async/await | async, await, asyncio | Concurrency model |
| Monitoring | logging, prometheus, datadog, tracing | Observability |
| Timeouts | timeout, deadline, ttl | Failure isolation |
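A minimal version of this detection can be sketched with regular expressions. The patterns below are illustrative assumptions, not the expressions greybeard actually uses:

```python
import re

# Sketch of regex-based code-pattern detection. The pattern strings
# are assumptions for illustration, not greybeard's actual detectors.
PATTERNS = {
    "has_db_calls": r"\b(SELECT|INSERT|query\(|execute\()",
    "has_http_calls": r"\b(requests\.|urllib|httpx|fetch\()",
    "has_retry_logic": r"\b(retry|backoff|exponential)",
    "has_timeout": r"\b(timeout|deadline|ttl)\b",
}

def detect_patterns(code: str) -> dict:
    """Return a signal dict like the code_indicators block shown below."""
    return {name: bool(re.search(rx, code, re.IGNORECASE))
            for name, rx in PATTERNS.items()}

signals = detect_patterns("resp = requests.get(url, timeout=5)")
# has_http_calls and has_timeout are True; the other signals are False
```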

SLO Targets by Type

SaaS

| Metric | Target | Range | Rationale |
| --- | --- | --- | --- |
| Availability | 99.9% | 99.5% - 99.95% | High user visibility; ~43 min/month acceptable downtime |
| Latency (p99) | < 200ms | < 100ms - < 500ms | Interactive; users notice slow responses |
| Error Rate | < 0.1% | < 0.05% - < 0.5% | Errors are visible; user-facing impact |

Batch

| Metric | Target | Range | Rationale |
| --- | --- | --- | --- |
| Availability | 95% | 90% - 99% | Time-flexible; retryable; ~1.5 days/month downtime acceptable |
| Job Duration (p95) | < 1 hour | < 30 min - < 4 hours | Batch windows are flexible but finite |
| Error Rate | < 5% | < 1% - < 10% | Transient failures acceptable if idempotent |

Critical Infrastructure

| Metric | Target | Range | Rationale |
| --- | --- | --- | --- |
| Availability | 99.95% | 99.9% - 99.99% | Impacts all downstream services; ~22 min/month |
| Latency (p99) | < 50ms | < 10ms - < 100ms | Critical-path multiplier; tight budget |
| Error Rate | < 0.01% | < 0.001% - < 0.1% | Cascading failures; very low tolerance |

Background Jobs

| Metric | Target | Range | Rationale |
| --- | --- | --- | --- |
| Availability | 98% | 95% - 99.5% | Users don't wait synchronously; ~14.4 hours/month |
| Task Latency (p95) | < 5 min | < 1 min - < 30 min | Eventually consistent; async context |
| Error Rate | < 1% | < 0.1% - < 5% | Async can retry; reasonable error budget |
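The downtime figures in the rationale columns follow from straightforward arithmetic. Assuming a 30-day month (43,200 minutes), the allowed-downtime budget for an availability target is:

```python
# Convert an availability target into an allowed-downtime budget,
# assuming a 30-day month (43,200 minutes).
def downtime_per_month(availability_pct: float) -> float:
    minutes_in_month = 30 * 24 * 60  # 43,200
    return minutes_in_month * (1 - availability_pct / 100)

for target in (99.9, 99.95, 95.0):
    print(f"{target}%: ~{downtime_per_month(target):.1f} min/month")
# 99.9%  → ~43.2 min/month
# 99.95% → ~21.6 min/month
# 95.0%  → ~2160 min/month (≈ 1.5 days)
```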

Output Example

{
  "service_type": "saas",
  "service_name": "user-api",
  "targets": [
    {
      "metric": "availability",
      "target": "99.9%",
      "range": ["99.5%", "99.95%"],
      "rationale": "User-facing service; high availability expected. 99.9% = ~43 min/month downtime."
    },
    {
      "metric": "latency (p99)",
      "target": "< 200ms",
      "range": ["< 100ms", "< 500ms"],
      "rationale": "Interactive service; users expect responsive performance."
    },
    {
      "metric": "error_rate",
      "target": "< 0.1%",
      "range": ["< 0.05%", "< 0.5%"],
      "rationale": "User-facing errors are visible; low error tolerance."
    }
  ],
  "context_signals": {
    "code_indicators": {
      "has_db_calls": true,
      "has_http_calls": true,
      "has_caching": true,
      "has_retry_logic": true,
      "has_error_handling": true,
      "has_async": false,
      "has_monitoring": true,
      "has_timeout": true
    }
  },
  "confidence": 0.75,
  "notes": "Database calls have caching. HTTP calls have retries and timeouts. Good observability."
}
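The JSON report is convenient to consume from scripts, for example as a CI gate that records or checks recommended targets. A small sketch, assuming only the schema shown above:

```python
import json

# Sketch: pull metric→target pairs out of a slo-check JSON report.
# Assumes only the report schema shown in the example above.
def extract_targets(report_json: str) -> dict:
    report = json.loads(report_json)
    return {t["metric"]: t["target"] for t in report["targets"]}

sample = '{"targets": [{"metric": "availability", "target": "99.9%"}]}'
print(extract_targets(sample))  # → {'availability': '99.9%'}
```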

Recommendations & Notes

The agent generates context-specific recommendations:

  • ⚠️ Error Handling: No try/except detected → add proper error handling
  • 📊 Monitoring: No logging detected → instrument with metrics and logs
  • ⏱️ Timeouts: No timeouts on external calls → prevent cascading failures
  • 💾 Caching: DB calls without cache → consider Redis/memcached
  • 🔄 Retries: HTTP calls without retry logic → add exponential backoff
  • 🔁 Dead Letter Queue: Background jobs without retries → implement DLQ
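The Timeouts and Retries recommendations combine into one common pattern: bound every external call with a deadline, and retry failures with exponential backoff. A generic sketch of that pattern (not greybeard code; `fetch_with_retry` is a hypothetical helper):

```python
import time
import urllib.request

def backoff_delays(attempts: int, base: float = 0.5) -> list:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(attempts)]

def fetch_with_retry(url: str, attempts: int = 3) -> bytes:
    delays = backoff_delays(attempts)
    for attempt in range(attempts):
        try:
            # The timeout bounds how long one slow dependency can stall us.
            with urllib.request.urlopen(url, timeout=2.0) as resp:
                return resp.read()
        except OSError:
            if attempt == attempts - 1:
                raise  # budget exhausted; surface the failure
            time.sleep(delays[attempt])
```

Adding random jitter to the delays is also common in practice, to avoid synchronized retry storms across many clients.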

Integration with greybeard

The SLO Agent integrates with greybeard's content pack system. Use SLO-specific packs for domain-specific guidance:

# Load SLO pack for SaaS perspective
greybeard analyze --pack slo-saas < code.py

# Load SLO pack for batch jobs
greybeard analyze --pack slo-batch < pipeline.py

# Load SLO pack for critical infra
greybeard analyze --pack slo-critical-infra < auth_service.py

Available SLO Packs

  • slo-saas - User-facing SaaS perspectives and heuristics
  • slo-batch - Batch job and scheduled task guidance
  • slo-critical-infra - Platform and gateway SLO thinking
  • slo-background-jobs - Async worker and queue guidance

Python API

from greybeard.agents import SLOAgent

agent = SLOAgent()

# Analyze code with context
recommendation = agent.analyze(
    code_snippet="""
    @app.get("/api/users")
    def get_users():
        return db.query(User).all()
    """,
    service_type="saas",
    context={
        "service-name": "user-api",
        "users": "50000",
    }
)

# Get targets
for target in recommendation.targets:
    print(f"{target.metric}: {target.target}")
    print(f"  Rationale: {target.rationale}")

# Get JSON for integration
import json
data = recommendation.to_dict()
print(json.dumps(data, indent=2))

Testing

The SLO Agent has comprehensive test coverage:

# Run all SLO Agent tests
pytest tests/test_slo_agent.py -v

# Run with coverage
pytest tests/test_slo_agent.py --cov=greybeard.agents.slo_agent

# Run specific test class
pytest tests/test_slo_agent.py::TestServiceTypeDetection -v

Test coverage: 93% of agent code

Test categories:

  • Basic functionality (5 tests) - Initialization, analysis, serialization
  • Service type detection (5 tests) - Auto-detect SaaS, batch, critical-infra, background
  • SLO target generation (4 tests) - Proper targets for each type
  • Code analysis (8 tests) - Pattern detection (DB, HTTP, caching, retries, etc.)
  • Repository analysis (4 tests) - Docker, K8s, tests, file counting
  • Context integration (3 tests) - Service name, explicit type, multiple flags
  • Recommendations (4 tests) - Note generation for missing patterns
  • Confidence scoring (2 tests) - Confidence ranges and type-specific scoring
  • CLI integration (1 test) - Command registration
  • Serialization (1 test) - JSON round-trip

Architecture

greybeard/
├── agents/
│   ├── __init__.py           # Exports SLOAgent, SLORecommendation, etc.
│   └── slo_agent.py          # SLOAgent implementation
├── cli_slo.py                # slo-check CLI command
├── cli.py                    # CLI integration (imports slo_check)
└── packs/
    └── slo-patterns/
        ├── saas.yaml         # SaaS SLO pack
        ├── batch.yaml        # Batch job SLO pack
        ├── critical-infra.yaml # Critical infrastructure SLO pack
        └── background-jobs.yaml # Background job SLO pack

tests/
└── test_slo_agent.py         # 37 comprehensive tests

Future Enhancements

  • Machine learning: Train on production SLOs to improve recommendations
  • Cost analysis: Estimate infrastructure cost for different SLO levels
  • Trend detection: Analyze historical performance vs. SLO targets
  • Multi-service orchestration: Recommend SLOs for service meshes
  • Budget calculations: Error budget tracking and alerts
  • LLM-powered guidance: Use LLM to provide detailed SLO recommendations based on code context

References