Skip to content

abhishek593/telemetry-rca-bench

Repository files navigation

Telemetry RCA Bench

View the dashboard on GitHub Pages: telemetry-rca-bench dashboard

This repo evaluates whether LLMs can do PostgreSQL RCA from captured telemetry rather than a hand-picked error snippet.

Each incident spins up Postgres, triggers a failure mode, captures telemetry, and scores model diagnoses against ground truth.

What's in the benchmark

12 incidents across three difficulty levels:

  • Single-cause (01–06): missing index, lock contention, deadlock, connection exhaustion, blocked autovacuum, work_mem spill
  • Compound (07–10): multiple interacting causes where the main failure is easy to rank incorrectly
  • Hard (11–12): root cause is only weakly visible in Postgres telemetry, or outside it

2 models: GPT-5.4 and Opus-4.6

2 telemetry tiers per model:

Tier What the model sees
partial postgres.log + pg_stat_activity
full Everything in partial plus pg_locks, pg_stat_statements, pg_stat_user_tables, metrics.csv, and EXPLAIN ANALYZE plans

This lets us measure two things: how much can a model infer from logs and active-session snapshots alone, and how much does richer telemetry improve diagnosis.

Repository layout

  • incidents/: Incident definitions, setup SQL, workloads/faults, and ground truth.
  • scripts/:
  • results/: Archive of raw prompts and model responses.
  • dashboard/: React/Vite UI for comparative analysis.

How it works

Each incident follows the same pipeline:

  1. Start Postgres in Docker with benchmark config
  2. Run setup SQL (schema + seed data)
  3. Start the workload and fault injector
  4. Poll telemetry for a fixed capture window
  5. Copy the Postgres log and capture EXPLAIN (ANALYZE, BUFFERS) for slow queries
  6. Build diagnosis prompts from the captured data
  7. Run each model on both partial and full telemetry
  8. Score outputs against ground truth present in ground_truth.yaml.

Telemetry collected

Source What it captures
pg_stat_activity Backend state, waits, query text, transaction start times
pg_locks Held/requested locks with backend state
pg_stat_statements Top queries by execution time, calls, buffer and temp-block stats
pg_stat_user_tables Scan counts, tuple churn, dead tuples, vacuum timestamps
metrics.csv Connection/query activity timeline
postgres.log Raw server log
explain_plans/*.txt EXPLAIN ANALYZE for top slow queries

Setup

Prerequisites: Docker, Python 3.11+, Node.js 20+ (for dashboard)

# Python environment
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# API keys — create .env in the repo root (git-ignored)
echo "OPENAI_API_KEY=..." >> .env
echo "ANTHROPIC_API_KEY=..." >> .env

# Start Postgres
docker compose up -d

Postgres listens on localhost:5433 with database rcadb and user rcauser; see docker-compose.yml for the local benchmark config.

Running the benchmark

# Capture one incident
python incidents/01_missing_index/run.py

# Run one diagnosis
python scripts/diagnosis.py incidents/01_missing_index gpt-5.4 full
python scripts/diagnosis.py incidents/01_missing_index claude-opus partial

# Run the full matrix (all incidents x models x tiers)
python scripts/run_all.py

# Optional filters
python scripts/run_all.py --model gpt-5.4
python scripts/run_all.py --tier full
python scripts/run_all.py --incident 07_compound_query_perf

Outputs go to results/[incident]/[model]/[tier]/ (prompt.txt and response.json).

Dashboard

python scripts/precompute_ui.py   # refresh data.json + summary.json
cd dashboard && npm install && npm run dev

Static build: npm run build (configured for GitHub Pages at /telemetry-rca-bench). Live link: View the dashboard on GitHub Pages: telemetry-rca-bench dashboard

Incident corpus

Single-cause

# Incident Root Cause
01 Missing index No index on filter column, full table scan on 1M rows
02 Lock contention Long-running transaction holds row lock, blocks concurrent updates
03 Deadlock Two transactions lock rows in opposite order
04 Connection exhaustion Idle-in-transaction sessions consume all connection slots
05 Autovacuum blocked Long-running transaction prevents dead tuple cleanup
06 Work_mem spill work_mem too low, sorts and hashes spill to disk

Compound

# Incident Causal chain
07 Query perf compounding Missing index + stale statistics + low work_mem — three issues visible in a single EXPLAIN
08 Vacuum death spiral Long txn → vacuum blocked → dead tuples → bloat → stale stats → bad plans
09 Cascade pileup Slow cascade → locks held → transactions pile up → connections exhaust
10 Connection leak spiral Leaked connections hold snapshots → vacuum blocked → bloat → queries degrade

Hard

# Incident Root Cause Why it's hard
11 Noisy neighbor External CPU pressure from non-Postgres process All internal database signals look normal, so the correct answer depends on noticing the lack of an internal cause.
12 Wrong planner cost random_page_cost set to 100 forces seq scans Indexes and stats look fine, but the planner still prefers seq scans because of a bad cost setting.

Results

Incident GPT-5.4 Full GPT-5.4 Partial Opus-4.6 Full Opus-4.6 Partial
01 missing_index 10 8 10 9
02 lock_contention 10 9 10 9
03 deadlock 10 9 10 10
04 connection_exhaustion 10 9 9 9
05 autovacuum_blocked 7 8 10 9
06 work_mem_spill 8 4 10 4
07 compound_query_perf 8 4 9 7
08 compound_vacuum_spiral 7 6 9 8
09 compound_cascade_pileup 9 6 9 9
10 compound_connection_leak 7 7 9 8
11 noisy_neighbor 2 2 3 2
12 wrong_planner_cost 8 4 10 4
Average 8.0 6.3 9.0 7.3

Result analysis

Where models succeed

Incidents 01 to 04 are mostly solved by all four configurations. Partial and full stay close because the key evidence is already visible in logs and pg_stat_activity.

Where models fail

  • 11_noisy_neighbor is the hardest case. The correct answer is external CPU pressure, but every configuration invents an internal Postgres cause.
  • 12_wrong_planner_cost splits sharply by tier because the SET random_page_cost = 100 smoking gun only appears in pg_stat_statements.
  • 06_work_mem_spill also depends on full telemetry because the EXPLAIN plans show the disk spills directly.

About

Benchmark for LLM-based root cause analysis on PostgreSQL observability data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors