View the dashboard on GitHub Pages: telemetry-rca-bench dashboard
This repo evaluates whether LLMs can do PostgreSQL RCA from captured telemetry rather than a hand-picked error snippet.
Each incident spins up Postgres, triggers a failure mode, captures telemetry, and scores model diagnoses against ground truth.
12 incidents across three difficulty levels:
- Single-cause (01–06): missing index, lock contention, deadlock, connection exhaustion, blocked autovacuum, work_mem spill
- Compound (07–10): multiple interacting causes where the main failure is easy to rank incorrectly
- Hard (11–12): root cause is only weakly visible in Postgres telemetry, or outside it
2 models: GPT-5.4 and Opus-4.6
2 telemetry tiers per model:
| Tier | What the model sees |
|---|---|
partial |
postgres.log + pg_stat_activity |
full |
Everything in partial plus pg_locks, pg_stat_statements, pg_stat_user_tables, metrics.csv, and EXPLAIN ANALYZE plans |
This lets us measure two things: how much can a model infer from logs and active-session snapshots alone, and how much does richer telemetry improve diagnosis.
incidents/: Incident definitions, setup SQL, workloads/faults, and ground truth.scripts/:capture.py: Catalog polling, metrics, andEXPLAINcapture.diagnosis.py: Prompt assembly and LLM API orchestration.run_all.py: Batch runner across incidents, models, and tiers.precompute_ui.py: Exports result data for the dashboard.
results/: Archive of raw prompts and model responses.dashboard/: React/Vite UI for comparative analysis.
Each incident follows the same pipeline:
- Start Postgres in Docker with benchmark config
- Run setup SQL (schema + seed data)
- Start the workload and fault injector
- Poll telemetry for a fixed capture window
- Copy the Postgres log and capture
EXPLAIN (ANALYZE, BUFFERS)for slow queries - Build diagnosis prompts from the captured data
- Run each model on both
partialandfulltelemetry - Score outputs against ground truth present in
ground_truth.yaml.
| Source | What it captures |
|---|---|
pg_stat_activity |
Backend state, waits, query text, transaction start times |
pg_locks |
Held/requested locks with backend state |
pg_stat_statements |
Top queries by execution time, calls, buffer and temp-block stats |
pg_stat_user_tables |
Scan counts, tuple churn, dead tuples, vacuum timestamps |
metrics.csv |
Connection/query activity timeline |
postgres.log |
Raw server log |
explain_plans/*.txt |
EXPLAIN ANALYZE for top slow queries |
Prerequisites: Docker, Python 3.11+, Node.js 20+ (for dashboard)
# Python environment
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# API keys — create .env in the repo root (git-ignored)
echo "OPENAI_API_KEY=..." >> .env
echo "ANTHROPIC_API_KEY=..." >> .env
# Start Postgres
docker compose up -dPostgres listens on localhost:5433 with database rcadb and user rcauser; see docker-compose.yml for the local benchmark config.
# Capture one incident
python incidents/01_missing_index/run.py
# Run one diagnosis
python scripts/diagnosis.py incidents/01_missing_index gpt-5.4 full
python scripts/diagnosis.py incidents/01_missing_index claude-opus partial
# Run the full matrix (all incidents x models x tiers)
python scripts/run_all.py
# Optional filters
python scripts/run_all.py --model gpt-5.4
python scripts/run_all.py --tier full
python scripts/run_all.py --incident 07_compound_query_perfOutputs go to results/[incident]/[model]/[tier]/ (prompt.txt and response.json).
python scripts/precompute_ui.py # refresh data.json + summary.json
cd dashboard && npm install && npm run devStatic build: npm run build (configured for GitHub Pages at /telemetry-rca-bench).
Live link: View the dashboard on GitHub Pages: telemetry-rca-bench dashboard
| # | Incident | Root Cause |
|---|---|---|
| 01 | Missing index | No index on filter column, full table scan on 1M rows |
| 02 | Lock contention | Long-running transaction holds row lock, blocks concurrent updates |
| 03 | Deadlock | Two transactions lock rows in opposite order |
| 04 | Connection exhaustion | Idle-in-transaction sessions consume all connection slots |
| 05 | Autovacuum blocked | Long-running transaction prevents dead tuple cleanup |
| 06 | Work_mem spill | work_mem too low, sorts and hashes spill to disk |
| # | Incident | Causal chain |
|---|---|---|
| 07 | Query perf compounding | Missing index + stale statistics + low work_mem — three issues visible in a single EXPLAIN |
| 08 | Vacuum death spiral | Long txn → vacuum blocked → dead tuples → bloat → stale stats → bad plans |
| 09 | Cascade pileup | Slow cascade → locks held → transactions pile up → connections exhaust |
| 10 | Connection leak spiral | Leaked connections hold snapshots → vacuum blocked → bloat → queries degrade |
| # | Incident | Root Cause | Why it's hard |
|---|---|---|---|
| 11 | Noisy neighbor | External CPU pressure from non-Postgres process | All internal database signals look normal, so the correct answer depends on noticing the lack of an internal cause. |
| 12 | Wrong planner cost | random_page_cost set to 100 forces seq scans |
Indexes and stats look fine, but the planner still prefers seq scans because of a bad cost setting. |
| Incident | GPT-5.4 Full | GPT-5.4 Partial | Opus-4.6 Full | Opus-4.6 Partial |
|---|---|---|---|---|
| 01 missing_index | 10 | 8 | 10 | 9 |
| 02 lock_contention | 10 | 9 | 10 | 9 |
| 03 deadlock | 10 | 9 | 10 | 10 |
| 04 connection_exhaustion | 10 | 9 | 9 | 9 |
| 05 autovacuum_blocked | 7 | 8 | 10 | 9 |
| 06 work_mem_spill | 8 | 4 | 10 | 4 |
| 07 compound_query_perf | 8 | 4 | 9 | 7 |
| 08 compound_vacuum_spiral | 7 | 6 | 9 | 8 |
| 09 compound_cascade_pileup | 9 | 6 | 9 | 9 |
| 10 compound_connection_leak | 7 | 7 | 9 | 8 |
| 11 noisy_neighbor | 2 | 2 | 3 | 2 |
| 12 wrong_planner_cost | 8 | 4 | 10 | 4 |
| Average | 8.0 | 6.3 | 9.0 | 7.3 |
Incidents 01 to 04 are mostly solved by all four configurations. Partial and full stay close because the key evidence is already visible in logs and pg_stat_activity.
11_noisy_neighboris the hardest case. The correct answer is external CPU pressure, but every configuration invents an internal Postgres cause.12_wrong_planner_costsplits sharply by tier because theSET random_page_cost = 100smoking gun only appears inpg_stat_statements.06_work_mem_spillalso depends on full telemetry because the EXPLAIN plans show the disk spills directly.