Telemetry RCA Bench

View the dashboard on GitHub Pages: telemetry-rca-bench dashboard

This repo evaluates whether LLMs can do PostgreSQL RCA from captured telemetry rather than a hand-picked error snippet.

Each incident spins up Postgres, triggers a failure mode, captures telemetry, and scores model diagnoses against ground truth.

What's in the benchmark

12 incidents across three difficulty levels:

Single-cause (01–06): missing index, lock contention, deadlock, connection exhaustion, blocked autovacuum, work_mem spill
Compound (07–10): multiple interacting causes where the main failure is easy to rank incorrectly
Hard (11–12): root cause is only weakly visible in Postgres telemetry, or outside it

2 models: GPT-5.4 and Opus-4.6

2 telemetry tiers per model:

Tier	What the model sees
`partial`	`postgres.log` + `pg_stat_activity`
`full`	Everything in `partial` plus `pg_locks`, `pg_stat_statements`, `pg_stat_user_tables`, `metrics.csv`, and `EXPLAIN ANALYZE` plans

This lets us measure two things: how much can a model infer from logs and active-session snapshots alone, and how much does richer telemetry improve diagnosis.

Repository layout

incidents/: Incident definitions, setup SQL, workloads/faults, and ground truth.
scripts/:
- capture.py: Catalog polling, metrics, and EXPLAIN capture.
- diagnosis.py: Prompt assembly and LLM API orchestration.
- run_all.py: Batch runner across incidents, models, and tiers.
- precompute_ui.py: Exports result data for the dashboard.
results/: Archive of raw prompts and model responses.
dashboard/: React/Vite UI for comparative analysis.

How it works

Each incident follows the same pipeline:

Start Postgres in Docker with benchmark config
Run setup SQL (schema + seed data)
Start the workload and fault injector
Poll telemetry for a fixed capture window
Copy the Postgres log and capture EXPLAIN (ANALYZE, BUFFERS) for slow queries
Build diagnosis prompts from the captured data
Run each model on both partial and full telemetry
Score outputs against ground truth present in ground_truth.yaml.

Telemetry collected

Source	What it captures
`pg_stat_activity`	Backend state, waits, query text, transaction start times
`pg_locks`	Held/requested locks with backend state
`pg_stat_statements`	Top queries by execution time, calls, buffer and temp-block stats
`pg_stat_user_tables`	Scan counts, tuple churn, dead tuples, vacuum timestamps
`metrics.csv`	Connection/query activity timeline
`postgres.log`	Raw server log
`explain_plans/*.txt`	`EXPLAIN ANALYZE` for top slow queries

Setup

Prerequisites: Docker, Python 3.11+, Node.js 20+ (for dashboard)

# Python environment
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# API keys — create .env in the repo root (git-ignored)
echo "OPENAI_API_KEY=..." >> .env
echo "ANTHROPIC_API_KEY=..." >> .env

# Start Postgres
docker compose up -d

Postgres listens on localhost:5433 with database rcadb and user rcauser; see docker-compose.yml for the local benchmark config.

Running the benchmark

# Capture one incident
python incidents/01_missing_index/run.py

# Run one diagnosis
python scripts/diagnosis.py incidents/01_missing_index gpt-5.4 full
python scripts/diagnosis.py incidents/01_missing_index claude-opus partial

# Run the full matrix (all incidents x models x tiers)
python scripts/run_all.py

# Optional filters
python scripts/run_all.py --model gpt-5.4
python scripts/run_all.py --tier full
python scripts/run_all.py --incident 07_compound_query_perf

Outputs go to results/[incident]/[model]/[tier]/ (prompt.txt and response.json).

Dashboard

python scripts/precompute_ui.py   # refresh data.json + summary.json
cd dashboard && npm install && npm run dev

Static build: npm run build (configured for GitHub Pages at /telemetry-rca-bench). Live link: View the dashboard on GitHub Pages: telemetry-rca-bench dashboard

Incident corpus

Single-cause

#	Incident	Root Cause
01	Missing index	No index on filter column, full table scan on 1M rows
02	Lock contention	Long-running transaction holds row lock, blocks concurrent updates
03	Deadlock	Two transactions lock rows in opposite order
04	Connection exhaustion	Idle-in-transaction sessions consume all connection slots
05	Autovacuum blocked	Long-running transaction prevents dead tuple cleanup
06	Work_mem spill	work_mem too low, sorts and hashes spill to disk

Compound

#	Incident	Causal chain
07	Query perf compounding	Missing index + stale statistics + low work_mem — three issues visible in a single EXPLAIN
08	Vacuum death spiral	Long txn → vacuum blocked → dead tuples → bloat → stale stats → bad plans
09	Cascade pileup	Slow cascade → locks held → transactions pile up → connections exhaust
10	Connection leak spiral	Leaked connections hold snapshots → vacuum blocked → bloat → queries degrade

Hard

#	Incident	Root Cause	Why it's hard
11	Noisy neighbor	External CPU pressure from non-Postgres process	All internal database signals look normal, so the correct answer depends on noticing the lack of an internal cause.
12	Wrong planner cost	`random_page_cost` set to 100 forces seq scans	Indexes and stats look fine, but the planner still prefers seq scans because of a bad cost setting.

Results

Incident	GPT-5.4 Full	GPT-5.4 Partial	Opus-4.6 Full	Opus-4.6 Partial
01 missing_index	10	8	10	9
02 lock_contention	10	9	10	9
03 deadlock	10	9	10	10
04 connection_exhaustion	10	9	9	9
05 autovacuum_blocked	7	8	10	9
06 work_mem_spill	8	4	10	4
07 compound_query_perf	8	4	9	7
08 compound_vacuum_spiral	7	6	9	8
09 compound_cascade_pileup	9	6	9	9
10 compound_connection_leak	7	7	9	8
11 noisy_neighbor	2	2	3	2
12 wrong_planner_cost	8	4	10	4
Average	8.0	6.3	9.0	7.3

Result analysis

Where models succeed

Incidents 01 to 04 are mostly solved by all four configurations. Partial and full stay close because the key evidence is already visible in logs and pg_stat_activity.

Where models fail

11_noisy_neighbor is the hardest case. The correct answer is external CPU pressure, but every configuration invents an internal Postgres cause.
12_wrong_planner_cost splits sharply by tier because the SET random_page_cost = 100 smoking gun only appears in pg_stat_statements.
06_work_mem_spill also depends on full telemetry because the EXPLAIN plans show the disk spills directly.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Telemetry RCA Bench

What's in the benchmark

Repository layout

How it works

Telemetry collected

Setup

Running the benchmark

Dashboard

Incident corpus

Single-cause

Compound

Hard

Results

Result analysis

Where models succeed

Where models fail

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.github/workflows		.github/workflows
config		config
dashboard		dashboard
incidents		incidents
results		results
scripts		scripts
.gitignore		.gitignore
README.md		README.md
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Telemetry RCA Bench

What's in the benchmark

Repository layout

How it works

Telemetry collected

Setup

Running the benchmark

Dashboard

Incident corpus

Single-cause

Compound

Hard

Results

Result analysis

Where models succeed

Where models fail

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages