Autonomous AI agent that monitors a multi-layer Databricks lakehouse, reasons about anomalies using Claude API, and files structured incident reports — without hardcoded rules or expensive observability tooling.
Bad data costs organizations an estimated $12.9M annually on average (Gartner). This agent catches anomalies before they propagate from Bronze to Gold — protecting downstream business decisions from corrupted inputs.
Data pipelines break silently. A streaming table stops refreshing. A Silver enrichment job quietly fails. Downstream Gold tables go stale. Nobody knows until a business user asks why their dashboard hasn't updated — and that conversation happens at 9 AM in a Slack channel, in front of everyone.
Traditional monitoring tools solve this with rules: "alert if row count drops below X." That works — until your data changes shape, your pipeline evolves, or the threshold you set last quarter is no longer meaningful.
This agent takes a different approach.
A traditional monitoring script is like a smoke detector — it trips when a hardcoded threshold is crossed. This agent is like a fire investigator — it looks at the same signals, decides if they're worth investigating, traces the root cause, and writes a report with prioritized recommendations.
Same data. Completely different intelligence.
The key distinction: no hardcoded thresholds anywhere in this codebase. The reasoning about what looks wrong, what to investigate deeper, and what to recommend lives entirely in the LLM — not in if/else logic.
┌─────────────────────────────────────────────────────────┐
│ run_agent.py │
│ (single entrypoint) │
└──────────┬──────────────────┬──────────────────┬────────┘
│ │ │
▼ ▼ ▼
collect_metrics.py agent.py report_writer.py
[Eyes] [Brain] [Hands]
"What does the "What does it "File the incident"
lakehouse look mean? Is anything
like right now?" wrong?"
| Module | Role | What it does |
|---|---|---|
collect_metrics.py |
👁️ Eyes | Queries all 14 tables via Databricks REST API — row counts, null rates, freshness timestamps |
agent.py |
🧠 Brain | 3-turn Claude API reasoning loop — observe, investigate, report |
report_writer.py |
🤝 Hands | Persists structured incident report to local JSON file + Databricks Delta table |
run_agent.py |
🎯 Entrypoint | Single CLI command that orchestrates the full pipeline |
This is not a script. It is an agent. The difference matters.
A script executes a fixed sequence of steps regardless of what it finds. An agent observes, makes decisions, and adapts its next action based on what it learned.
Here is how the 3-turn reasoning loop works:
Turn 1 — Observe
Feed all 14 table metrics to Claude
Claude identifies suspicious tables and explains WHY
No thresholds. Pure LLM judgment.
↓
Turn 2 — Investigate
Agent fires targeted SQL queries ONLY on flagged tables
Feeds deeper data back to Claude
Claude refines its diagnosis — confirms or adjusts severity
↓
Turn 3 — Report
Claude produces structured incident JSON
Severity, root cause, recommended actions, priority order
Agent enriches with run metadata and persists
The decision of which tables to investigate, how to interpret the deeper data, and what to recommend is made by Claude — not by the engineer who wrote the code.
14 tables across Bronze / Silver / Gold layers of a Medallion Architecture lakehouse:
| Layer | Tables |
|---|---|
| 🥉 Bronze | bronze_customers, bronze_orders, bronze_products, bronze_order_items, bronze_orders_stream |
| 🥈 Silver | silver_customers_enriched, silver_order_items |
| 🥇 Gold | gold_customer_segments, gold_monthly_order_trends, gold_return_analysis, gold_revenue_by_category, gold_stream_anomalies, gold_top_customers |
| 🔧 Pipeline | pipeline_runs (meta-monitoring — is the pipeline itself healthy?) |
For each table the agent collects:
- Row count — volume anomalies, unexpected drops
- Null rates — data quality degradation on key columns
- Freshness — hours since last Delta table modification
============================================================
🔴 OVERALL SEVERITY: HIGH
📊 Tables monitored : 14
🚨 Incidents found : 4
🕐 Run ID : incident-20260422-022803
============================================================
Core transactional pipelines healthy. Streaming and customer
enrichment workflows broken for ~8 days.
INCIDENTS:
1. 🔴 [BRONZE] bronze_orders_stream
Observation : Stale 7.7 days — 402 completed, 48 returned orders
Likely cause: Streaming ingestion failure — connector stopped processing
Action : Check streaming service, verify source connectivity, restart job
2. 🟡 [SILVER] silver_customers_enriched
Observation : Stale 7.9 days — 10 records missing vs bronze layer
Likely cause: ETL job failure or external enrichment API timeout
Action : Investigate 10 missing records, check API status, trigger manually
3. 🟡 [GOLD] gold_customer_segments
Likely cause: Downstream dependency on stale silver_customers_enriched
Action : Fix Silver first — Gold will auto-resolve
4. 🔴 [GOLD] gold_stream_anomalies
Likely cause: Depends on bronze_orders_stream — cannot process new events
Action : Fix Bronze first, then backfill 7.7-day anomaly gap
IMMEDIATE ACTIONS:
1. Restart streaming ingestion pipeline for bronze_orders_stream
2. Check enrichment API and restart silver_customers_enriched ETL
3. Backfill anomaly detection for 8-day gap once streaming restored
Notice what the agent did here — it identified two root causes in a four-incident fire and prioritized them correctly. Fix Bronze streaming → Gold anomalies auto-resolves. Fix Silver enrichment → Gold segments auto-resolves. A rules-based monitor would have filed 4 equal alerts and left the on-call engineer to figure out the dependency chain at 2 AM.
The most dangerous pipeline failures are not the ones that throw errors. They are the ones that succeed silently — the job completes, no alerts fire, but the data is wrong.
To validate this agent against a real-world scenario, a silent corruption was introduced by deleting all rows from gold_revenue_by_category except one category (Electronics). The pipeline reported no errors. No threshold was breached. The table existed and was queryable.
The agent caught it immediately:
5. 🔴 [GOLD] gold_revenue_by_category
Observation : Only shows Electronics category with $421K revenue,
missing other product categories entirely
Likely cause: Category dimension table is incomplete or join logic
is filtering out non-Electronics products
Action : Verify product category mapping in bronze_products
and check aggregation logic for missing categories
More importantly — notice what the agent did not say. It did not blame an upstream failure. It correctly reasoned that Bronze and Silver layers were healthy, which meant the problem was isolated to the Gold transformation itself — a filter bug or join issue, not a pipeline connectivity problem.
That cross-layer reasoning — "upstream is fine, therefore the fault is in the transformation" — is what separates this agent from a rules-based monitor. A threshold alert would have fired on row count. Only a reasoning agent understands why the row count is wrong and where in the dependency chain to look.
Databricks Mosaic AI Agent Framework is excellent at executing predefined tool sequences — it can run SQL, check results, and take actions. But it requires you to encode the diagnostic logic explicitly.
This agent does something different. In Turn 1, Claude looks at all 14 tables holistically and forms a hypothesis about where in the dependency chain a failure originated — without being told what to look for. The targeted investigation queries in Turn 2 are fired based on that hypothesis, not a hardcoded checklist.
That is the difference between a tool-calling agent and a reasoning agent. The intelligence is not in the tools — it is in the judgment about which tools to use, in what order, and what the results mean.
Every agent run produces two outputs:
1. Local JSON file (reports/incident-<run-id>.json)
Zero-friction, always readable, great for CLI workflows and debugging.
2. Databricks Delta table (workspace.ecommerce.incident_reports)
Because incident reports are data — and data engineers store things as data.
Over time this table becomes a queryable history:
-- How often does bronze_orders_stream go stale?
SELECT generated_at, severity, incident_count
FROM workspace.ecommerce.incident_reports
ORDER BY generated_at DESCThat is observability you can chart, trend, and build alerts on. Not just a log file.
- Python 3.8+
- Databricks workspace with Unity Catalog
- Anthropic API key (console.anthropic.com)
git clone https://github.com/YOUR_USERNAME/data-incident-agent.git
cd data-incident-agent
python3 -m venv venv
source venv/bin/activate
pip install databricks-sdk requests anthropic python-dotenvCreate a .env file in the project root:
DATABRICKS_HOST=https://your-workspace.azuredatabricks.net
DATABRICKS_TOKEN=your-personal-access-token
DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/your-warehouse-id
ANTHROPIC_API_KEY=your-anthropic-api-key
⚠️ .envis in.gitignore— your secrets will never be committed.
# Full run — collect metrics, reason, write report
python run_agent.py
# Dry run — Eyes + Brain only, skip writing report
python run_agent.py --dry-run
# Output raw JSON (useful for piping to other tools)
python run_agent.py --json| Code | Meaning |
|---|---|
0 |
CLEAR / LOW / MEDIUM severity |
2 |
HIGH severity — use this to trigger Databricks Job failure alerts |
Databricks natively handles:
- DLT expectations (
@dlt.expect) for declarative quality rules - Lakehouse Monitoring for drift detection
- Workflow retry policies for self-healing jobs
What Databricks does not do:
- Reason about whether a metric deviation actually matters in context
- Understand dependency chains — which upstream failure caused which downstream staleness
- Prioritize remediation steps based on blast radius
- Explain root cause in plain language a non-engineer can act on
That reasoning layer is what this agent adds — and it is the gap that enterprise observability platforms charge $50K–$200K/year to fill.
This project is part of a larger Databricks data engineering portfolio:
| Repo | What it covers |
|---|---|
ecommerce-pipeline |
Medallion Architecture, PySpark, DLT, streaming, NL data assistant |
data-incident-agent |
Agentic AI, LLM reasoning loops, autonomous observability |
The ecommerce-pipeline lakehouse is the data source this agent monitors — making these two repos a natural pair that together demonstrate the full spectrum from pipeline engineering to AI-powered operations.
- Databricks — lakehouse platform, Delta tables, SQL warehouses, REST API
- Claude API (Anthropic) — LLM reasoning engine for the agentic loop
- Databricks SDK — auto-credential detection, warehouse discovery
- Python —
requests,python-dotenv,databricks-sdk
- v2 — DLT enrichment: Correlate incidents with DLT event log for deeper root cause (pipeline expectation failures, flow-level diagnostics)
- v3 — Slack notifications: Push HIGH severity incidents to a Slack webhook
- v4 — Scheduled Job: Run as a Databricks Job on a schedule with email alerting on exit code 2
Built by Venkat Chittoor in collaboration with Claude (Anthropic) — a demonstration that the engineers who thrive in the AI era are not those who know everything, but those who adapt fast, leverage the best tools available, and ship things that matter. The future belongs to those who adapt and adopt.
This project was independently developed by Venkat Chittoor on personal time, using personal resources, and is not affiliated with or owned by any employer or client organization.
© 2025 Venkat Chittoor. Licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
You are free to: view, learn from, share, and adapt this work for non-commercial purposes with attribution.
Commercial use — including use in client demonstrations, sales engagements, consulting deliverables, or any revenue-generating activity — requires explicit written permission from the author.
For commercial licensing inquiries: venkat.chittoor24@gmail.com