🤖 Data Incident Agent

Autonomous AI agent that monitors a multi-layer Databricks lakehouse, reasons about anomalies using Claude API, and files structured incident reports — without hardcoded rules or expensive observability tooling.

Business Impact

Bad data costs organizations an estimated $12.9M annually on average (Gartner). This agent catches anomalies before they propagate from Bronze to Gold — protecting downstream business decisions from corrupted inputs.

Data pipelines break silently. A streaming table stops refreshing. A Silver enrichment job quietly fails. Downstream Gold tables go stale. Nobody knows until a business user asks why their dashboard hasn't updated — and that conversation happens at 9 AM in a Slack channel, in front of everyone.

Traditional monitoring tools solve this with rules: "alert if row count drops below X." That works — until your data changes shape, your pipeline evolves, or the threshold you set last quarter is no longer meaningful.

This agent takes a different approach.

The Approach: A Fire Investigator, Not a Smoke Detector

A traditional monitoring script is like a smoke detector — it trips when a hardcoded threshold is crossed. This agent is like a fire investigator — it looks at the same signals, decides if they're worth investigating, traces the root cause, and writes a report with prioritized recommendations.

Same data. Completely different intelligence.

The key distinction: no hardcoded thresholds anywhere in this codebase. The reasoning about what looks wrong, what to investigate deeper, and what to recommend lives entirely in the LLM — not in if/else logic.

Architecture: Eyes → Brain → Hands

┌─────────────────────────────────────────────────────────┐
│                     run_agent.py                        │
│                  (single entrypoint)                    │
└──────────┬──────────────────┬──────────────────┬────────┘
           │                  │                  │
           ▼                  ▼                  ▼
  collect_metrics.py       agent.py        report_writer.py
       [Eyes]               [Brain]            [Hands]

  "What does the        "What does it       "File the incident"
   lakehouse look        mean? Is anything
   like right now?"      wrong?"

Module	Role	What it does
`collect_metrics.py`	👁️ Eyes	Queries all 14 tables via Databricks REST API — row counts, null rates, freshness timestamps
`agent.py`	🧠 Brain	3-turn Claude API reasoning loop — observe, investigate, report
`report_writer.py`	🤝 Hands	Persists structured incident report to local JSON file + Databricks Delta table
`run_agent.py`	🎯 Entrypoint	Single CLI command that orchestrates the full pipeline

The Agentic Loop (What Makes This Different)

This is not a script. It is an agent. The difference matters.

A script executes a fixed sequence of steps regardless of what it finds. An agent observes, makes decisions, and adapts its next action based on what it learned.

Here is how the 3-turn reasoning loop works:

Turn 1 — Observe
  Feed all 14 table metrics to Claude
  Claude identifies suspicious tables and explains WHY
  No thresholds. Pure LLM judgment.
        ↓
Turn 2 — Investigate
  Agent fires targeted SQL queries ONLY on flagged tables
  Feeds deeper data back to Claude
  Claude refines its diagnosis — confirms or adjusts severity
        ↓
Turn 3 — Report
  Claude produces structured incident JSON
  Severity, root cause, recommended actions, priority order
  Agent enriches with run metadata and persists

The decision of which tables to investigate, how to interpret the deeper data, and what to recommend is made by Claude — not by the engineer who wrote the code.

What the Agent Monitors

14 tables across Bronze / Silver / Gold layers of a Medallion Architecture lakehouse:

Layer	Tables
🥉 Bronze	`bronze_customers`, `bronze_orders`, `bronze_products`, `bronze_order_items`, `bronze_orders_stream`
🥈 Silver	`silver_customers_enriched`, `silver_order_items`
🥇 Gold	`gold_customer_segments`, `gold_monthly_order_trends`, `gold_return_analysis`, `gold_revenue_by_category`, `gold_stream_anomalies`, `gold_top_customers`
🔧 Pipeline	`pipeline_runs` (meta-monitoring — is the pipeline itself healthy?)

For each table the agent collects:

Row count — volume anomalies, unexpected drops
Null rates — data quality degradation on key columns
Freshness — hours since last Delta table modification

Sample Output

============================================================
  🔴  OVERALL SEVERITY: HIGH
  📊 Tables monitored : 14
  🚨 Incidents found  : 4
  🕐 Run ID           : incident-20260422-022803
============================================================

  Core transactional pipelines healthy. Streaming and customer
  enrichment workflows broken for ~8 days.

  INCIDENTS:

  1. 🔴 [BRONZE] bronze_orders_stream
     Observation : Stale 7.7 days — 402 completed, 48 returned orders
     Likely cause: Streaming ingestion failure — connector stopped processing
     Action      : Check streaming service, verify source connectivity, restart job

  2. 🟡 [SILVER] silver_customers_enriched
     Observation : Stale 7.9 days — 10 records missing vs bronze layer
     Likely cause: ETL job failure or external enrichment API timeout
     Action      : Investigate 10 missing records, check API status, trigger manually

  3. 🟡 [GOLD] gold_customer_segments
     Likely cause: Downstream dependency on stale silver_customers_enriched
     Action      : Fix Silver first — Gold will auto-resolve

  4. 🔴 [GOLD] gold_stream_anomalies
     Likely cause: Depends on bronze_orders_stream — cannot process new events
     Action      : Fix Bronze first, then backfill 7.7-day anomaly gap

  IMMEDIATE ACTIONS:
  1. Restart streaming ingestion pipeline for bronze_orders_stream
  2. Check enrichment API and restart silver_customers_enriched ETL
  3. Backfill anomaly detection for 8-day gap once streaming restored

Notice what the agent did here — it identified two root causes in a four-incident fire and prioritized them correctly. Fix Bronze streaming → Gold anomalies auto-resolves. Fix Silver enrichment → Gold segments auto-resolves. A rules-based monitor would have filed 4 equal alerts and left the on-call engineer to figure out the dependency chain at 2 AM.

The Silent Failure Test — Where Agentic Shines

The most dangerous pipeline failures are not the ones that throw errors. They are the ones that succeed silently — the job completes, no alerts fire, but the data is wrong.

To validate this agent against a real-world scenario, a silent corruption was introduced by deleting all rows from gold_revenue_by_category except one category (Electronics). The pipeline reported no errors. No threshold was breached. The table existed and was queryable.

The agent caught it immediately:

5. 🔴 [GOLD] gold_revenue_by_category
   Observation : Only shows Electronics category with $421K revenue,
                 missing other product categories entirely
   Likely cause: Category dimension table is incomplete or join logic
                 is filtering out non-Electronics products
   Action      : Verify product category mapping in bronze_products
                 and check aggregation logic for missing categories

More importantly — notice what the agent did not say. It did not blame an upstream failure. It correctly reasoned that Bronze and Silver layers were healthy, which meant the problem was isolated to the Gold transformation itself — a filter bug or join issue, not a pipeline connectivity problem.

That cross-layer reasoning — "upstream is fine, therefore the fault is in the transformation" — is what separates this agent from a rules-based monitor. A threshold alert would have fired on row count. Only a reasoning agent understands why the row count is wrong and where in the dependency chain to look.

Why Not Mosaic AI?

Databricks Mosaic AI Agent Framework is excellent at executing predefined tool sequences — it can run SQL, check results, and take actions. But it requires you to encode the diagnostic logic explicitly.

This agent does something different. In Turn 1, Claude looks at all 14 tables holistically and forms a hypothesis about where in the dependency chain a failure originated — without being told what to look for. The targeted investigation queries in Turn 2 are fired based on that hypothesis, not a hardcoded checklist.

That is the difference between a tool-calling agent and a reasoning agent. The intelligence is not in the tools — it is in the judgment about which tools to use, in what order, and what the results mean.

Incident Reports as Data

Every agent run produces two outputs:

1. Local JSON file (reports/incident-<run-id>.json) Zero-friction, always readable, great for CLI workflows and debugging.

2. Databricks Delta table (workspace.ecommerce.incident_reports) Because incident reports are data — and data engineers store things as data.

Over time this table becomes a queryable history:

-- How often does bronze_orders_stream go stale?
SELECT generated_at, severity, incident_count
FROM workspace.ecommerce.incident_reports
ORDER BY generated_at DESC

That is observability you can chart, trend, and build alerts on. Not just a log file.

Setup

Prerequisites

Python 3.8+
Databricks workspace with Unity Catalog
Anthropic API key (console.anthropic.com)

Installation

git clone https://github.com/YOUR_USERNAME/data-incident-agent.git
cd data-incident-agent
python3 -m venv venv
source venv/bin/activate
pip install databricks-sdk requests anthropic python-dotenv

Configuration

Create a .env file in the project root:

DATABRICKS_HOST=https://your-workspace.azuredatabricks.net
DATABRICKS_TOKEN=your-personal-access-token
DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/your-warehouse-id
ANTHROPIC_API_KEY=your-anthropic-api-key

⚠️ .env is in .gitignore — your secrets will never be committed.

Usage

# Full run — collect metrics, reason, write report
python run_agent.py

# Dry run — Eyes + Brain only, skip writing report
python run_agent.py --dry-run

# Output raw JSON (useful for piping to other tools)
python run_agent.py --json

Exit codes

Code	Meaning
`0`	CLEAR / LOW / MEDIUM severity
`2`	HIGH severity — use this to trigger Databricks Job failure alerts

Why Not Just Use Databricks Native Monitoring?

Databricks natively handles:

DLT expectations (@dlt.expect) for declarative quality rules
Lakehouse Monitoring for drift detection
Workflow retry policies for self-healing jobs

What Databricks does not do:

Reason about whether a metric deviation actually matters in context
Understand dependency chains — which upstream failure caused which downstream staleness
Prioritize remediation steps based on blast radius
Explain root cause in plain language a non-engineer can act on

That reasoning layer is what this agent adds — and it is the gap that enterprise observability platforms charge $50K–$200K/year to fill.

Project Context

This project is part of a larger Databricks data engineering portfolio:

Repo	What it covers
`ecommerce-pipeline`	Medallion Architecture, PySpark, DLT, streaming, NL data assistant
`data-incident-agent`	Agentic AI, LLM reasoning loops, autonomous observability

The ecommerce-pipeline lakehouse is the data source this agent monitors — making these two repos a natural pair that together demonstrate the full spectrum from pipeline engineering to AI-powered operations.

Tech Stack

Databricks — lakehouse platform, Delta tables, SQL warehouses, REST API
Claude API (Anthropic) — LLM reasoning engine for the agentic loop
Databricks SDK — auto-credential detection, warehouse discovery
Python — requests, python-dotenv, databricks-sdk

Roadmap

v2 — DLT enrichment: Correlate incidents with DLT event log for deeper root cause (pipeline expectation failures, flow-level diagnostics)
v3 — Slack notifications: Push HIGH severity incidents to a Slack webhook
v4 — Scheduled Job: Run as a Databricks Job on a schedule with email alerting on exit code 2

Built by Venkat Chittoor in collaboration with Claude (Anthropic) — a demonstration that the engineers who thrive in the AI era are not those who know everything, but those who adapt fast, leverage the best tools available, and ship things that matter. The future belongs to those who adapt and adopt.

License & Attribution

This project was independently developed by Venkat Chittoor on personal time, using personal resources, and is not affiliated with or owned by any employer or client organization.

You are free to: view, learn from, share, and adapt this work for non-commercial purposes with attribution.

Commercial use — including use in client demonstrations, sales engagements, consulting deliverables, or any revenue-generating activity — requires explicit written permission from the author.

For commercial licensing inquiries: venkat.chittoor24@gmail.com

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🤖 Data Incident Agent

Business Impact

The Approach: A Fire Investigator, Not a Smoke Detector

Architecture: Eyes → Brain → Hands

The Agentic Loop (What Makes This Different)

What the Agent Monitors

Sample Output

The Silent Failure Test — Where Agentic Shines

Why Not Mosaic AI?

Incident Reports as Data

Setup

Prerequisites

Installation

Configuration

Usage

Exit codes

Why Not Just Use Databricks Native Monitoring?

Project Context

Tech Stack

Roadmap

License & Attribution

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
agent.py		agent.py
collect_metrics.py		collect_metrics.py
report_writer.py		report_writer.py
run_agent.py		run_agent.py

Folders and files

Latest commit

History

Repository files navigation

🤖 Data Incident Agent

Business Impact

The Approach: A Fire Investigator, Not a Smoke Detector

Architecture: Eyes → Brain → Hands

The Agentic Loop (What Makes This Different)

What the Agent Monitors

Sample Output

The Silent Failure Test — Where Agentic Shines

Why Not Mosaic AI?

Incident Reports as Data

Setup

Prerequisites

Installation

Configuration

Usage

Exit codes

Why Not Just Use Databricks Native Monitoring?

Project Context

Tech Stack

Roadmap

License & Attribution

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages