airflow-provider-watchdog


A lightweight, zero-dependency Airflow provider that monitors DAG and task health by querying the metadata database.

No Prometheus. No Grafana. No Datadog. Just pip install and go.

What it detects

| Detector | What it catches | How it works |
|---|---|---|
| Runtime anomaly | Tasks running unusually slow or fast | IQR-based outlier detection on task durations |
| Failure spike | Sudden increase in DAG failure rate | Compares recent failure rate vs historical baseline |
| Missed deadline | DAG runs taking too long | Flags running DAGs exceeding N× their median duration |
| Stuck task | Zombie or hung tasks | Flags tasks in running state beyond N× their historical max |
| Schedule anomaly | Tasks starting or ending at unusual times | IQR-based outlier detection on time-of-day (handles midnight wraparound) |

Requirements

  • Apache Airflow >= 3.0.0
  • Python >= 3.10
  • Any SQL metadata database supported by Airflow (PostgreSQL, MySQL, SQLite)

Installation

```shell
pip install airflow-provider-watchdog
```

That's it. The provider auto-registers:

  1. An airflow_watchdog_monitor DAG that runs every 30 minutes (configurable)
  2. A /watchdog/ dashboard accessible from the Airflow UI under Browse → Watchdog

Configuration

Set an Airflow Variable called watchdog_config with a JSON object. All fields are optional — sensible defaults apply.

```json
{
    "schedule_interval_minutes": 30,
    "lookback_runs": 20,
    "runtime_iqr_multiplier": 1.5,
    "failure_window_runs": 10,
    "failure_baseline_runs": 50,
    "failure_spike_ratio": 2.0,
    "deadline_multiplier": 2.0,
    "stuck_multiplier": 2.0,
    "schedule_iqr_multiplier": 1.5,
    "exclude_dags": [],
    "disable_detectors": [],
    "dag_overrides": {
        "my_dag": {
            "disable_detectors": ["schedule_anomaly"]
        }
    },
    "alert_emails": ["team@example.com"],
    "alert_slack_webhook": "https://hooks.slack.com/services/...",
    "alert_teams_webhook": "https://outlook.office.com/webhook/...",
    "alert_discord_webhook": "https://discord.com/api/webhooks/..."
}
```
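The Variable can be set through the Airflow UI, the CLI (`airflow variables set watchdog_config '<json>'`), or programmatically. As an illustrative sketch (the config values and DAG id below are placeholders, not recommendations), the JSON can be built and validated in Python first:

```python
import json

# Illustrative config: any field left out keeps its documented default.
config = {
    "schedule_interval_minutes": 15,
    "exclude_dags": ["noisy_backfill_dag"],  # hypothetical DAG id
    "alert_emails": ["team@example.com"],
}

payload = json.dumps(config)
assert json.loads(payload) == config  # valid JSON round-trip

# Inside an Airflow environment, the Variable can then be set with:
#   from airflow.models import Variable
#   Variable.set("watchdog_config", payload)
print(payload)
```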

Configuration reference

| Field | Default | Description |
|---|---|---|
| `schedule_interval_minutes` | `30` | How often the watchdog DAG runs |
| `lookback_runs` | `20` | Number of recent runs used for statistical baselines |
| `runtime_iqr_multiplier` | `1.5` | IQR multiplier for runtime anomaly fences |
| `failure_window_runs` | `10` | Recent window size for failure rate calculation |
| `failure_baseline_runs` | `50` | Historical baseline size for failure rate comparison |
| `failure_spike_ratio` | `2.0` | Alert when recent rate exceeds this × baseline rate |
| `deadline_multiplier` | `2.0` | Alert when DAG run exceeds this × median duration |
| `stuck_multiplier` | `2.0` | Alert when task exceeds this × historical max duration |
| `schedule_iqr_multiplier` | `1.5` | IQR multiplier for start/end time-of-day fences |
| `exclude_dags` | `[]` | DAG IDs to skip (`airflow_watchdog_monitor` is always excluded) |
| `disable_detectors` | `[]` | Detector names to disable globally (e.g. `["schedule_anomaly"]`) |
| `dag_overrides` | `{}` | Per-DAG overrides: `{"dag_id": {"disable_detectors": [...]}}` |
| `alert_emails` | `[]` | Email addresses for alert notifications |
| `alert_slack_webhook` | `null` | Slack incoming webhook URL |
| `alert_teams_webhook` | `null` | MS Teams incoming webhook URL |
| `alert_discord_webhook` | `null` | Discord incoming webhook URL |

How it works

Architecture

┌───────────────────────────────────────────────────────┐
│  airflow_watchdog_monitor DAG                         │
│                                                       │
│  ┌─────────┐ ┌────────┐ ┌────────┐ ┌─────┐ ┌────────┐ │
│  │ Runtime │ │Failure │ │Deadline│ │Stuck│ │Schedule│ │
│  │Detector │ │Detector│ │Detector│ │Det. │ │Detector│ │
│  └────┬────┘ └───┬────┘ └───┬────┘ └──┬──┘ └───┬────┘ │
│       │          │          │         │        │      │
│       └──────────┴──────────┼─────────┴────────┘      │
│                             │                         │
│                    ┌────────▼────────┐                │
│                    │    Alerting     │                │
│                    │ Log/Email/Slack │                │
│                    └────────┬────────┘                │
│                             │                         │
│                    ┌────────▼────────┐                │
│                    │  XCom (results) │                │
│                    └─────────────────┘                │
└───────────────────────────────────────────────────────┘
                              │
                     ┌────────▼────────┐
                     │   /watchdog/    │
                     │   Dashboard     │
                     └─────────────────┘

Detection methods

Runtime anomaly (IQR): For each (dag_id, task_id), the detector computes Q1, Q3, and IQR from the last N successful runs. If the most recent duration falls outside [Q1 - 1.5×IQR, Q3 + 1.5×IQR], it's flagged. This is more robust than z-score because outliers don't skew the baseline.
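As a rough illustration of the fence logic (a sketch of the technique, not the provider's actual code), quartiles can be computed with the standard library:

```python
import statistics

def iqr_fences(durations, multiplier=1.5):
    """Compute [lower, upper] IQR fences from historical durations (seconds)."""
    # statistics.quantiles(n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(durations, n=4)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

def is_runtime_anomaly(history, latest, multiplier=1.5):
    lower, upper = iqr_fences(history, multiplier)
    return latest < lower or latest > upper

history = [60, 62, 58, 61, 59, 63, 60, 62]  # typical ~1-minute task
print(is_runtime_anomaly(history, 61))      # within fences: not flagged
print(is_runtime_anomaly(history, 300))     # 5-minute outlier: flagged
```

Because the fences come from quartiles, one extreme duration in the history barely moves them, which is the robustness advantage over a mean/standard-deviation (z-score) baseline.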

Failure spike: Compares the failure rate in the last 10 runs against the rate in the last 50 runs. If the recent rate exceeds 2× baseline, it fires. Also catches DAGs that suddenly start failing when they historically never did.
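The comparison reduces to two failure rates and a ratio. A minimal sketch (illustrative names, assuming run states arrive as strings):

```python
def failure_spike(recent_states, baseline_states, ratio=2.0):
    """recent/baseline are lists of run states, e.g. ["success", "failed", ...]."""
    def rate(states):
        return states.count("failed") / len(states) if states else 0.0
    recent, baseline = rate(recent_states), rate(baseline_states)
    # A DAG that historically never failed fires on any recent failure;
    # otherwise fire when the recent rate exceeds ratio × the baseline rate.
    if baseline == 0.0:
        return recent > 0.0
    return recent > ratio * baseline

baseline = ["success"] * 45 + ["failed"] * 5  # 10% baseline failure rate
recent = ["failed"] * 3 + ["success"] * 7     # 30% recent rate
print(failure_spike(recent, baseline))        # True: 30% > 2 × 10%
```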

Missed deadline: Checks currently-running DAG runs and compares their elapsed time against 2× median historical duration. Catches DAGs that are silently hanging.

Stuck task: Checks currently-running task instances against 2× historical max duration for that specific task. Catches zombie tasks, hung queries, and unresponsive external calls.
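Both the deadline and stuck-task checks reduce to the same comparison: elapsed time versus a multiplier times a reference duration (median for deadlines, historical max for stuck tasks). A hedged sketch of that shared logic:

```python
from datetime import datetime, timedelta, timezone

def exceeds_threshold(started_at, reference_seconds, multiplier=2.0, now=None):
    """True when a still-running DAG run or task instance has been running
    longer than multiplier × its reference duration."""
    now = now or datetime.now(timezone.utc)
    elapsed = (now - started_at).total_seconds()
    return elapsed > multiplier * reference_seconds

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
started = now - timedelta(minutes=50)
# Median historical duration 20 min: 50 min elapsed > 2 × 20 min
print(exceeds_threshold(started, 20 * 60, multiplier=2.0, now=now))  # True
```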

Schedule anomaly (IQR): For each (dag_id, task_id), converts start and end times to minutes-since-midnight and computes IQR fences. Flags tasks that started or ended at an unusual time-of-day. Handles midnight wraparound (e.g. tasks normally running between 23:30–00:30).
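One common way to handle the wraparound (a sketch of the general technique, not necessarily the provider's exact implementation) is to shift early-morning values by 24 hours whenever the sample straddles midnight, so the quartiles are computed over a contiguous range:

```python
import statistics

def unwrap_minutes(minutes):
    """If times straddle midnight (e.g. 23:30-00:30), shift values before
    noon by +1440 so the sample becomes contiguous."""
    if max(minutes) - min(minutes) > 720:  # spread > 12h suggests wraparound
        return [m + 1440 if m < 720 else m for m in minutes]
    return minutes

def is_schedule_anomaly(history_minutes, latest, multiplier=1.5):
    values = unwrap_minutes(history_minutes + [latest])
    *hist, latest_u = values
    q1, _, q3 = statistics.quantiles(hist, n=4)
    iqr = q3 - q1
    return latest_u < q1 - multiplier * iqr or latest_u > q3 + multiplier * iqr

# Task normally starts around 23:45-00:15 (minutes since midnight)
history = [23 * 60 + 50, 23 * 60 + 55, 5, 10, 23 * 60 + 45, 15, 0, 23 * 60 + 58]
print(is_schedule_anomaly(history, 3))       # ~00:03: not flagged
print(is_schedule_anomaly(history, 9 * 60))  # 09:00: flagged
```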

Dashboard

The dashboard is available at /watchdog/ in the Airflow webserver. It shows:

  • Summary cards: total DAGs, healthy, warning, critical counts
  • DAG health table: sorted with problems at the top
  • Per-DAG alerts with severity indicators
  • Auto-refresh every 60 seconds

Access it via Browse → Watchdog in the Airflow UI navbar.

Alerting

Alerts can be dispatched through up to five channels (task logging is always on; the rest are opt-in):

  1. Airflow task logs — always on, visible in the airflow_watchdog_monitor DAG run logs
  2. Email — via Airflow's built-in send_email (requires SMTP config in airflow.cfg)
  3. Slack — via incoming webhook (set alert_slack_webhook in config)
  4. MS Teams — via incoming webhook with Adaptive Card (set alert_teams_webhook in config)
  5. Discord — via incoming webhook (set alert_discord_webhook in config)

Development

```shell
git clone https://github.com/Redevil10/airflow-provider-watchdog.git
cd airflow-provider-watchdog
uv sync --extra dev
uv run pytest
```

Known limitations

  • XCom-based dashboard — alert history is limited to the latest watchdog run. A future version may store results in a dedicated table for historical trending.

Roadmap

  • Historical alert storage (dedicated table) for trend analysis
  • Sparkline charts in the dashboard showing duration trends
  • Per-DAG detector enable/disable via dag_overrides config
  • Multi-database support (PostgreSQL, MySQL, SQLite)
  • GitHub Actions CI (lint, test, publish)
  • Contribution to the Airflow ecosystem page

License

Apache License 2.0 — see LICENSE.
