A lightweight, zero-dependency Airflow provider that monitors DAG and task health by querying the metadata database.
No Prometheus. No Grafana. No Datadog. Just pip install and go.
| Detector | What it catches | How it works |
|---|---|---|
| Runtime anomaly | Tasks running unusually slow or fast | IQR-based outlier detection on task durations |
| Failure spike | Sudden increase in DAG failure rate | Compares recent failure rate vs historical baseline |
| Missed deadline | DAG runs taking too long | Flags running DAGs exceeding N× their median duration |
| Stuck task | Zombie or hung tasks | Flags tasks in running state beyond N× their historical max |
| Schedule anomaly | Tasks starting or ending at unusual times | IQR-based outlier detection on time-of-day (handles midnight wraparound) |
- Apache Airflow >= 3.0.0
- Python >= 3.10
- Any SQL metadata database supported by Airflow (PostgreSQL, MySQL, SQLite)
```
pip install airflow-provider-watchdog
```

That's it. The provider auto-registers:

- An `airflow_watchdog_monitor` DAG that runs every 30 minutes (configurable)
- A `/watchdog/` dashboard accessible from the Airflow UI under Browse → Watchdog
Set an Airflow Variable called `watchdog_config` with a JSON object. All fields are optional — sensible defaults apply.
```json
{
  "schedule_interval_minutes": 30,
  "lookback_runs": 20,
  "runtime_iqr_multiplier": 1.5,
  "failure_window_runs": 10,
  "failure_baseline_runs": 50,
  "failure_spike_ratio": 2.0,
  "deadline_multiplier": 2.0,
  "stuck_multiplier": 2.0,
  "schedule_iqr_multiplier": 1.5,
  "exclude_dags": [],
  "disable_detectors": [],
  "dag_overrides": {
    "my_dag": {
      "disable_detectors": ["schedule_anomaly"]
    }
  },
  "alert_emails": ["team@example.com"],
  "alert_slack_webhook": "https://hooks.slack.com/services/...",
  "alert_teams_webhook": "https://outlook.office.com/webhook/...",
  "alert_discord_webhook": "https://discord.com/api/webhooks/..."
}
```

| Field | Default | Description |
|---|---|---|
| `schedule_interval_minutes` | `30` | How often the watchdog DAG runs |
| `lookback_runs` | `20` | Number of recent runs used for statistical baselines |
| `runtime_iqr_multiplier` | `1.5` | IQR multiplier for runtime anomaly fences |
| `failure_window_runs` | `10` | Recent window size for failure rate calculation |
| `failure_baseline_runs` | `50` | Historical baseline size for failure rate comparison |
| `failure_spike_ratio` | `2.0` | Alert when recent rate exceeds this × baseline rate |
| `deadline_multiplier` | `2.0` | Alert when DAG run exceeds this × median duration |
| `stuck_multiplier` | `2.0` | Alert when task exceeds this × historical max duration |
| `schedule_iqr_multiplier` | `1.5` | IQR multiplier for start/end time-of-day fences |
| `exclude_dags` | `[]` | DAG IDs to skip (`airflow_watchdog_monitor` is always excluded) |
| `disable_detectors` | `[]` | Detector names to disable globally (e.g. `["schedule_anomaly"]`) |
| `dag_overrides` | `{}` | Per-DAG overrides: `{"dag_id": {"disable_detectors": [...]}}` |
| `alert_emails` | `[]` | Email addresses for alert notifications |
| `alert_slack_webhook` | `null` | Slack incoming webhook URL |
| `alert_teams_webhook` | `null` | MS Teams incoming webhook URL |
| `alert_discord_webhook` | `null` | Discord incoming webhook URL |
```
┌────────────────────────────────────────────────────────┐
│              airflow_watchdog_monitor DAG              │
│                                                        │
│  ┌─────────┐ ┌────────┐ ┌────────┐ ┌─────┐ ┌────────┐  │
│  │ Runtime │ │Failure │ │Deadline│ │Stuck│ │Schedule│  │
│  │Detector │ │Detector│ │Detector│ │Det. │ │Detector│  │
│  └────┬────┘ └───┬────┘ └───┬────┘ └──┬──┘ └───┬────┘  │
│       │          │          │         │        │       │
│       └──────────┴──────────┴─────────┴────────┘       │
│                             │                          │
│                    ┌────────▼────────┐                 │
│                    │    Alerting     │                 │
│                    │ Log/Email/Slack │                 │
│                    └────────┬────────┘                 │
│                             │                          │
│                    ┌────────▼────────┐                 │
│                    │ XCom (results)  │                 │
│                    └─────────────────┘                 │
└────────────────────────────────────────────────────────┘
                              │
                     ┌────────▼────────┐
                     │   /watchdog/    │
                     │    Dashboard    │
                     │    (FastAPI)    │
                     └─────────────────┘
```
Runtime anomaly (IQR): For each (dag_id, task_id), the detector computes Q1, Q3, and IQR from the last N successful runs. If the most recent duration falls outside [Q1 - 1.5×IQR, Q3 + 1.5×IQR], it's flagged. This is more robust than z-score because outliers don't skew the baseline.
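As an illustration only (this is not the provider's actual code), the IQR fence check reduces to a few lines of standard-library Python:

```python
import statistics

def runtime_is_anomalous(durations, latest, multiplier=1.5):
    """True if `latest` falls outside the IQR fences built from `durations`.

    `durations` is the baseline of recent successful run durations (seconds);
    `multiplier` corresponds to the `runtime_iqr_multiplier` config field.
    """
    # n=4 yields the three quartile cut points [Q1, median, Q3]
    q1, _, q3 = statistics.quantiles(durations, n=4)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    return latest < lower or latest > upper
```

Because the fences are built from quartiles, a single extreme duration in the baseline barely moves them, which is exactly the robustness advantage over mean/standard-deviation approaches.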
Failure spike: Compares the failure rate in the last 10 runs against the rate in the last 50 runs. If the recent rate exceeds 2× baseline, it fires. Also catches DAGs that suddenly start failing when they historically never did.
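A hedged sketch of that comparison (window sizes and the zero-baseline special case are assumptions based on the defaults above):

```python
def failure_spike(recent_states, baseline_states, ratio=2.0):
    """True if the recent failure rate exceeds `ratio` x the baseline rate.

    `recent_states` is the last `failure_window_runs` run states (e.g. 10),
    `baseline_states` the last `failure_baseline_runs` (e.g. 50).
    """
    recent_rate = recent_states.count("failed") / len(recent_states)
    baseline_rate = baseline_states.count("failed") / len(baseline_states)
    if baseline_rate == 0:
        # A DAG that historically never failed: any failure is a spike.
        return recent_rate > 0
    return recent_rate > ratio * baseline_rate
```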
Missed deadline: Checks currently-running DAG runs and compares their elapsed time against 2× median historical duration. Catches DAGs that are silently hanging.
Stuck task: Checks currently-running task instances against 2× historical max duration for that specific task. Catches zombie tasks, hung queries, and unresponsive external calls.
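Both the deadline and stuck-task checks reduce to the same comparison of elapsed time against a multiple of a historical statistic. A minimal sketch (function name and signature are hypothetical):

```python
from datetime import datetime, timezone

def exceeds_threshold(start_time, baseline_seconds, multiplier=2.0, now=None):
    """True if a still-running DAG run or task has outlived its threshold.

    For the deadline detector, `baseline_seconds` is the median historical
    run duration; for the stuck-task detector, it is the historical max
    duration of that specific task.
    """
    now = now or datetime.now(timezone.utc)
    elapsed = (now - start_time).total_seconds()
    return elapsed > multiplier * baseline_seconds
```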
Schedule anomaly (IQR): For each (dag_id, task_id), converts start and end times to minutes-since-midnight and computes IQR fences. Flags tasks that started or ended at an unusual time-of-day. Handles midnight wraparound (e.g. tasks normally running between 23:30–00:30).
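One common way to handle that wraparound is to shift all times by 12 hours when the raw spread suggests the window straddles midnight, then apply the ordinary IQR fences. This sketch shows the idea; it is an assumed approach, not necessarily the provider's exact implementation:

```python
import statistics

MINUTES_PER_DAY = 1440

def time_of_day_anomalous(baseline_minutes, latest, multiplier=1.5):
    """IQR fence check on minutes-since-midnight, tolerating wraparound."""
    half_day = MINUTES_PER_DAY // 2
    values = list(baseline_minutes)
    if max(values) - min(values) > half_day:
        # A spread over 12h suggests the window straddles midnight
        # (e.g. 23:30 vs 00:30), so rotate everything by 12 hours.
        values = [(m + half_day) % MINUTES_PER_DAY for m in values]
        latest = (latest + half_day) % MINUTES_PER_DAY
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return latest < q1 - multiplier * iqr or latest > q3 + multiplier * iqr
```

After the rotation, a 23:30–00:30 window becomes a contiguous 11:30–12:30 window, and the usual fences apply.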
The dashboard is available at /watchdog/ in the Airflow webserver. It shows:
- Summary cards: total DAGs, healthy, warning, critical counts
- DAG health table: sorted with problems at the top
- Per-DAG alerts with severity indicators
- Auto-refreshes every 60 seconds
Access it via Browse → Watchdog in the Airflow UI navbar.
Alerts are dispatched through five channels:
- Airflow task logs — always on, visible in the `airflow_watchdog_monitor` DAG run logs
- Email — via Airflow's built-in `send_email` (requires SMTP config in `airflow.cfg`)
- Slack — via incoming webhook (set `alert_slack_webhook` in config)
- MS Teams — via incoming webhook with Adaptive Card (set `alert_teams_webhook` in config)
- Discord — via incoming webhook (set `alert_discord_webhook` in config)
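For reference, a Slack incoming webhook simply accepts a JSON POST with a `text` field. The helper names below (`build_slack_payload`, `send_slack_alert`) are illustrative, not the provider's API:

```python
import json
import urllib.request

def build_slack_payload(alerts):
    """Format watchdog-style alerts as a single Slack message body."""
    lines = [f"[{a['severity'].upper()}] {a['dag_id']}: {a['message']}" for a in alerts]
    return {"text": "Airflow watchdog alerts:\n" + "\n".join(lines)}

def send_slack_alert(webhook_url, alerts):
    """POST the payload to a Slack incoming webhook URL."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(build_slack_payload(alerts)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```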
```
git clone https://github.com/Redevil10/airflow-provider-watchdog.git
cd airflow-provider-watchdog
uv sync --extra dev
uv run pytest
```

- XCom-based dashboard — alert history is limited to the latest watchdog run. A future version may store results in a dedicated table for historical trending.
- Historical alert storage (dedicated table) for trend analysis
- Sparkline charts in the dashboard showing duration trends
- Per-DAG detector enable/disable via `dag_overrides` config
- Multi-database support (PostgreSQL, MySQL, SQLite)
- GitHub Actions CI (lint, test, publish)
- Contribution to the Airflow ecosystem page
Apache License 2.0 — see LICENSE.