Airflow Monitoring Agent is an advanced, autonomous multi-agent system designed to monitor, diagnose, and remediate Apache Airflow workflows. Built upon LangGraph, it orchestrates specialized AI agents that collaborate to detect failures, analyze root causes using task logs and source code, and execute recovery actions with human-in-the-loop approval.
This system decouples the reasoning engine (FastAPI Backend) from the user interface (Streamlit Frontend), enabling real-time, streaming feedback and interactive troubleshooting.
The system operates as a state machine where a shared AgentState is passed between nodes.
graph TD
Start((Start)) --> Monitor[π Monitor Agent]
Monitor -->|Failure Detected| Analyzer[π¬ Analyzer Agent]
Monitor -->|No Issues| End((End))
Analyzer -->|Report Generated| Interaction[π¬ User Interaction Agent]
Interaction -->|Wait for Input| Interaction
Interaction -->|User Decision: Action| Action[β‘ Action Agent]
Action -->|Action Complete| End
Action -->|Requires Follow-up| Interaction
| Component | Technology | Description |
|---|---|---|
| Server | FastAPI | Hosts the LangGraph workflow, WebSocket endpoints, and Airflow client services. |
| Client | Streamlit | Provides the chat UI, renders markdown reports, and manages user sessions. |
| Workflow | LangGraph | Defines the conditional routing logic between agents (e.g., should_analyze, should_act). |
| Database | JSON / Memory | Stores session states locally for persistence across UI refreshes. |
- Real-time Failure Detection: The
AirflowMonitorAgentqueries the Airflow REST API to identify failed DAG runs and Task Instances. - Context-Aware Monitoring: Supports monitoring specific DAG IDs, tasks, or a wildcard mode (
~) to scan the entire Airflow instance for recent errors. - Log Retrieval: Automatically fetches execution logs and metadata (try number, start/end dates) for failed tasks.
- Deep Diagnostics: The
ErrorAnalyzerAgentingests both the Task Logs and the DAG Source Code to understand the context of the failure. - Web-Enabled Debugging: Equipped with
DuckDuckGosearch tools, the agent can research specific error messages, library exceptions, or stack traces online to find relevant solutions. - Root Cause Identification: Generates a structured report detailing the root cause, error location, and suggested solutions.
- Interactive Decision Making: The
UserInteractionAgentpresents the analysis to the user and requests guidance (e.g., "Retry," "Skip," or "Manual Fix"). - Natural Language Interface: Users can issue commands or ask questions like "Analyze the failed task in dag_id X" or "Clear the task" directly through the chat interface.
- Action Execution: The
ActionAgentperforms the chosen operation via the Airflow API, such as Clearing Task Instances (triggering a retry) or marking tasks as success/skipped. - Validation: Verifies the result of the action and reports back to the user.
- LangGraph Orchestration: Uses a state-based graph architecture to manage the flow between monitoring, analysis, interaction, and action.
- Full Observability: Integrated with Langfuse to trace agent reasoning steps, token usage, and latency.
- Modern Stack: Built with FastAPI (Backend), Streamlit (Frontend), and WebSockets/SSE for real-time streaming.
airflow-agent/
βββ app/ # Streamlit Frontend
β βββ components/ # UI Components (Chat interface)
β βββ utils/ # Session state management
β βββ main.py # Client entry point
βββ server/ # FastAPI Backend
β βββ agents/ # Agent logic (Monitor, Analyzer, Interaction, Action)
β βββ routers/ # API & WebSocket endpoints
β βββ services/ # Airflow REST API Client
β βββ tools/ # Search tools (DuckDuckGo)
β βββ workflow/ # LangGraph state & graph definition
β βββ main.py # Server entry point
βββ sessions/ # Local storage for session data
βββ run_server.sh # Startup script for Backend
βββ run_client.sh # Startup script for Frontend
βββ .env.example # Configuration template
- Python 3.10+
- Running instance of Apache Airflow (API enabled)
- OpenAI API Key
# Clone the repository
git clone <repository-url>
cd airflow-agent
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txtCopy the example environment file and configure your credentials.
cp .env.example .envRequired .env Variables:
# --- Airflow Connection ---
AIRFLOW_HOST=http://localhost:8080
AIRFLOW_USERNAME=admin
AIRFLOW_PASSWORD=admin
# --- LLM Provider ---
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o
# --- Observability (Optional) ---
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=[https://cloud.langfuse.com](https://cloud.langfuse.com)
# --- Server Config ---
API_BASE_URL=http://localhost:8000This system requires running the backend server and frontend client simultaneously.
The server handles the heavy lifting: agent orchestration, API calls, and reasoning.
./run_server.sh
# OR manually:
# python -m uvicorn server.main:app --reload --host 0.0.0.0 --port 8000- API Docs: http://localhost:8000/docs
- Health Check: http://localhost:8000/health
The client provides the visual interface to interact with the agents.
./run_client.sh
# OR manually:
# streamlit run app/main.py --server.port 8501- Access UI: http://localhost:8501
- Start a Session: Click "+ New Session" in the sidebar.
- Monitor: Type "Check failed DAGs" or provide a specific DAG ID.
- Analyze: The agent will detect errors and generate a report automatically.
- Resolve:
- Click or type "1" (or "Retry") to clear the task and restart it.
- Click or type "2" (or "Manual") to skip automated actions.
- Click or type "3" to see the full technical report.
- Airflow API Connection: Ensure your Airflow instance is running and the credentials in
.envhave API access permissions. TheAirflowClientuses Basic Auth. - WebSocket Errors: If the UI hangs, check the browser console for WebSocket connection errors. Ensure
API_BASE_URLmatches the running server address. - LLM Rate Limits: If using OpenAI, ensure your account has sufficient credits. The
ErrorAnalyzerAgentcan generate large prompts containing logs and code.