A small CrewAI-based multi-agent system for analyzing Kubernetes-style deployment logs, finding likely root causes, researching related issues online, and generating a remediation plan.
The workflow starts from a real log file, then uses three specialized AI agents in sequence:
- The Log Analyzer reads the log and extracts the main failure patterns.
- The Issue Investigator searches online for known causes and documented fixes.
- The Solution Specialist turns the findings into a practical remediation plan.
The system is designed to simulate a DevOps troubleshooting workflow with AI agents, tools, and basic guardrails.
The repository is organized as follows:
.
|-- main.py # Entry point that creates the crew and starts the workflow
|-- agents/
| `-- agents.py # Agent definitions and LLM configuration
|-- tasks/
| `-- tasks.py # Tasks, expected outputs, and guardrail validation
|-- tools/
| `-- tools.py # File reader and EXA web-search tool wiring
|-- kubernetes_log.log # Sample log used for the demo run
|-- task_outputs/ # Generated reports from the crew run
|-- requirements.txt # Python dependencies
`-- README.md
- main.py loads the sample log path and creates a Crew with three tasks.
- The first task asks the Log Analyzer to inspect the log file.
- The second task uses the investigation agent to search for similar issues online.
- The third task uses the solution agent to summarize the result as a remediation plan.
- Task outputs are saved under task_outputs/.
The current crew uses a sequential process:
- Process: sequential
- Agents: log_analyzer, issue_investigator, solution_specialist
- Tasks: analyze_logs_task, investigate_issue_task, provide_solution_task
Each task depends on the previous findings for context:
- Analyze the raw log.
- Investigate the likely root cause.
- Generate a concrete solution.
This is the core multi-agent design of the project.
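The hand-off between the three tasks can be pictured with a plain-Python analogue. This is only an illustrative sketch and does not use CrewAI: the stage functions and their return strings are hypothetical stand-ins for the real agents, and the only point is that each stage sees the outputs of the stages before it.

```python
from typing import Callable, List

# A stage receives the outputs of all earlier stages as its context.
Stage = Callable[[List[str]], str]


def analyze_logs(context: List[str]) -> str:
    # Hypothetical stand-in for the Log Analyzer's report.
    return "analysis: ImagePullBackOff detected"


def investigate_issue(context: List[str]) -> str:
    # Builds on the most recent output (the log analysis).
    return f"investigation based on [{context[-1]}]"


def provide_solution(context: List[str]) -> str:
    # Builds on the most recent output (the investigation).
    return f"plan based on [{context[-1]}]"


def run_sequential(stages: List[Stage]) -> List[str]:
    """Run stages in order, passing each one a copy of the prior outputs."""
    outputs: List[str] = []
    for stage in stages:
        outputs.append(stage(list(outputs)))
    return outputs


results = run_sequential([analyze_logs, investigate_issue, provide_solution])
for line in results:
    print(line)
```

In the real project the same ordering is handled by the crew's sequential process rather than by a hand-written loop.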
Log Analyzer role: analyze log files and identify incidents, errors, warnings, timelines, and likely root causes.
Responsibilities:
- Parse deployment and runtime log lines.
- Detect error patterns such as ImagePullBackOff, CrashLoopBackOff, and sandbox failures.
- Create a structured analysis report.
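In the project this detection is done by an LLM agent reading the log, but the kind of pattern matching involved can be sketched deterministically. The pattern table, the sandbox regex, and the sample log lines below are all assumptions made for illustration; they are not taken from kubernetes_log.log.

```python
import re
from collections import Counter

# Hypothetical pattern table. The sandbox regex is a guess at common
# kubelet wording and may need adjusting for real logs.
PATTERNS = {
    "ImagePullBackOff": re.compile(r"ImagePullBackOff"),
    "CrashLoopBackOff": re.compile(r"CrashLoopBackOff"),
    "SandboxFailure": re.compile(r"failed to create (pod )?sandbox", re.IGNORECASE),
}


def count_failure_patterns(log_text: str) -> Counter:
    """Count how many log lines match each known failure pattern."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Made-up sample lines, used only to exercise the function.
sample_log = "\n".join(
    [
        "Warning  Failed   kubelet  Error: ImagePullBackOff",
        "Warning  BackOff  kubelet  Back-off restarting failed container (CrashLoopBackOff)",
        "Warning  FailedCreatePodSandBox  kubelet  Failed to create pod sandbox: rpc error",
        "Normal   Scheduled  default-scheduler  Successfully assigned pod",
    ]
)

counts = count_failure_patterns(sample_log)
print(dict(counts))
```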
Issue Investigator role: research the identified problem using external search.
Responsibilities:
- Search the internet for related error messages.
- Gather official docs, forum posts, and known troubleshooting guidance.
- Rank likely causes and proven fixes.
Solution Specialist role: convert investigation findings into actionable remediation steps.
Responsibilities:
- Produce a step-by-step remediation plan.
- Include commands and verification steps.
- Recommend monitoring and prevention measures.
The log analyzer uses CrewAI's FileReadTool to inspect the sample log file.
The investigation agent uses EXASearchTool to search the public web for similar issues and community guidance.
The project includes two simple guardrails in tasks/tasks.py:
- The log analysis task validates that at least one error was found.
- The solution task validates that the final answer includes concrete shell command blocks.
If a guardrail fails, CrewAI retries the task. This helps reduce vague or incomplete outputs.
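The two checks could look roughly like the sketch below. It assumes CrewAI's function-guardrail convention of receiving the task output (with the text in a raw attribute) and returning a (success, payload) tuple. The function names, keyword list, and regex are hypothetical and may differ from what tasks/tasks.py actually does.

```python
import re
from types import SimpleNamespace
from typing import Any, Tuple

FENCE = "`" * 3  # built at runtime so this example contains no nested fences

# Assumed shape of a shell block: a fence tagged bash/sh/shell with a body.
SHELL_BLOCK = re.compile(FENCE + r"(?:bash|sh|shell)\s*\n.+?" + FENCE, re.DOTALL)


def validate_errors_found(output: Any) -> Tuple[bool, Any]:
    """Pass only if the analysis text mentions at least one failure keyword."""
    text = output.raw
    if re.search(r"error|fail|backoff", text, re.IGNORECASE):
        return True, text
    return False, "No errors were identified; re-check the log for failures."


def validate_has_shell_commands(output: Any) -> Tuple[bool, Any]:
    """Pass only if the plan contains at least one fenced shell block."""
    text = output.raw
    if SHELL_BLOCK.search(text):
        return True, text
    return False, "Include at least one fenced shell command block."


# SimpleNamespace stands in for the real task output object here.
good_plan = SimpleNamespace(raw="Fix:\n" + FENCE + "bash\nkubectl get pods\n" + FENCE)
vague_plan = SimpleNamespace(raw="Restart the pod and see if it helps.")
print(validate_has_shell_commands(good_plan)[0])
print(validate_has_shell_commands(vague_plan)[0])
```

Returning a message on failure gives the retried task something specific to correct, rather than a bare rejection.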
After a run, the project writes reports to task_outputs/:
- log_analysis.md
- investigation_report.md
- solution_plan.md
These files capture the different stages of the multi-agent troubleshooting process.
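A quick way to confirm that a run produced all three reports is to check the output directory. The helper below is a hypothetical convenience, not part of the repository; the demo writes into a temporary directory so it does not touch a real task_outputs/ folder.

```python
import tempfile
from pathlib import Path
from typing import List

EXPECTED_REPORTS = ["log_analysis.md", "investigation_report.md", "solution_plan.md"]


def missing_reports(output_dir: str = "task_outputs") -> List[str]:
    """Return the expected report names that are not present in output_dir."""
    base = Path(output_dir)
    return [name for name in EXPECTED_REPORTS if not (base / name).is_file()]


# Demo against a throwaway directory instead of the real task_outputs/.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_reports(tmp)
    (Path(tmp) / "log_analysis.md").write_text("# Log analysis\n", encoding="utf-8")
    after = missing_reports(tmp)

print(before)
print(after)
```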
- Use Python 3.13; the current dependency set is tested against it.
- Create a virtual environment:
  py -3.13 -m venv .venv
- Activate it in PowerShell:
  .\.venv\Scripts\Activate.ps1
- Install dependencies:
  python -m pip install -r requirements.txt
- Create a .env file in the project root and add the required API keys:
  EXA_API_KEY=your_exa_api_key
  Add any LLM provider keys required by the model configuration in agents/agents.py.
The active LLM configuration is defined in agents/agents.py.
The file currently reads OPENROUTER_MODEL into selected_model, but the agent LLM instances are configured directly in code. If you want runtime model switching, update the LLM(...) definitions to use environment variables for the model name and API key.
Security note: keep API keys in .env and avoid committing real credentials to source control.
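One possible shape for environment-driven model selection is sketched below. OPENROUTER_MODEL comes from the text above, but OPENROUTER_API_KEY and the load_llm_settings helper are assumptions for illustration; check agents/agents.py and your provider for the actual variable names.

```python
import os
from typing import Dict, Mapping, Optional


def load_llm_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read the model name and API key from the environment, failing early if unset."""
    env = os.environ if env is None else env
    model = env.get("OPENROUTER_MODEL", "").strip()
    api_key = env.get("OPENROUTER_API_KEY", "").strip()  # assumed variable name
    if not model:
        raise ValueError("OPENROUTER_MODEL is not set")
    if not api_key:
        raise ValueError("OPENROUTER_API_KEY is not set")
    return {"model": model, "api_key": api_key}


# Example with an explicit mapping so no real credentials are needed.
settings = load_llm_settings(
    {"OPENROUTER_MODEL": "example/model", "OPENROUTER_API_KEY": "placeholder"}
)
print(settings["model"])

# In agents/agents.py these values could then be passed to the LLM
# definitions, for example as LLM(model=settings["model"],
# api_key=settings["api_key"]); treat that call shape as an assumption
# to verify against the installed CrewAI version.
```

Failing early on a missing variable keeps a misconfigured run from starting agent requests that would only fail later.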
From the project root:
python main.py
The demo uses the sample input file:
kubernetes_log.log
When the run finishes, review the generated files in task_outputs/.
- The .env file is required because the agents and tools load environment variables at startup.
- EXA_API_KEY is required for online investigation.
- The configured LLM provider key must have enough quota for the agent requests to succeed.
- The current system is intentionally simple and easy to extend with more agents, more tools, or more structured output formats.