diff --git a/README.md b/README.md
index 189678b99..d501d5e34 100644
--- a/README.md
+++ b/README.md
@@ -1,386 +1,77 @@
-# Azure SRE Agent — Resources
+# 🚀 Welcome to the Azure SRE Agent GitHub Repository!
+We’re excited to launch this space for collaboration around the [SRE Agent](https://learn.microsoft.com/en-us/azure/sre-agent/overview), a key tool in our mission to improve service reliability and operational excellence.
-This repository is the official community hub for Azure SRE Agent. Here you'll find:
+## This repository is a community-driven hub where you can:
+* 🐛 Report bugs encountered while using the SRE Agent
+* 💡 Request features that would improve usability or functionality
+* ❓ Share challenges or feedback related to using the product
+* 🤝 Engage with the team and community to help shape the future of the SRE Agent
-- **🐛 Report Issues** — File bugs, feature requests, and feedback via [GitHub Issues](https://github.com/microsoft/sre-agent/issues)
-- **📚 Resources** — Curated links to docs, videos, blogs, and community content for Azure SRE Agent
-- **🧪 Labs** — Hands-on labs and sample environments to deploy, break, and fix apps with Azure SRE Agent (see the [`labs/`](labs/) folder)
+> [!NOTE]
+> This repo is not intended for integration-related issues. For those, please use the appropriate internal or partner support channels.
----
+## 🧼 Hygiene Guidelines for Creating Issues
-## Quick Links
+To help us keep things organized and productive, please follow these simple rules:
+* Be descriptive: Include steps to reproduce, logs, screenshots, and thread ID where applicable.
+* Use labels: Tag your issue appropriately (bug, feature-request, usability, etc.) to help with triage.
+* Avoid duplicates: Search existing issues before creating a new one.
+* Stay constructive: We welcome feedback, but please keep it respectful and focused.
+* No personal data: Please do not include any personally identifiable information (PII) in your issue.
-| Resource | Link |
-|----------|------|
-| Product Home Page | |
-| Portal (Create & Manage Agents) | |
-| Documentation | |
-| Pricing & Billing | |
-| All Blogs | |
-| YouTube Channel | |
-| GitHub — Azure SRE Agent (Report Issues, Official Labs & Resources) | |
-| Hands-on Lab | |
-| GitHub — Official Plugins | |
-| Tech Community Discussions | |
-| Agentic DevOps Live | |
-| X (Twitter) | |
+## 🧭 How to Find the Thread ID in SRE Agent
----
+Your direct chat interaction or incident is tracked as a thread in SRE Agent. Including the Thread ID in your GitHub issue helps us investigate quickly and accurately. A thread ID is a hex string like `50f7521d-dfee-487e-9188-5abdc8adde91`.
-## Featured Videos
+### 🔍 How to Locate the Thread ID:
+**Get thread ID for threads under "Activities" view**
+Screenshot 2025-10-09 at 3 22 07 PM
-### What is Azure SRE Agent — Official Overview
-The official Microsoft Azure product overview — a concise explainer of what Azure SRE Agent is, how it works, and the problems it solves.
-🔗 · 6,156 views · 158 likes
-### Microsoft AI SRE Agent: Fixing Bugs While You Sleep
-Satya Nadella highlights Azure SRE Agent as a key example of AI-driven operations transforming how engineering teams manage reliability at scale.
-🔗 · 2,548 views · 26 likes
-### Azure SRE Agent: Less Toil, More Uptime, Maximum Innovation — Azure Friday
-Scott Hanselman walks through Azure SRE Agent on Azure Friday, showing how it reduces operational toil and lets teams focus on innovation.
-🔗 · 4,264 views · 75 likes
+**Get thread ID for threads under Incident Management view**
+step 1:
+Screenshot 2025-10-09 at 3 21 51 PM
-### Root Cause Analysis with Code Context: Azure SRE Agent + GitHub Integration — GA Launch
-The GA launch video demonstrating Azure SRE Agent performing root cause analysis with full code context through deep GitHub integration.
-🔗 · 582 views · 25 likes
-### Use Azure SRE Agent to Automate Tasks and Increase Site Reliability (DEM550) — Build
-Deep-dive Build session covering end-to-end SRE Agent capabilities: automated investigation, remediation, proactive monitoring, and custom hooks.
-🔗 · 12,294 views · 129 likes
----
-## More Videos
+step 2:
+Screenshot 2025-10-09 at 3 21 37 PM
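Editorial sketch, not part of the patch: the new README asks reporters to paste a thread ID and gives `50f7521d-dfee-487e-9188-5abdc8adde91` as the example. A quick format check before filing an issue could look like the following. It assumes thread IDs always follow the 8-4-4-4-12 hex layout of that example, which the README does not state explicitly, and the function name `is_thread_id` is hypothetical.

```bash
#!/bin/sh
# Sketch only: checks that a string has the same 8-4-4-4-12 hex layout as
# the example thread ID in the README. The layout is an assumption drawn
# from that single example; is_thread_id is an illustrative name.
is_thread_id() {
  printf '%s\n' "$1" |
    grep -Eqi '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
}

# Usage: warn before pasting a malformed ID into a GitHub issue.
if is_thread_id "50f7521d-dfee-487e-9188-5abdc8adde91"; then
  echo "thread ID format looks valid"
else
  echo "thread ID does not match the expected layout" >&2
fi
```

The check only confirms the shape of the string; it cannot tell whether the thread actually exists in the SRE Agent portal.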
-- [Fix It Before They Feel It: Proactive .NET Reliability with Azure SRE Agent](https://www.youtube.com/watch?v=Kx_6SB-mhgg) β€” dotnet Β· 1,466 views -- [Azure SRE Agent - Incident Management with PagerDuty](https://www.youtube.com/watch?v=5wrArcKzUaI) β€” Azure SRE Agent (official) Β· 547 views -- [Azure SRE Agent - Your 24/7 Automated Response Team](https://www.youtube.com/watch?v=xNTvYAoWvLU) β€” Mariusz Ferdyn Β· 313 views -- [Azure's New SRE Agent Is INSANE β€” Here's Why you Should Pay Attention](https://www.youtube.com/watch?v=2QdTfBZiASc) β€” TechTalks with Gil Β· 249 views -- [SRE Agent Series: What Is Azure SRE Agent and How to Create One Step by Step](https://www.youtube.com/watch?v=dvkfsbF0wmM) β€” JBSWiki Β· 204 views -- [Azure SRE Agent Explained](https://www.youtube.com/watch?v=B93WmYLQ6PE) β€” Cloud Talk with Jonnychipz Β· 160 views -- [SRE Agent Series: I Let an Azure SRE Agent Manage My Subscription β€” Here's What Happened](https://www.youtube.com/watch?v=rfwRvTTej-o) β€” JBSWiki Β· 143 views -- [Agentic DevOps: Azure SRE Agent with GitHub Copilot Coding Agent demo](https://www.youtube.com/watch?v=ZrpxNkUQ0C8) β€” Jorge Balderas Β· new ---- -## Blogs -### Post-GA (April 2026) -- **[Event-Driven IaC Operations: Terraform Drift Detection via HTTP Triggers](https://techcommunity.microsoft.com/blog/appsonazureblog/event-driven-iac-operations-with-azure-sre-agent-terraform-drift-detection-via-h/4512233)** β€” Vineela Suri Β· 10 min read. End-to-end pipeline: Terraform Cloud webhook triggers SRE Agent to classify drift as benign/risky/critical, correlate with incidents, and ship a fix β€” including a "DO NOT revert" recommendation that prevents turning a mitigated incident into an outage. -- **[Managing Multi-Tenant Azure Resources with SRE Agent and Lighthouse](https://techcommunity.microsoft.com/blog/appsonazureblog/managing-multi%E2%80%91tenant-azure-resource-with-sre-agent-and-lighthouse/4511789)** β€” Pranab Mandal Β· 6 min read. 
Step-by-step guide to configuring Azure Lighthouse delegation so a single SRE Agent can monitor and manage resources across multiple tenants β€” covering ARM templates, RBAC roles, and managed identity setup. -- **[New in Azure SRE Agent: Log Analytics and Application Insights Connectors](https://techcommunity.microsoft.com/blog/appsonazureblog/new-in-azure-sre-agent-log-analytics-and-application-insights-connectors/4509649)** β€” Dalibor Kovacevic Β· 3 min read. Native MCP-backed connectors for Log Analytics and App Insights β€” connect a workspace, auto-grant RBAC, and the agent queries ContainerLog, Syslog, exceptions, and traces directly during investigations. -- **[Azure Monitor in Azure SRE Agent: Autonomous Alert Investigation and Intelligent Merging](https://techcommunity.microsoft.com/blog/appsonazureblog/azure-monitor-in-azure-sre-agent-autonomous-alert-investigation-and-intelligent-/4509069)** β€” Vineela Suri Β· 9 min read. Full walkthrough of Azure Monitor integration: Incident Response Plans, alert merging (7 firings β†’ 1 thread), auto-resolve trade-offs, and a live AKS + Redis scenario where the agent fixes a bad credential autonomously. -- **[3 Ways to Get More from Azure SRE Agent](https://techcommunity.microsoft.com/blog/appsonazureblog/3-ways-to-get-more-from-azure-sre-agent/4508993)** β€” dchelupati Β· 4 min read. Practical cost and value tips: start narrow with incident routing, replace high-frequency polling with push/batch patterns, and keep scheduled task threads fresh with "new chat thread for each run." -- **[How We Build and Use Azure SRE Agent with Agentic Workflows](https://techcommunity.microsoft.com/blog/appsonazureblog/how-we-build-and-use-azure-sre-agent-with-agentic-workflows/4508753)** β€” Shamir AbdulAziz Β· 6 min read. Customer Zero blog: how Microsoft embedded agents across the SDLC to build SRE Agent β€” 35K+ incidents handled, 50K+ developer hours saved, App Service time-to-mitigation down from 40.5 hours to 3 minutes. 
-- **[An Update to the Active Flow Billing Model](https://aka.ms/sreagent/pricing/blog)** β€” Mayunk Jain Β· 3 min read. Active flow billing moves from time-based to token-based usage, with per-model-provider AAU rates. Always-on pricing unchanged at 4 AAUs per agent-hour. +## πŸ“ Issue Template +When creating a new issue, please use the following format: -### GA Launch (March 2026) +**Issue Description** +Briefly describe the problem or request. -- **[Announcing General Availability for the Azure SRE Agent](https://aka.ms/sreagent/ga)** β€” Mayunk Jain Β· 4 min read. GA announcement: 1,300+ agents deployed internally at Microsoft, 35K+ incidents mitigated, 20K+ engineering hours saved. Covers deep context, built-in computation, memory and learning, and Ecolab customer story. -- **[What's New in Azure SRE Agent in the GA Release](https://aka.ms/sreagent/blog/whatsnewGA)** β€” dchelupati Β· 2 min read. Companion to the GA announcement: redesigned onboarding, deep context, code interpreter, memory, skills, subagents, Python tools, agent hooks, and MCP connectors. -- **[The Agent That Investigates Itself (SRE4SRE)](https://aka.ms/sreagent/blogs/sre4sre)** β€” Sanchit Mehta Β· 11 min read. Deep technical post β€” the SRE Agent investigating its own KV cache regression, demonstrating how the team uses the product to maintain the product. -- **[Azure SRE Agent Now Builds Expertise Like Your Best Engineer (Deep Context)](https://aka.ms/sreagent/blogs/deepcontextblog)** β€” dchelupati Β· 6 min read. How the agent operates with continuous access to source code, persistent memory across investigations, and background intelligence that runs when nobody is asking questions. -- **[What It Takes to Give SRE Agent a Useful Starting Point (Onboarding)](https://aka.ms/sreagent/blogs/onboardingtosrea)** β€” Dalibor Kovacevic Β· 10 min read. 
Designing the guided onboarding flow: connecting code, logs, incidents, Azure resources, and knowledge files so a new agent becomes useful on day one. -- **[Agent Hooks: Production-Grade Governance for Azure SRE Agent](https://aka.ms/sreagent/blogs/agenthooks)** β€” Vineela Suri Β· 9 min read. Governance primitives for controlling agent behavior: stop hooks, PostToolUse hooks, and global hooks that enforce approval gates and safety boundaries. -- **[An AI-Led SDLC: Building an End-to-End Agentic Software Development Lifecycle with Azure and GitHub](https://techcommunity.microsoft.com/blog/appsonazureblog/an-ai-led-sdlc-building-an-end-to-end-agentic-software-development-lifecycle-wit/4491896)** β€” owaino Β· 16 min read. Full agentic SDLC walkthrough: Spec-Kit β†’ GitHub Coding Agent β†’ Code Quality β†’ CI/CD β†’ SRE Agent β€” with the SRE Agent closing the loop by opening GitHub issues for the coding agent to fix. +**Agent Name** +name of Agent -### Pre-GA (December 2025) +**Subscription ID** +subscription in which agent is deployed -- **[Context Engineering: Lessons from Building Azure SRE Agent](https://techcommunity.microsoft.com/blog/appsonazureblog/context-engineering-lessons-from-building-azure-sre-agent/4481200)** β€” Sanchit Mehta Β· 8 min read. Engineering lessons: started with 100+ tools and 50+ specialized agents, ended with 5 core tools and generalist agents β€” why less is more in agent design. 
+**Region**
+Region where the agent is deployed
----
+**Resource group**
+For agent deployment-related issues, provide the resource group in which the agent was created
-## GitHub Repos
+**Thread ID**
+Paste the thread ID from the SRE Agent portal (e.g., 50f7521d-dfee-487e-9188-5abdc8adde91)
-| Repo | Stars | Description |
-|------|------:|-------------|
-| [microsoft/sre-agent](https://github.com/microsoft/sre-agent) | 83 | Official hands-on lab — sample environments, walkthroughs, and prompt guides |
-| [matthansen0/azure-sre-agent-sandbox](https://github.com/matthansen0/azure-sre-agent-sandbox) | 52 | Fully automated sandbox deployment with AKS break-fix scenarios |
-| [paulasilvatech/Agentic-Ops-Dev](https://github.com/paulasilvatech/Agentic-Ops-Dev) | 23 | Agentic Operations & Observability Workshop |
-# Azure SRE Agent Hands-On Lab
+**Steps to Reproduce**
+1. Describe the action you took
+2. Mention the resource or Azure service (if involved)
+3. Describe what you expected vs. what happened
+4. Include any error messages from the incident or chat thread, ARM deployment error details, or HTTP status codes
-Deploy an Azure SRE Agent connected to a sample application with a single `azd up` command. Watch it diagnose and remediate issues autonomously.
+**Expected Behavior**
+What should happen?
-**Learn more:** [What is Azure SRE Agent?](https://sre.azure.com/docs/overview)
-
-## Architecture
-
-

- Lab Architecture -

- -## Prerequisites - -### Required Tools - -| Tool | macOS | Windows | -|------|-------|---------| -| [Azure CLI](https://learn.microsoft.com/cli/azure/install-azure-cli) 2.60+ | `brew install azure-cli` | `winget install Microsoft.AzureCLI` | -| [Azure Developer CLI](https://learn.microsoft.com/azure/developer/azure-developer-cli/install-azd) 1.9+ | `brew install azd` | `winget install Microsoft.Azd` | -| [Git](https://git-scm.com/) 2.x | `brew install git` | `winget install Git.Git` (includes Git Bash) | -| [Python](https://python.org) 3.10+ | `brew install python3` | `winget install Python.Python.3.12` | - -> **Windows note:** After installing Python, disable the Windows Store app aliases: -> **Settings β†’ Apps β†’ Advanced app settings β†’ App execution aliases** β†’ turn OFF `python.exe` and `python3.exe` - -### Azure Requirements - -- Active Azure subscription -- **Owner** role on the subscription (needed for RBAC role assignments) -- Register the resource provider: - ```bash - az provider register -n Microsoft.App --wait - ``` - -### Optional - -- GitHub account (for code search and issue triage scenarios β€” uses OAuth sign-in, or a [fine-grained PAT](https://github.com/settings/personal-access-tokens/new) scoped to your fork with `Contents:Read`, `Issues:Read+Write`, `Metadata:Read` for least-privilege access) - -## Quick Start - -### Check prerequisites - -Run the prereqs script to verify everything is installed: - -```bash -# macOS/Linux -bash scripts/prereqs.sh - -# Windows (Git Bash or CMD) -"C:\Program Files\Git\bin\bash.exe" scripts/prereqs.sh -``` - -### macOS / Linux - -```bash -# 1. Clone the repo -git clone https://github.com/dm-chelupati/sre-agent-lab.git -cd sre-agent-lab -git submodule update --init --recursive - -# 2. Sign in to Azure -az login -azd auth login - -# 3. Create environment and deploy -azd env new sre-lab -azd up -# Select your subscription and eastus2 as the region -``` - -### Windows - -```cmd -REM 1. 
Clone the repo (in CMD or PowerShell) -git clone https://github.com/dm-chelupati/sre-agent-lab.git -cd sre-agent-lab -git submodule update --init --recursive - -REM 2. Sign in to Azure -az login -azd auth login - -REM 3. Create environment and deploy -azd env new sre-lab -azd up - -REM If post-provision fails with 'bash not found' or 'Python not found': -set PATH=%PATH%;C:\Users\%USERNAME%\AppData\Local\Programs\Python\Python312 -"C:\Program Files\Git\bin\bash.exe" scripts/post-provision.sh -``` - -Deployment takes ~8-12 minutes. - -## What Gets Deployed - -### Azure Infrastructure (via Bicep) - -| Resource | Service | Purpose | Docs | -|----------|---------|---------|------| -| SRE Agent | `Microsoft.App/agents` | AI agent for incident investigation | [Overview](https://sre.azure.com/docs/overview) | -| Grubify API | Azure Container Apps | Sample app to monitor | | -| Grubify Frontend | Azure Container Apps | Sample app UI | | -| Log Analytics | `Microsoft.OperationalInsights` | Log storage for KQL queries | [Azure Observability](https://sre.azure.com/docs/capabilities/diagnose-azure-observability) | -| App Insights | `Microsoft.Insights` | Request tracing and exceptions | | -| Alert Rules | `Microsoft.Insights/metricAlerts` | HTTP 5xx and error log alerts | | -| Managed Identity | `Microsoft.ManagedIdentity` | Agent identity for Azure access | [Permissions](https://sre.azure.com/docs/tutorials/agent-config/manage-permissions) | -| Container Registry | `Microsoft.ContainerRegistry` | Grubify container images | | - -### RBAC Roles Assigned - -| Role | Scope | Purpose | -|------|-------|---------| -| SRE Agent Administrator | Agent resource | User can manage agent via data plane APIs | -| Reader | Resource group | Agent can read all resources | -| Monitoring Reader | Resource group | Agent can read metrics and alerts | -| Log Analytics Reader | Log Analytics workspace | Agent can query logs via KQL | - -See: [Manage 
Permissions](https://sre.azure.com/docs/tutorials/agent-config/manage-permissions) - -### SRE Agent Configuration (via post-provision script) - -| Component | Purpose | Docs | -|-----------|---------|------| -| Knowledge Base | HTTP error runbook, app architecture, incident template | [Memory & Knowledge](https://sre.azure.com/docs/concepts/memory) | -| incident-handler subagent | Investigates alerts using logs, metrics, runbooks | [Custom Agents](https://sre.azure.com/docs/concepts/subagents) | -| Response Plan | Routes HTTP 500 alerts to incident-handler | [Response Plans](https://sre.azure.com/docs/capabilities/incident-response-plans) | -| Azure Monitor | Incident platform β€” alerts flow to the agent | [Incident Platforms](https://sre.azure.com/docs/concepts/incident-platforms) | -| GitHub OAuth connector | Code search and issue management (optional) | [Connectors](https://sre.azure.com/docs/concepts/connectors) | -| code-analyzer subagent | Source code root cause analysis | [Custom Agents](https://sre.azure.com/docs/concepts/subagents) | -| issue-triager subagent | Automated issue triage from runbook | [Custom Agents](https://sre.azure.com/docs/concepts/subagents) | - -> **Note on GitHub tools:** GitHub OAuth tools (code search, issue management) are **built-in native tools**, not MCP tools. Once the GitHub OAuth connector is set up, all agents β€” including subagents β€” get access to GitHub tools automatically through global settings. No explicit `mcp_tools` assignment is needed in subagent YAML. This is different from MCP connector tools (Datadog, Splunk, etc.) which require explicit `mcp_tools` assignment. 
-| Scheduled Task | Triage customer issues every 12 hours | [Scheduled Tasks](https://sre.azure.com/docs/capabilities/scheduled-tasks) | -| Code Repo | Agent indexes the Grubify source code | [Deep Context](https://sre.azure.com/docs/concepts/workspace-tools) | - -## Post-Deployment - -### Re-run the setup script - -```bash -# Full re-run (rebuilds container images + re-uploads everything) -./scripts/post-provision.sh - -# Skip container image builds (just update KB, subagents, response plan) -./scripts/post-provision.sh --retry - -# Windows: run from CMD with Python in PATH -set PATH=%PATH%;C:\Users\%USERNAME%\AppData\Local\Programs\Python\Python312 -"C:\Program Files\Git\bin\bash.exe" scripts/post-provision.sh --retry -``` - -### Manual container deploy (Windows fallback) - -If the script deploys images but the app still shows the default page: - -```cmd -for /f "tokens=*" %a in ('azd env get-value AZURE_CONTAINER_REGISTRY_NAME') do set ACR=%a -for /f "tokens=*" %a in ('azd env get-value CONTAINER_APP_NAME') do set APP=%a -for /f "tokens=*" %a in ('azd env get-value FRONTEND_APP_NAME') do set FE=%a -az containerapp update --name %APP% --resource-group rg-sre-lab --image %ACR%.azurecr.io/grubify-api:latest -az containerapp update --name %FE% --resource-group rg-sre-lab --image %ACR%.azurecr.io/grubify-frontend:latest -``` - -## Verify Setup - -After deployment completes, open your agent at [sre.azure.com](https://sre.azure.com) and click **Full setup**. You should see green checkmarks on: - -| Card | Expected Status | -|------|----------------| -| **Code** | βœ… 1 repository | -| **Incidents** | βœ… Connected to Azure Monitor | -| **Azure resources** | βœ… 1 resource group added | -| **Knowledge files** | βœ… 1 file | - -> **Checkpoint:** If any card is missing a checkmark, re-run the post-provision script: `bash scripts/post-provision.sh --retry` - -Once verified, click **"Done and go to agent"** to open the agent chat and start the team onboarding conversation. 
- -### Team Onboarding - -The agent opens a **"Team onboarding"** thread automatically. It will: - -1. **Explore your connected context** β€” reads the code repository, Azure resources, and knowledge files you connected during setup -2. **Interview you about your team** β€” ask about your team structure, on-call rotation, services you own, and escalation paths - -Since the agent already has context from setup, try asking it questions: - -> *"What do you know about the Grubify app architecture?"* -> -> *"Summarize the HTTP errors runbook"* -> -> *"What Azure resources are in my resource group?"* - -The agent saves your team information to persistent memory and references it in every future investigation. - -> **Tip:** Ask *"What should I do next?"* for personalized recommendations based on what's connected. - -## Lab Scenarios - -### Scenario 1: IT Operations (No GitHub required) - -Break the app and watch the agent investigate: - -```bash -./scripts/break-app.sh # macOS/Linux -# Windows: "C:\Program Files\Git\bin\bash.exe" scripts/break-app.sh -``` - -Then open [sre.azure.com](https://sre.azure.com) β†’ Incidents to watch the agent: -1. Detect the Azure Monitor alert -2. Query Log Analytics for error patterns -3. Reference the HTTP errors runbook -4. Apply remediation (restart/scale) -5. Summarize with root cause and evidence - -### Scenario 2: Developer (Requires GitHub) - -Ask the agent to search source code for root causes: -- File:line references to problematic code -- Correlation of production errors to code changes -- Suggested fixes with before/after examples - -### Scenario 3: Workflow Automation (Requires GitHub) - -Create sample support issues and let the agent triage them: - -```bash -./scripts/create-sample-issues.sh -``` - -The agent classifies issues (Documentation, Bug, Feature Request), applies labels, and posts triage comments following the runbook. 
- -## Adding GitHub Later - -After initial setup, add GitHub by signing in via the OAuth URL: - -```bash -./scripts/setup-github.sh # macOS/Linux -# Windows: "C:\Program Files\Git\bin\bash.exe" scripts/setup-github.sh -``` - -> **Security tip:** The OAuth flow requests broad repo access. For least-privilege, -> use a [fine-grained PAT](https://github.com/settings/personal-access-tokens/new) -> scoped to your grubify fork only with permissions: `Contents:Read`, `Issues:Read+Write`, `Metadata:Read`. -> ```bash -> export GITHUB_PAT=github_pat_xxxx -> ./scripts/setup-github.sh -> ``` - -## Cleanup - -```bash -azd down --purge -``` - -## Troubleshooting - -| Issue | Fix | -|-------|-----| -| `'bash' is not recognized` (Windows) | Run via: `"C:\Program Files\Git\bin\bash.exe" scripts/post-provision.sh` | -| `Python was not found` (Windows) | Install: `winget install Python.Python.3.12`, disable App execution aliases | -| `curl: error encountered when reading a file` | Python isn't in Git Bash PATH: `export PATH="$PATH:/c/Users/$USER/AppData/Local/Programs/Python/Python312"` | -| `roleAssignments/write` denied | Need Owner role on subscription. Check: `az role assignment list --assignee $(az ad signed-in-user show --query id -o tsv)` | -| `Microsoft.App not registered` | Run: `az provider register -n Microsoft.App --wait` | -| Grubify shows default page after deploy | Run manual deploy commands (see Post-Deployment section above) | -| Post-provision 405 on response plan | Wait 30s and run: `./scripts/post-provision.sh --retry` | -| Agent can't create issues on forked repo | Forks have Issues disabled by default. 
Enable: repo Settings β†’ Features β†’ Issues βœ…, or run `gh api -X PATCH repos/OWNER/REPO -f has_issues=true` | - -## Regions - -SRE Agent is available in: `eastus2`, `swedencentral`, `australiaeast` - -## Links - -- [Azure SRE Agent Documentation](https://sre.azure.com/docs) -- [Getting Started Guide](https://sre.azure.com/docs/get-started/create-and-setup) -- [Connectors](https://sre.azure.com/docs/concepts/connectors) -- [Custom Agents](https://sre.azure.com/docs/concepts/subagents) -- [Incident Response](https://sre.azure.com/docs/capabilities/incident-response) -- [Azure Observability](https://sre.azure.com/docs/capabilities/diagnose-azure-observability) - -## License - -MIT +**Actual Behavior** +What actually happened diff --git a/labs/starter-lab/README.md b/labs/starter-lab/README.md index 0f3a19e18..123831f9a 100644 --- a/labs/starter-lab/README.md +++ b/labs/starter-lab/README.md @@ -1,6 +1,8 @@ -# Azure SRE Agent β€” Starter Lab +# Azure SRE Agent Hands-On Lab -Deploy an Azure SRE Agent, break a sample app, and watch it diagnose and fix the issue. **~40 minutes.** +Deploy an Azure SRE Agent connected to a sample application with a single `azd up` command. Watch it diagnose and remediate issues autonomously. + +**Learn more:** [What is Azure SRE Agent?](https://sre.azure.com/docs/overview) ## Architecture @@ -8,200 +10,235 @@ Deploy an Azure SRE Agent, break a sample app, and watch it diagnose and fix the Lab Architecture

-## What Gets Deployed - -| Resource | Purpose | -|----------|---------| -| **SRE Agent** | AI agent with managed identity, knowledge base, custom agents | -| **Grubify App** | Sample food ordering app (API + Frontend on Container Apps) | -| **Log Analytics + App Insights** | Monitoring and log storage | -| **Azure Monitor Alert** | HTTP 5xx alert β†’ auto-triggers agent investigation | -| **Container Registry** | Grubify container images | -| **Managed Identity** | Reader + Monitoring Reader + Log Analytics Reader RBAC | - -### SRE Agent Configuration - -| Component | Purpose | -|-----------|---------| -| **Knowledge Base** | HTTP error runbook, app architecture docs | -| **incident-handler** | Investigates using logs, KQL, runbooks | -| **code-analyzer** | Same + source code search, creates GitHub issues | -| **issue-triager** | Triages customer issues with labels and comments | -| **Response Plan** | Routes alerts to custom agents autonomously | -| **GitHub OAuth** | Code search + issue management (optional) | -| **Scheduled Task** | Triage issues every 12 hours (optional) | -| **Global Tools** | DevOps + Python plotting enabled | - -## Lab Scenarios - -| # | Scenario | Persona | GitHub Required? 
| -|---|----------|---------|:---:| -| 1 | **Break app β†’ Agent investigates logs + remediates** | IT Operations | No | -| 2 | **Same break β†’ Agent finds root cause in source code + creates GitHub issue** | Developer + IT | Yes | -| 3 | **Triage customer issues β†’ classify, label, comment** | Workflow Automation | Yes | - ## Prerequisites +### Required Tools + | Tool | macOS | Windows | |------|-------|---------| | [Azure CLI](https://learn.microsoft.com/cli/azure/install-azure-cli) 2.60+ | `brew install azure-cli` | `winget install Microsoft.AzureCLI` | | [Azure Developer CLI](https://learn.microsoft.com/azure/developer/azure-developer-cli/install-azd) 1.9+ | `brew install azd` | `winget install Microsoft.Azd` | -| [Git](https://git-scm.com/) 2.x | `brew install git` | `winget install Git.Git` | +| [Git](https://git-scm.com/) 2.x | `brew install git` | `winget install Git.Git` (includes Git Bash) | | [Python](https://python.org) 3.10+ | `brew install python3` | `winget install Python.Python.3.12` | -> **Windows:** After installing Python, disable Store aliases: **Settings β†’ Apps β†’ App execution aliases** β†’ turn OFF `python.exe` and `python3.exe` +> **Windows note:** After installing Python, disable the Windows Store app aliases: +> **Settings β†’ Apps β†’ Advanced app settings β†’ App execution aliases** β†’ turn OFF `python.exe` and `python3.exe` ### Azure Requirements -- Active Azure subscription with **Owner** role -- Register: `az provider register -n Microsoft.App --wait` +- Active Azure subscription +- **Owner** role on the subscription (needed for RBAC role assignments) +- Register the resource provider: + ```bash + az provider register -n Microsoft.App --wait + ``` ### Optional -- [GitHub account](https://github.com) β€” fork [dm-chelupati/grubify](https://github.com/dm-chelupati/grubify/fork) for Scenarios 2 & 3 +- GitHub account (for code search and issue triage scenarios β€” uses OAuth sign-in, no PAT needed) ## Quick Start -### One-Command 
Setup (Recommended) +### Check prerequisites -The `setup.sh` script handles everything: login, deploy, and configure. +Run the prereqs script to verify everything is installed: -**macOS / Linux:** ```bash -git clone https://github.com/microsoft/sre-agent.git -cd sre-agent/labs/starter-lab -bash scripts/setup.sh -``` +# macOS/Linux +bash scripts/prereqs.sh -**Windows:** -```cmd -git clone https://github.com/microsoft/sre-agent.git -cd sre-agent\labs\starter-lab -"C:\Program Files\Git\bin\bash.exe" scripts/setup.sh +# Windows (Git Bash or CMD) +"C:\Program Files\Git\bin\bash.exe" scripts/prereqs.sh ``` -The script will: -1. Check prerequisites -2. Sign in to Azure (`--use-device-code`) -3. Sign in to Azure Developer CLI -4. Register resource providers -5. Ask for GitHub username (optional) -6. Deploy infrastructure (~5-8 min) -7. Configure the SRE Agent +### macOS / Linux -### Manual Setup +```bash +# 1. Clone the repo +git clone https://github.com/dm-chelupati/sre-agent-lab.git +cd sre-agent-lab +git submodule update --init --recursive -If you prefer to run each step yourself: +# 2. Sign in to Azure +az login +azd auth login -```bash -az login --use-device-code -azd auth login --use-device-code -az provider register -n Microsoft.App --wait +# 3. Create environment and deploy +azd env new sre-lab +azd up +# Select your subscription and eastus2 as the region +``` + +### Windows +```cmd +REM 1. Clone the repo (in CMD or PowerShell) +git clone https://github.com/dm-chelupati/sre-agent-lab.git +cd sre-agent-lab +git submodule update --init --recursive + +REM 2. Sign in to Azure +az login +azd auth login + +REM 3. 
Create environment and deploy azd env new sre-lab -azd env set AZURE_LOCATION eastus2 -# Optional: azd env set GITHUB_USER azd up -bash scripts/post-provision.sh +REM If post-provision fails with 'bash not found' or 'Python not found': +set PATH=%PATH%;C:\Users\%USERNAME%\AppData\Local\Programs\Python\Python312 +"C:\Program Files\Git\bin\bash.exe" scripts/post-provision.sh ``` -## Verify Setup +Deployment takes ~8-12 minutes. -Open [sre.azure.com](https://sre.azure.com) β†’ Full Setup β†’ verify: -- **Code**: 1 repository (if GitHub connected) -- **Incidents**: Connected to Azure Monitor -- **Azure resources**: 1 resource group -- **Knowledge sources**: runbook files indexed +## What Gets Deployed -## Scenario 1: IT Operations (No GitHub) +### Azure Infrastructure (via Bicep) -Break the app and ask the agent to investigate using logs and knowledge base. +| Resource | Service | Purpose | Docs | +|----------|---------|---------|------| +| SRE Agent | `Microsoft.App/agents` | AI agent for incident investigation | [Overview](https://sre.azure.com/docs/overview) | +| Grubify API | Azure Container Apps | Sample app to monitor | | +| Grubify Frontend | Azure Container Apps | Sample app UI | | +| Log Analytics | `Microsoft.OperationalInsights` | Log storage for KQL queries | [Azure Observability](https://sre.azure.com/docs/capabilities/diagnose-azure-observability) | +| App Insights | `Microsoft.Insights` | Request tracing and exceptions | | +| Alert Rules | `Microsoft.Insights/metricAlerts` | HTTP 5xx and error log alerts | | +| Managed Identity | `Microsoft.ManagedIdentity` | Agent identity for Azure access | [Permissions](https://sre.azure.com/docs/tutorials/agent-config/manage-permissions) | +| Container Registry | `Microsoft.ContainerRegistry` | Grubify container images | | -```bash -# macOS/Linux -bash scripts/break-app.sh +### RBAC Roles Assigned -# Windows -"C:\Program Files\Git\bin\bash.exe" scripts/break-app.sh -``` +| Role | Scope | Purpose | 
+|------|-------|---------|
+| SRE Agent Administrator | Agent resource | User can manage agent via data plane APIs |
+| Reader | Resource group | Agent can read all resources |
+| Monitoring Reader | Resource group | Agent can read metrics and alerts |
+| Log Analytics Reader | Log Analytics workspace | Agent can query logs via KQL |
-1. Open the Grubify frontend - try adding to cart (it's broken!)
-2. Start a **new chat** → type `/` → select any custom agent
-3. Send:
-   ```
-   The Grubify API is not responding - specifically the "Add to Cart" is failing.
-   Can you investigate, find the root cause, and create a GitHub issue with your detailed findings?
-   ```
-4. Agent investigates: searches memory, queries KQL, references runbook, identifies memory leak
-5. Ask: `Can you mitigate this issue?`
-6. Verify recovery in browser
+See: [Manage Permissions](https://sre.azure.com/docs/tutorials/agent-config/manage-permissions)
-> **Automated Alert:** After 10-15 min, check **Activities → Incidents** - Azure Monitor may have fired an alert and the agent investigated autonomously.
+### SRE Agent Configuration (via post-provision script)
-## Scenario 2: Developer (Requires GitHub)
+| Component | Purpose | Docs |
+|-----------|---------|------|
+| Knowledge Base | HTTP error runbook, app architecture, incident template | [Memory & Knowledge](https://sre.azure.com/docs/concepts/memory) |
+| incident-handler subagent | Investigates alerts using logs, metrics, runbooks | [Custom Agents](https://sre.azure.com/docs/concepts/subagents) |
+| Response Plan | Routes HTTP 500 alerts to incident-handler | [Response Plans](https://sre.azure.com/docs/capabilities/incident-response-plans) |
+| Azure Monitor | Incident platform - alerts flow to the agent | [Incident Platforms](https://sre.azure.com/docs/concepts/incident-platforms) |
+| GitHub OAuth connector | Code search and issue management (optional) | [Connectors](https://sre.azure.com/docs/concepts/connectors) |
+| code-analyzer subagent | Source code root cause analysis | [Custom Agents](https://sre.azure.com/docs/concepts/subagents) |
+| issue-triager subagent | Automated issue triage from runbook | [Custom Agents](https://sre.azure.com/docs/concepts/subagents) |
-Same break as Scenario 1, but the agent also:
-- Searches Grubify source code for the root cause
-- Finds exact file:line causing the memory leak
-- Creates a GitHub issue with code references and fix suggestion
-- May create a PR with the fix
+> **Note on GitHub tools:** GitHub OAuth tools (code search, issue management) are **built-in native tools**, not MCP tools. Once the GitHub OAuth connector is set up, all agents, including subagents, get access to GitHub tools automatically through global settings. No explicit `mcp_tools` assignment is needed in subagent YAML. This is different from MCP connector tools (Datadog, Splunk, etc.) which require explicit `mcp_tools` assignment.
+| Scheduled Task | Triage customer issues every 12 hours | [Scheduled Tasks](https://sre.azure.com/docs/capabilities/scheduled-tasks) |
+| Code Repo | Agent indexes the Grubify source code | [Deep Context](https://sre.azure.com/docs/concepts/workspace-tools) |
-> If the agent can't create an issue, nudge it: `Use the GitHub API to create the issue if the direct tool isn't working`
+## Post-Deployment
-## Scenario 3: Workflow Automation (Requires GitHub)
+### Re-run the setup script
 ```bash
-# Create sample customer issues (uses gh CLI, no PAT needed)
-bash scripts/create-sample-issues.sh /grubify
+# Full re-run (rebuilds container images + re-uploads everything)
+./scripts/post-provision.sh
-# Or Windows:
-"C:\Program Files\Git\bin\bash.exe" scripts/create-sample-issues.sh /grubify
+# Skip container image builds (just update KB, subagents, response plan)
+./scripts/post-provision.sh --retry
+
+# Windows: run from CMD with Python in PATH
+set PATH=%PATH%;C:\Users\%USERNAME%\AppData\Local\Programs\Python\Python312
+"C:\Program Files\Git\bin\bash.exe" scripts/post-provision.sh --retry
 ```
-1. Go to **Builder → Scheduled tasks** → **triage-grubify-issues** → **Run task now**
-2. Check `github.com//grubify/issues` - each `[Customer Issue]` gets:
-   - Classification: Bug, Performance, Feature Request, Question
-   - Labels: `bug`, `api-bug`, `severity-high`, etc.
-   - Triage comment from the agent
+### Manual container deploy (Windows fallback)
-## Bonus Scenarios
+If the script deploys images but the app still shows the default page:
-### Ask the Agent Anything
+```cmd
+for /f "tokens=*" %a in ('azd env get-value AZURE_CONTAINER_REGISTRY_NAME') do set ACR=%a
+for /f "tokens=*" %a in ('azd env get-value CONTAINER_APP_NAME') do set APP=%a
+for /f "tokens=*" %a in ('azd env get-value FRONTEND_APP_NAME') do set FE=%a
+az containerapp update --name %APP% --resource-group rg-sre-lab --image %ACR%.azurecr.io/grubify-api:latest
+az containerapp update --name %FE% --resource-group rg-sre-lab --image %ACR%.azurecr.io/grubify-frontend:latest
+```
-Try these prompts in a new chat (no `/agent` needed - the meta agent handles these):
+## Verify Setup
-```
-What is the public endpoint URL for the Grubify frontend container app?
-```
+After deployment completes, open your agent at [sre.azure.com](https://sre.azure.com) and click **Full setup**. You should see green checkmarks on:
-```
-Show me the CPU and memory usage trends for the Grubify container app over the last hour
-```
+| Card | Expected Status |
+|------|----------------|
+| **Code** | ✅ 1 repository |
+| **Incidents** | ✅ Connected to Azure Monitor |
+| **Azure resources** | ✅ 1 resource group added |
+| **Knowledge files** | ✅ 1 file |
-```
-Check if there are any Azure Advisor recommendations for my resource group
-```
+> **Checkpoint:** If any card is missing a checkmark, re-run the post-provision script: `bash scripts/post-provision.sh --retry`
-```
-What recent changes were made to resources in my resource group? Check the Activity Log.
-```
+Once verified, click **"Done and go to agent"** to open the agent chat and start the team onboarding conversation.
-### Custom Prompts with Runbook
+### Team Onboarding
-```
-Using the http-500-errors runbook, walk me through all the diagnostic KQL queries
-and show me the results for the Grubify app
-```
+The agent opens a **"Team onboarding"** thread automatically. It will:
+
+1. **Explore your connected context** - reads the code repository, Azure resources, and knowledge files you connected during setup
+2. **Interview you about your team** - asks about your team structure, on-call rotation, services you own, and escalation paths
+
+Since the agent already has context from setup, try asking it questions:
+
+> *"What do you know about the Grubify app architecture?"*
+>
+> *"Summarize the HTTP errors runbook"*
+>
+> *"What Azure resources are in my resource group?"*
-### Team Memory
+The agent saves your team information to persistent memory and references it in every future investigation.
+> **Tip:** Ask *"What should I do next?"* for personalized recommendations based on what's connected.
+
+## Lab Scenarios
+
+### Scenario 1: IT Operations (No GitHub required)
+
+Break the app and watch the agent investigate:
+
+```bash
+./scripts/break-app.sh # macOS/Linux
+# Windows: "C:\Program Files\Git\bin\bash.exe" scripts/break-app.sh
 ```
-Remember that our on-call rotation is: Monday-Wednesday is Team Alpha,
-Thursday-Sunday is Team Beta. The escalation path is: on-call → team lead → VP Engineering.
+
+Then open [sre.azure.com](https://sre.azure.com) → Incidents to watch the agent:
+1. Detect the Azure Monitor alert
+2. Query Log Analytics for error patterns
+3. Reference the HTTP errors runbook
+4. Apply remediation (restart/scale)
+5. Summarize with root cause and evidence
+
+### Scenario 2: Developer (Requires GitHub)
+
+Ask the agent to search source code for root causes:
+- File:line references to problematic code
+- Correlation of production errors to code changes
+- Suggested fixes with before/after examples
+
+### Scenario 3: Workflow Automation (Requires GitHub)
+
+Create sample support issues and let the agent triage them:
+
+```bash
+./scripts/create-sample-issues.sh
 ```
-Then later ask: `Who is on call today?`
+The agent classifies issues (Documentation, Bug, Feature Request), applies labels, and posts triage comments following the runbook.
+
+## Adding GitHub Later
+
+After initial setup, add GitHub by signing in via the OAuth URL:
+
+```bash
+./scripts/setup-github.sh # macOS/Linux
+# Windows: "C:\Program Files\Git\bin\bash.exe" scripts/setup-github.sh
+```
 ## Cleanup
@@ -213,18 +250,27 @@ azd down --purge
 | Issue | Fix |
 |-------|-----|
-| Python not found (Windows) | Disable Store aliases, reopen CMD |
-| 405 on response plan | Wait 30s, run: `bash scripts/post-provision.sh --retry` |
-| GitHub issue creation fails | Nudge: "Use the GitHub API to create the issue" |
-| `az login` uses wrong account | Run `az logout` then `az login --use-device-code` |
-
-## Resources
-
-| Resource | Link |
-|:---------|:-----|
-| **SRE Agent Portal** | [sre.azure.com](https://sre.azure.com) |
-| **Documentation** | [sre.azure.com/docs](https://sre.azure.com/docs) |
-| **Blog** | [aka.ms/sreagent/blog](https://aka.ms/sreagent/blog) |
-| **Labs** | [aka.ms/sreagent/lab](https://aka.ms/sreagent/lab) |
-| **Pricing** | [aka.ms/sreagent/pricing](https://aka.ms/sreagent/pricing) |
-| **Support** | [aka.ms/sreagent/github](https://aka.ms/sreagent/github) |
+| `'bash' is not recognized` (Windows) | Run via: `"C:\Program Files\Git\bin\bash.exe" scripts/post-provision.sh` |
+| `Python was not found` (Windows) | Install: `winget install Python.Python.3.12`, disable App execution aliases |
+| `curl: error encountered when reading a file` | Python isn't in Git Bash PATH: `export PATH="$PATH:/c/Users/$USER/AppData/Local/Programs/Python/Python312"` |
+| `roleAssignments/write` denied | Need Owner role on subscription. Check: `az role assignment list --assignee $(az ad signed-in-user show --query id -o tsv)` |
+| `Microsoft.App not registered` | Run: `az provider register -n Microsoft.App --wait` |
+| Grubify shows default page after deploy | Run manual deploy commands (see Post-Deployment section above) |
+| Post-provision 405 on response plan | Wait 30s and run: `./scripts/post-provision.sh --retry` |
+
+## Regions
+
+SRE Agent is available in: `eastus2`, `swedencentral`, `australiaeast`
+
+## Links
+
+- [Azure SRE Agent Documentation](https://sre.azure.com/docs)
+- [Getting Started Guide](https://sre.azure.com/docs/get-started/create-and-setup)
+- [Connectors](https://sre.azure.com/docs/concepts/connectors)
+- [Custom Agents](https://sre.azure.com/docs/concepts/subagents)
+- [Incident Response](https://sre.azure.com/docs/capabilities/incident-response)
+- [Azure Observability](https://sre.azure.com/docs/capabilities/diagnose-azure-observability)
+
+## License
+
+MIT
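
The troubleshooting row for the 405 on the response plan suggests waiting 30s and re-running `post-provision.sh --retry`. A minimal sketch of how that wait-and-retry step might be wrapped is shown here; `retry_with_delay` is a hypothetical helper that is not part of the lab scripts, and the attempt count and delay are assumptions to adjust as needed.

```bash
#!/usr/bin/env bash
# Sketch only: retry_with_delay is a hypothetical helper, not shipped with the lab.
# Usage: retry_with_delay <max_attempts> <delay_seconds> <command> [args...]
retry_with_delay() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  # Re-run the command until it exits 0 or the attempt budget is spent.
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "failed after $n attempts: $*" >&2
      return 1
    fi
    echo "attempt $n failed; retrying in ${delay}s" >&2
    sleep "$delay"
    n=$((n + 1))
  done
}

# Example for the lab (assumes the repo root is the working directory):
# retry_with_delay 3 30 ./scripts/post-provision.sh --retry
```

The helper returns the wrapped command's success as soon as one attempt passes, so it could also wrap other flaky steps; it does not distinguish a 405 from any other failure, which would need inspection of the script's own output.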