
nullniverse/awesome-sre-tools

 
 


Awesome Site Reliability Engineering Tools

A curated list of Site Reliability and Production Engineering tools - Maintained by Raghu Chinnannan and Squadcast

Contents

Development

Source Code Management

Project Management & Issue Tracking Software

Bug / Defect Tracking Software

Code Editors and IDEs

Continuous Testing

Continuous Integration

Build

Integration

Continuous Delivery

Deployment

Infrastructure orchestration

Container

Container Registry

Container Orchestration

Continuous Monitoring

  • AWS CloudWatch
  • DebugBear
  • Prometheus
  • StackDriver
  • Sensu
  • Sentry
  • CopperEgg
  • Crashlytics
  • Kapacitor
  • Loggly
  • Logmatic
  • Logstash
  • MongoDB Atlas
  • MongoDB Cloud Manager
  • NewRelic
  • ReleaseRun Vulnerability Scanner
  • Papertrail
  • PageGuard - Free all-in-one website health scanner. Core Web Vitals, SEO, WCAG 2.1 accessibility, and best practices. AI-generated action plan. No signup required.
  • Pingdom
  • ServerDensity
  • Zabbix
  • InsightOps
  • AppSignal
  • API Status Check - Centralized dashboard tracking real-time status and outages for 1,000+ popular APIs and services (AWS, Stripe, GitHub, Twilio, etc.). Monitor third-party dependencies, get instant outage alerts, reduce MTTR.
  • Grafana
  • VictoriaMetrics
  • Chaos Genius
  • Cloud Waste Scanner - Detects cloud waste and helps DevOps/platform teams identify quick cloud cost optimization opportunities.
  • Thanos
  • Mimir
  • Hydrozen.io - Uptime monitoring and status pages
  • SSL Certificate Monitor - Open-source SSL/TLS certificate expiry monitoring tool with email alerts
  • DNS Propagation Checker - Open-source DNS propagation monitoring tool with global DNS server coverage
  • whatbroke.today - AI-powered outage aggregator tracking 100+ cloud services with Telegram alerts
  • Steampipe.io - Universal SQL interface to any cloud API
  • Better Stack
  • Netdata
  • DoctorGPT - Brings GPT into production for application log error monitoring
  • Dynatrace
  • Datadog
  • DevHelm - Developer-first uptime monitoring with HTTP, DNS, TCP, ICMP, and heartbeat checks, dependency intelligence for 80+ providers, hosted status pages, incident management, and a full developer surface (CLI, SDKs, Terraform provider, MCP server).
  • Elastic APM
  • Healthchecks.io
  • OnlineOrNot - Uptime monitoring for websites, APIs, and cron jobs, with integrated status pages.
  • Uptrack - Uptime monitoring with 30-second checks on free tier, consecutive-check alert confirmation to cut false positives, hosted status pages, and a built-in MCP server for AI agents.
  • Streamdal - Code-Native Data Privacy - embed privacy controls in your application code to detect and monitor PII.
  • Dash0 - OpenTelemetry Native Observability, built on CNCF Open Standards such as PromQL, Perses and OTLP with full cost control. Supporting Metrics, Traces and Logs with full custom dashboarding and alerting capabilities.
  • CICube - AI DevOps monitoring platform that monitors your CI workflows, detects anomalies, and provides actionable fixes.
  • Middleware - A Full-Stack Cloud Observability Platform designed to empower developers and organizations to monitor, optimize, and streamline their applications and infrastructure in real-time.
  • Shipfox - Boost GitHub Actions speed by 2x and cut costs by up to 75%, with smarter caching, deep CI insights, and zero-config setup.
  • Ingero - eBPF-based GPU causal observability agent. Traces CUDA APIs and host kernel events to build causal chains explaining GPU latency. Includes MCP server for AI-assisted incident investigation.
  • cloud-audit - AWS security auditing CLI that runs 17 checks across IAM, S3, EC2, VPC, and RDS with built-in remediation engine generating AWS CLI commands and Terraform snippets.
  • FlareWarden - Uptime, content, and dependency monitoring with multi-region verification, status pages, and incident management.
  • Phare - Shockingly good uptime monitoring, alerts, incident management, and status pages.
  • API Status Check - Real-time status monitoring dashboard for 250+ developer APIs including AWS, Stripe, GitHub, and OpenAI. Free, no signup required.
  • LynxDB - Lightweight columnar log analytics database for SRE workflows, with a pipe-style query language inspired by SPL for investigating production logs.
  • KubeStellar Console - Open-source multi-cluster Kubernetes dashboard with AI-powered operations, MCP server bridging kubeconfig to LLM agents, and real-time observability across edge and cloud clusters. CNCF Sandbox.
  • Apitally - API monitoring, analytics, and request logging for REST APIs, with lightweight open-source SDKs for Python, Node.js, Go, .NET, and Java.
  • Riftmap - Cross-repo infrastructure dependency discovery and change impact analysis for multi-repo environments using Terraform, Docker, Helm, and more.
  • Oack - HTTP monitoring with TCP kernel telemetry, 6-phase latency breakdown, Server-Timing header capture, Cloudflare CDN enrichment, and built-in incident management with on-call scheduling.
  • OpenClaw Monitor - Real-time AI agent monitoring dashboard for OpenClaw agents. Track Gateway status, sessions, token usage & trends.
  • agenttrace - TUI observability for AI coding agents. Track cost, tokens, tool failures, latency, anomalies, health, diffs, and CI gates across Claude Code, Codex CLI, Gemini CLI, Aider, and Cursor exports.

Incident Management / Incident Response / IT Alerting / On-Call

IT Service Management

Incident Communication

Internal Developer Portal

AI SRE Tools & SRE Copilots

  • Sherlocks.ai
  • Resolve.ai
  • Deductive.ai
  • Ingero - eBPF-based GPU causal observability agent. Traces CUDA APIs and host kernel events to build causal chains explaining GPU latency. Includes MCP server for AI-assisted incident investigation.
  • IncidentFox (open source)
  • metoro.io
  • Ops AI by Middleware
  • tailscale-mcp - MCP server with 52 tools for managing Tailscale tailnets from AI assistants like Claude Code and Cursor.
  • KubeStellar Console - AI-powered multi-cluster Kubernetes management console with MCP server (kc-agent) for AI-assisted cluster operations, pod inspection, deployment management, and real-time observability across distributed environments.

Related Lists

Stargazers over time


Licence

Shield: CC BY 4.0

This work is licensed under a Creative Commons Attribution 4.0 International License.

