Full-stack observability and incident response platform built around metrics, logs, traces, alerting, runbooks, and chaos validation. The repo demonstrates how to move from passive monitoring to an active reliability practice with clearer signals and faster response workflows.
Many monitoring demos stop at dashboards. This project goes further by tying together the three pillars of observability (metrics, logs, and traces) with SLI/SLO thinking, alert routing, runbooks, and fault injection.
- instrumented application source under app/
- Prometheus, Alertmanager, Grafana, ELK, and OpenTelemetry deployment assets
- Grafana dashboards and alerting rules
- chaos scripts for latency and error scenarios
- runbooks and postmortem templates for operational response
- local Docker assets plus Kubernetes manifests
- Metrics: Prometheus and Grafana for collection, dashboards, and alerting
- Logs: ELK stack for centralized log aggregation and search
- Traces: OpenTelemetry with a Jaeger-style tracing pipeline
- Response: Alertmanager, runbooks, and postmortem templates
- Validation: chaos scripts to test the system under failure conditions
# Deploy the stack
kubectl apply -f k8s/app/
kubectl apply -f k8s/prometheus/
kubectl apply -f k8s/grafana/
kubectl apply -f k8s/elk/
kubectl apply -f k8s/otel-collector/
# Inject faults and observe behavior
./chaos-scripts/chaos-runner.sh latency
./chaos-scripts/chaos-runner.sh errors
./chaos-scripts/chaos-runner.sh reset
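A fault-injection run is easier to trust when the reset step always happens, even if the injection fails partway. The sketch below wraps the `latency`, `errors`, and `reset` subcommands shown above in a small bash function. The names `run_cmd` and `run_experiment` and the variables `CHAOS_RUNNER`, `OBSERVE_SECONDS`, and `DRY_RUN` are assumptions made for illustration and are not part of the repo.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around chaos-runner.sh. All names here are illustrative.

CHAOS_RUNNER="${CHAOS_RUNNER:-./chaos-scripts/chaos-runner.sh}"
OBSERVE_SECONDS="${OBSERVE_SECONDS:-60}"

# Run a command, or only print it when DRY_RUN=1.
run_cmd() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Inject one scenario, hold it for an observation window, then reset.
run_experiment() {
  local scenario="${1:-}"
  case "$scenario" in
    latency|errors) ;;
    *) echo "unknown scenario: $scenario" >&2; return 2 ;;
  esac

  (
    # The EXIT trap fires when this subshell ends, so the reset still runs
    # if the injection step fails and the subshell exits early.
    trap 'run_cmd "$CHAOS_RUNNER" reset' EXIT
    run_cmd "$CHAOS_RUNNER" "$scenario" || exit 1
    run_cmd sleep "$OBSERVE_SECONDS"
  )
}
```

Calling `DRY_RUN=1 run_experiment latency` prints the three steps without executing them, which is a cheap way to check the sequence before pointing it at a real cluster.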
|-- app/ # instrumented application
|-- chaos-scripts/ # fault injection scripts
|-- dashboards/ # Grafana dashboard assets
|-- docker/ # local container assets
|-- k8s/ # Kubernetes deployment manifests
|-- postmortem-templates/ # incident review templates
|-- runbooks/ # operational runbooks
|-- docs/ # diagrams and supporting docs
`-- .github/ # validation workflows
- end-to-end observability design across metrics, logs, and traces
- operational maturity through alerting, runbooks, and postmortems
- SLI/SLO-oriented monitoring instead of dashboard sprawl
- validation of reliability assumptions through controlled chaos testing
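
SLI/SLO-oriented monitoring usually comes down to asking how quickly the error budget is being spent rather than watching raw graphs. As a minimal sketch of that idea, the function below computes a burn rate, the observed error ratio divided by the error ratio the SLO allows. The function name `burn_rate`, the 99.9% target, and the request counts are assumptions for illustration; the repo's actual SLIs and alert rules may be defined differently.

```bash
#!/usr/bin/env bash
# burn_rate TOTAL ERRORS SLO_TARGET
# Prints (errors / total) / (1 - SLO_TARGET) with two decimals.
# A value of 1.00 means the budget is being spent exactly as fast as the
# SLO allows; higher values mean it will run out before the window ends.
burn_rate() {
  awk -v total="$1" -v errors="$2" -v slo="$3" 'BEGIN {
    if (total <= 0 || slo >= 1) {
      print "invalid input" > "/dev/stderr"
      exit 1
    }
    printf "%.2f\n", (errors / total) / (1 - slo)
  }'
}

# Hypothetical numbers: 10000 requests, 50 failures, 99.9% availability SLO.
burn_rate 10000 50 0.999   # prints 5.00
```

With these example numbers the service fails 0.5% of requests against an allowance of 0.1%, so the budget is being consumed five times faster than the target permits. Running a chaos scenario and checking that this figure moves is one way to confirm that alerts key off the right signal.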
