bukx/observability-platform

Observability Platform

Full-stack observability and incident response platform built around metrics, logs, traces, alerting, runbooks, and chaos validation. The repo demonstrates how to move from passive monitoring to an active reliability practice with clearer signals and faster response workflows.

Architecture Diagram

Why this repo matters

Many monitoring demos stop at dashboards. This project goes further by combining the three pillars of observability (metrics, logs, and traces) with SLI/SLO thinking, alert routing, runbooks, and fault injection.

What is included

  • instrumented application source under app/
  • Prometheus, Alertmanager, Grafana, ELK, and OpenTelemetry deployment assets
  • Grafana dashboards and alerting rules
  • chaos scripts for latency and error scenarios
  • runbooks and postmortem templates for operational response
  • local Docker assets plus Kubernetes manifests

Observability scope

  • Metrics: Prometheus and Grafana for collection, dashboards, and alerting
  • Logs: ELK stack for centralized log aggregation and search
  • Traces: OpenTelemetry with a Jaeger-style tracing pipeline
  • Response: Alertmanager, runbooks, and postmortem templates
  • Validation: chaos scripts to test the system under failure conditions

Quick start

# Deploy the stack
kubectl apply -f k8s/app/
kubectl apply -f k8s/prometheus/
kubectl apply -f k8s/grafana/
kubectl apply -f k8s/elk/
kubectl apply -f k8s/otel-collector/

# Inject faults and observe behavior
./chaos-scripts/chaos-runner.sh latency
./chaos-scripts/chaos-runner.sh errors
./chaos-scripts/chaos-runner.sh reset
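
When running the fault scenarios above by hand, it is easy to forget the final reset. The sketch below wraps one scenario so that reset is always called afterwards, even if the runner fails. Only the latency, errors, and reset subcommands come from the commands above; the `run_scenario` function, the `CHAOS_RUNNER` variable, and the duration argument are hypothetical names introduced for illustration, and the real chaos-runner.sh may behave differently.

```shell
# Hypothetical wrapper: run one chaos scenario for a fixed window, then reset.
# CHAOS_RUNNER and the duration argument are assumptions, not part of the repo.
CHAOS_RUNNER="${CHAOS_RUNNER:-./chaos-scripts/chaos-runner.sh}"

# run_scenario SCENARIO [DURATION_SECONDS]
# The function body is a subshell, so the EXIT trap fires when it finishes.
run_scenario() (
  scenario="${1:-}"
  duration="${2:-60}"   # seconds to leave the fault active

  # Accept only the fault scenarios named in the Quick start.
  case "$scenario" in
    latency|errors) ;;
    *) echo "unknown scenario: $scenario" >&2; exit 2 ;;
  esac

  # Call reset on the way out, whether or not the injection succeeded.
  trap '"$CHAOS_RUNNER" reset' EXIT

  # Inject the fault, then hold it for the observation window.
  "$CHAOS_RUNNER" "$scenario" && sleep "$duration"
)
```

After sourcing the snippet, a call such as `run_scenario latency 120` would inject latency for two minutes and then reset. Watching the dashboards and alerts during that window is what ties the chaos scripts back to the rest of the stack.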

Repository layout

.
|-- app/                  # instrumented application
|-- chaos-scripts/        # fault injection scripts
|-- dashboards/           # Grafana dashboard assets
|-- docker/               # local container assets
|-- k8s/                  # Kubernetes deployment manifests
|-- postmortem-templates/ # incident review templates
|-- runbooks/             # operational runbooks
|-- docs/                 # diagrams and supporting docs
`-- .github/              # validation workflows

What this demonstrates

  • end-to-end observability design across metrics, logs, and traces
  • operational maturity through alerting, runbooks, and postmortems
  • SLI/SLO-oriented monitoring instead of dashboard sprawl
  • validation of reliability assumptions through controlled chaos testing
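
The SLI/SLO point above can be made concrete with error-budget arithmetic. The sketch below computes a burn rate: the observed error ratio divided by the error ratio the SLO allows. A value of 1 means the budget is being consumed exactly at the sustainable pace, and larger values mean it will run out early. The `burn_rate` function and the sample numbers are illustrative assumptions; the repo's actual SLO targets and alerting rules are not shown here.

```shell
# Hypothetical helper: burn rate = observed error ratio / allowed error ratio.
# Usage: burn_rate ERRORS TOTAL_REQUESTS SLO_PERCENT
burn_rate() {
  awk -v e="$1" -v t="$2" -v slo="$3" 'BEGIN {
    # Guard against an empty window or an SLO that leaves no budget.
    if (t <= 0 || slo >= 100) { print "n/a"; exit 1 }
    budget = 1 - slo / 100            # allowed error ratio, e.g. 0.001 for 99.9
    printf "%.2f\n", (e / t) / budget
  }'
}

# Example with made-up numbers: 50 failed requests out of 10000, 99.9% SLO.
burn_rate 50 10000 99.9   # prints 5.00
```

A burn rate of 5.00 means the service is spending its error budget five times faster than the SLO allows. Alerting on a ratio like this, rather than on raw error counts, is one common way to keep alerts tied to the SLO.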
