This document covers Phase 10 (kube-prometheus-stack), Phase 11/12 (upgrades), and Phase 15 (Loki + Alloy + Tempo + OpenTelemetry).
- Overview
- Phase 10: kube-prometheus-stack Deployment
- Phase 11/12: Upgrade to Latest Versions
- Phase 15: Logs — Loki + Alloy
- Phase 15 (suite): Traces — Tempo + OpenTelemetry
- Architecture
- Components
- Deployment Workflow
- Verification
- Troubleshooting
The Lumen project uses the kube-prometheus-stack Helm chart for production-grade observability in the airgap environment. It replaces the previous manual Prometheus/Grafana deployment with a complete, operator-managed monitoring solution.
- Industry standard: widely used for production Kubernetes monitoring
- Batteries included: 40+ pre-configured Grafana dashboards
- Operator pattern: ServiceMonitor/PrometheusRule CRDs vs manual ConfigMaps
- Complete metrics: Node Exporter + kube-state-metrics included
- Production-ready alerts: 100+ alert rules out of the box
- Helm management: Easy upgrades and rollbacks
Component Versions:
- Helm Chart: v55.0.0
- Prometheus: v2.48.0
- Grafana: 10.2.2
- AlertManager: v0.26.0
- Prometheus Operator: v0.68.0
- Node Exporter: v1.7.0
- kube-state-metrics: v2.10.1
Script: 01-connected-zone/scripts/08-pull-kube-prometheus-stack.sh
#!/bin/bash
set -e
PROMETHEUS_VERSION="v2.48.0"
ALERTMANAGER_VERSION="v0.26.0"
GRAFANA_VERSION="10.2.2"
PROMETHEUS_OPERATOR_VERSION="v0.68.0"
NODE_EXPORTER_VERSION="v1.7.0"
KUBE_STATE_METRICS_VERSION="v2.10.1"
HELM_CHART_VERSION="55.0.0"
# Download Helm chart
helm pull prometheus-community/kube-prometheus-stack --version ${HELM_CHART_VERSION}
# Pull all component images
docker pull quay.io/prometheus/prometheus:${PROMETHEUS_VERSION}
docker pull quay.io/prometheus/alertmanager:${ALERTMANAGER_VERSION}
docker pull docker.io/grafana/grafana:${GRAFANA_VERSION}
# ... (see script for full list)
# Save images to tar archives
docker save quay.io/prometheus/prometheus:${PROMETHEUS_VERSION} -o artifacts/prometheus.tar
# ... (see script for full list)

Output:
- artifacts/kube-prometheus-stack/images/ - 8 tar files (one per component)
- artifacts/kube-prometheus-stack/helm/ - Helm chart tarball
- artifacts/kube-prometheus-stack/images.txt - list of images with registry paths
Script: 02-transit-zone/push-kube-prometheus-stack.sh
Pushes all images to localhost:5000 registry for airgap deployment.
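The push script is not reproduced here. A minimal sketch of the flow it implements (load the saved tars, retag for the local registry, push), with file names and the example tag as assumptions, might look like:

```bash
#!/bin/bash
# Sketch only: the real push-kube-prometheus-stack.sh covers all eight images.
set -e
REGISTRY="localhost:5000"

# Load every saved image archive from the connected-zone artifacts
for tar in artifacts/kube-prometheus-stack/images/*.tar; do
  docker load -i "${tar}"
done

# Retag and push one component as an example; the script repeats this per image
docker tag quay.io/prometheus/prometheus:v2.48.0 "${REGISTRY}/prometheus/prometheus:v2.48.0"
docker push "${REGISTRY}/prometheus/prometheus:v2.48.0"
```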
File: 03-airgap-zone/manifests/kube-prometheus-stack-helm/values-airgap-override.yaml
Key configurations:
# Global settings
global:
imageRegistry: localhost:5000
# Prometheus
prometheus:
prometheusSpec:
image:
registry: localhost:5000
repository: prometheus/prometheus
tag: v2.48.0
# Match ALL ServiceMonitors
serviceMonitorSelector: {}
podMonitorSelector: {}
retention: 15d
resources:
requests: {cpu: 200m, memory: 512Mi}
limits: {cpu: 1000m, memory: 2Gi}
# Grafana
grafana:
image:
registry: localhost:5000
repository: grafana/grafana
tag: "10.2.2"
adminPassword: admin
# Sidecar for auto-reload of dashboards
sidecar:
dashboards:
enabled: true
datasources:
enabled: true
# Node Exporter (hardware metrics)
nodeExporter:
enabled: true
image:
registry: localhost:5000
# kube-state-metrics (K8s object metrics)
kubeStateMetrics:
enabled: true
image:
    registry: localhost:5000

cd 03-airgap-zone
# Deploy Helm chart
helm install kube-prometheus-stack ./manifests/kube-prometheus-stack-helm \
-n monitoring \
--create-namespace \
-f manifests/kube-prometheus-stack-helm/values-airgap-override.yaml \
  --wait

Created custom ServiceMonitors for:
- Lumen API (manifests/kube-prometheus-stack/servicemonitors/lumen-api.yaml)
- Traefik (manifests/kube-prometheus-stack/servicemonitors/traefik.yaml)
- Gitea (manifests/kube-prometheus-stack/servicemonitors/gitea.yaml)
- ArgoCD (manifests/kube-prometheus-stack/servicemonitors/argocd.yaml)
Example (Lumen API):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: lumen-api
namespace: lumen
labels:
release: kube-prometheus-stack # Critical for discovery
spec:
selector:
matchLabels:
app: lumen-api
endpoints:
- port: http
path: /metrics
      interval: 30s

File: manifests/kube-prometheus-stack/dashboards/lumen-api-dashboard.yaml
Dashboard as ConfigMap with auto-discovery:
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboard-lumen-api
namespace: monitoring
labels:
grafana_dashboard: "1" # Critical: tells Grafana to auto-load
data:
lumen-api-dashboard.json: |
{
"title": "Lumen API - Airgap Monitoring",
"panels": [
{
"title": "HTTP Requests Total",
"targets": [{
"expr": "http_requests_total{job=\"lumen-api\"}",
"legendFormat": "{{exported_endpoint}} - {{method}} - {{status}}"
}]
},
...
]
    }

Key metrics tracked:
- HTTP Requests Total
- Request Rate (per second)
- Total /hello Requests (gauge)
- Go Runtime Metrics (goroutines, threads)
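These panels boil down to a handful of PromQL expressions. The following are illustrative queries against the metric and label names shown in the dashboard JSON above (plus the Go client's standard runtime metrics); the exact dashboard expressions may differ:

```promql
# All HTTP requests served by lumen-api
http_requests_total{job="lumen-api"}

# Request rate per second over the last 5 minutes
rate(http_requests_total{job="lumen-api"}[5m])

# Requests hitting /hello (label name taken from the dashboard's legendFormat)
sum(http_requests_total{job="lumen-api", exported_endpoint="/hello"})

# Go runtime health
go_goroutines{job="lumen-api"}
go_threads{job="lumen-api"}
```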
File: 03-airgap-zone/manifests/argocd/08-application-kube-prometheus.yaml
ArgoCD Application for Helm Chart:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: kube-prometheus-stack
namespace: argocd
spec:
project: default
source:
repoURL: http://gitea.gitea.svc.cluster.local:3000/lumen/lumen.git
targetRevision: HEAD
path: 03-airgap-zone/manifests/kube-prometheus-stack-helm
helm:
releaseName: kube-prometheus-stack
valueFiles:
- values-airgap-override.yaml
destination:
server: https://kubernetes.default.svc
namespace: monitoring
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=false
      - ServerSideApply=true

Motivation: Upgrade all components to latest stable versions for security patches, performance improvements, and new features.
| Component | Old Version | New Version | Status |
|---|---|---|---|
| Helm Chart | v55.0.0 | v69.0.0 | ⬆️ Major upgrade |
| Prometheus | v2.48.0 | v3.5.1 | ⬆️ Major version (2→3) |
| Grafana | 10.2.2 | 12.4.0 | ⬆️ 2 major versions |
| AlertManager | v0.26.0 | v0.31.1 | ⬆️ 5 versions |
| Prometheus Operator | v0.68.0 | v0.78.2 | ⬆️ 10 versions |
| Node Exporter | v1.7.0 | v1.8.2 | ⬆️ Minor |
| kube-state-metrics | v2.10.1 | v2.14.0 | ⬆️ Minor |
| Grafana Sidecar | 1.25.2 | 1.30.1 | ⬆️ Minor |
Full comparison: See docs/VERSION-COMPARISON.md
Updated 01-connected-zone/scripts/08-pull-kube-prometheus-stack.sh:
PROMETHEUS_VERSION="v3.5.1" # Was: v2.48.0
GRAFANA_VERSION="12.4.0-22046043985" # Was: 10.2.2 (includes build number)
ALERTMANAGER_VERSION="v0.31.1" # Was: v0.26.0
PROMETHEUS_OPERATOR_VERSION="v0.78.2" # Was: v0.68.0
HELM_CHART_VERSION="69.0.0" # Was: 55.0.0# Connected Zone
cd 01-connected-zone
./scripts/08-pull-kube-prometheus-stack.sh
# Transit Zone
cd ../02-transit-zone
./push-kube-prometheus-stack.sh

Updated values-airgap-override.yaml with new image tags.
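The override changes mirror the new version pins, for example (excerpt; the full file also updates AlertManager, the Operator, and the exporters):

```yaml
prometheus:
  prometheusSpec:
    image:
      registry: localhost:5000
      repository: prometheus/prometheus
      tag: v3.5.1

grafana:
  image:
    registry: localhost:5000
    repository: grafana/grafana
    tag: "12.4.0-22046043985"
```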
Critical: Prometheus 3.x requires updated CRDs before Helm upgrade.
# Apply CRDs with server-side flag
kubectl apply --server-side \
-f 03-airgap-zone/manifests/kube-prometheus-stack-helm/charts/crds/crds/ \
  --force-conflicts

helm upgrade kube-prometheus-stack ./manifests/kube-prometheus-stack-helm \
-n monitoring \
-f manifests/kube-prometheus-stack-helm/values-airgap-override.yaml \
  --wait

# Check pod images
kubectl get pods -n monitoring -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
# Expected output:
# prometheus-xxx localhost:5000/prometheus/prometheus:v3.5.1
# grafana-xxx localhost:5000/grafana/grafana:12.4.0-22046043985
# alertmanager-xxx    localhost:5000/prometheus/alertmanager:v0.31.1

Key changes in Prometheus 3.x:
- TSDB format changes (automatic migration on first startup)
- Some deprecated flags removed
- PromQL behavior improvements
Documentation: Prometheus 3.0 Announcement
Key changes in Grafana 12:
- Dashboard JSON schema updates (auto-migrated)
- Enhanced dashboards UI
- Improved query performance
- Better RBAC
Documentation: Grafana v12.0 Release Notes
| Component | Old Version | New Version |
|---|---|---|
| ArgoCD | v2.12.3 | v3.2.0 |
| Dex | v2.38.0 | v2.41.1 |
| Redis | 7.0.15-alpine | 7.2.6-alpine |
Script: 01-connected-zone/scripts/09-pull-argocd.sh
ARGOCD_VERSION="v3.2.0"
DEX_VERSION="v2.41.1"
REDIS_VERSION="7.2.6-alpine"
docker pull quay.io/argoproj/argocd:${ARGOCD_VERSION}
docker pull ghcr.io/dexidp/dex:${DEX_VERSION}
docker pull docker.io/library/redis:${REDIS_VERSION}

Script: 02-transit-zone/push-argocd.sh
curl -sL https://raw.githubusercontent.com/argoproj/argo-cd/v3.2.0/manifests/install.yaml \
  -o /tmp/argocd-v3.2.0.yaml

sed -e 's|quay.io/argoproj/argocd:v3.2.0|localhost:5000/argoproj/argocd:v3.2.0|g' \
-e 's|ghcr.io/dexidp/dex:v2.41.1|localhost:5000/dexidp/dex:v2.41.1|g' \
-e 's|redis:7.2.6-alpine|localhost:5000/redis:7.2.6-alpine|g' \
  /tmp/argocd-v3.2.0.yaml > 03-airgap-zone/manifests/argocd/02-install-airgap.yaml

# Apply to argocd namespace (important: use -n flag)
kubectl apply -n argocd -f 03-airgap-zone/manifests/argocd/02-install-airgap.yaml

Critical: ArgoCD v3.2.0 requires explicit insecure mode when behind TLS termination (Traefik).
kubectl patch configmap argocd-cmd-params-cm -n argocd \
--type merge \
-p '{"data":{"server.insecure":"true"}}'
kubectl rollout restart deployment argocd-server -n argocd

Why needed: ArgoCD v3+ enforces TLS by default. When Traefik handles TLS termination, ArgoCD must run in insecure mode to avoid redirect loops.
kubectl apply -f 03-airgap-zone/manifests/argocd/04-application-lumen.yaml
kubectl apply -f 03-airgap-zone/manifests/argocd/06-application-network-policies.yaml
kubectl apply -f 03-airgap-zone/manifests/argocd/07-application-traefik.yaml
kubectl apply -f 03-airgap-zone/manifests/argocd/08-application-kube-prometheus.yaml

Major changes in ArgoCD v3:
- TLS enforcement by default
- New UI improvements
- Enhanced RBAC features
- Better performance for large repos
End of Life: ArgoCD v3.0 reached EOL on February 2, 2026.
┌─────────────────────────────────────────────────────────┐
│ kube-prometheus-stack │
│ (Helm Chart) │
├─────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Prometheus │ │ Grafana │ │ AlertManager │ │
│ │ Operator │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ │ │ │ │
│ ┌──────▼──────────────────▼──────────────────▼──────┐ │
│ │ Prometheus Server (v3.5.1) │ │
│ │ - Scrapes metrics from ServiceMonitors │ │
│ │ - Stores TSDB (15 days retention) │ │
│ │ - Evaluates PrometheusRules │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ │ │
│ ┌──────▼──────────────────────────────────────────┐ │
│ │ Metrics Sources (ServiceMonitors) │ │
│ ├──────────────────────────────────────────────────┤ │
│ │ • Node Exporter (hardware metrics) │ │
│ │ • kube-state-metrics (K8s object metrics) │ │
│ │ • Lumen API (/metrics endpoint) │ │
│ │ • Traefik (proxy metrics) │ │
│ │ • Gitea (Git server metrics) │ │
│ │ • ArgoCD (GitOps metrics) │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
| Feature | Manual Deployment | kube-prometheus-stack |
|---|---|---|
| Deployment | Static YAML manifests | Helm chart |
| Configuration | ConfigMap scrape_configs | ServiceMonitor CRDs |
| Dashboards | 1 custom | 40+ pre-configured |
| Node Metrics | ❌ None | ✅ Node Exporter |
| K8s Metrics | ❌ None | ✅ kube-state-metrics |
| Alert Rules | 3 basic | 100+ production-ready |
| HA | Single replica | Multi-replica ready |
| Operator | ❌ None | ✅ Prometheus Operator |
| Upgrades | Manual kubectl apply | helm upgrade |
Purpose: Manages Prometheus instances via CRDs.
CRDs:
- Prometheus - defines Prometheus server instances
- ServiceMonitor - defines targets to scrape
- PrometheusRule - defines alert/recording rules
- Alertmanager - defines AlertManager instances
Benefits:
- Declarative configuration via CRDs
- Automatic config reload
- Namespace isolation
- Dynamic discovery
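For illustration, a minimal PrometheusRule following the same discovery convention as the ServiceMonitors (the alert name and threshold below are examples, not rules from the repo):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: lumen-api-alerts
  namespace: lumen
  labels:
    release: kube-prometheus-stack   # same discovery label used by the ServiceMonitors
spec:
  groups:
    - name: lumen-api
      rules:
        - alert: LumenApiDown
          expr: up{job="lumen-api"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "lumen-api scrape target has been down for 5 minutes"
```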
Features:
- Long-Term Support (LTS) release
- Improved cardinality management
- Better performance
- TSDB v3 format
Configuration:
- Retention: 15 days
- Storage: emptyDir (ephemeral)
- Resources: 512Mi RAM request, 2Gi limit
Features:
- 40+ pre-configured dashboards
- Auto-discovery of dashboards via sidecar
- Enhanced UI
- Better RBAC
Access:
- URL: https://grafana.airgap.local
- Username: admin
- Password: admin
Dashboards:
- Kubernetes / Compute Resources / Cluster
- Kubernetes / Compute Resources / Namespace
- Node Exporter / Nodes
- Prometheus / Overview
- Custom: Lumen API Dashboard
Purpose: Hardware and OS-level metrics.
Metrics:
- CPU usage per core
- Memory usage
- Disk I/O
- Network traffic
- Filesystem usage
Deployment: DaemonSet (one pod per node)
Purpose: Kubernetes object state metrics.
Metrics:
- Pod count/status
- Deployment status
- Node status
- PersistentVolumeClaim usage
- ConfigMap/Secret count
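Illustrative PromQL against the Node Exporter and kube-state-metrics series listed above (standard metric names for these exporters; not queries taken from the bundled dashboards):

```promql
# Node Exporter: CPU utilisation per node (percent)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Node Exporter: available filesystem space on the root mount
node_filesystem_avail_bytes{mountpoint="/"}

# kube-state-metrics: pods not in the Running phase, per namespace
sum by (namespace) (kube_pod_status_phase{phase!="Running"})

# kube-state-metrics: deployments with unavailable replicas
kube_deployment_status_replicas_unavailable > 0
```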
Purpose: Alert routing and notification.
Configuration:
- No external receivers (airgap)
- Internal routing only
- De-duplication
- Grouping
Access: https://alertmanager.airgap.local
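In an airgap cluster with no outbound notification channel, the effective AlertManager configuration reduces to grouping and de-duplication in front of a null receiver. A sketch of what that looks like (illustrative; the chart renders its own default config):

```yaml
route:
  receiver: "null"                 # no external receivers in the airgap
  group_by: ["alertname", "namespace"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
receivers:
  - name: "null"
```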
# 1. Connected Zone - Download all artifacts
cd 01-connected-zone
./scripts/08-pull-kube-prometheus-stack.sh
./scripts/09-pull-argocd.sh
# 2. Transit Zone - Push to registry
cd ../02-transit-zone
./setup.sh # Ensure registry running
./push-kube-prometheus-stack.sh
./push-argocd.sh
# 3. Airgap Zone - Deploy monitoring
cd ../03-airgap-zone
# Deploy kube-prometheus-stack
helm install kube-prometheus-stack ./manifests/kube-prometheus-stack-helm \
-n monitoring \
--create-namespace \
-f manifests/kube-prometheus-stack-helm/values-airgap-override.yaml
# Deploy custom ServiceMonitors
kubectl apply -f manifests/kube-prometheus-stack/servicemonitors/
# Deploy custom Grafana dashboards
kubectl apply -f manifests/kube-prometheus-stack/dashboards/
# Deploy ArgoCD
kubectl apply -n argocd -f manifests/argocd/02-install-airgap.yaml
kubectl patch configmap argocd-cmd-params-cm -n argocd \
--type merge -p '{"data":{"server.insecure":"true"}}'
kubectl rollout restart deployment argocd-server -n argocd
# Deploy ArgoCD Applications
kubectl apply -f manifests/argocd/04-application-lumen.yaml
kubectl apply -f manifests/argocd/06-application-network-policies.yaml
kubectl apply -f manifests/argocd/07-application-traefik.yaml
kubectl apply -f manifests/argocd/08-application-kube-prometheus.yaml

helm list -n monitoring
# Expected output:
# NAME NAMESPACE REVISION STATUS CHART APP VERSION
# kube-prometheus-stack   monitoring   1          deployed   kube-prometheus-stack-69.0.0   v0.78.2

kubectl get pods -n monitoring
# Expected pods (all Running):
# - alertmanager-xxx (2/2)
# - grafana-xxx (3/3)
# - prometheus-operator-xxx (1/1)
# - prometheus-kube-prometheus-stack-prometheus-0 (2/2)
# - kube-state-metrics-xxx (1/1)
# - node-exporter-xxx (1/1 per node)

# Access Prometheus UI
open https://prometheus.airgap.local
# Navigate to: Status → Service Discovery
# Should see ServiceMonitors for:
# - lumen-api (namespace: lumen)
# - traefik (namespace: traefik)
# - gitea (namespace: gitea)
# - argocd (namespace: argocd)
# - node-exporter (namespace: monitoring)
# - kube-state-metrics (namespace: monitoring)

# Navigate to: Status → Targets
# All targets should be "UP" (green)

open https://grafana.airgap.local
# Login: admin/admin
# Check dashboards available:
# - Kubernetes / Compute Resources / Cluster
# - Lumen API - Airgap Monitoring (custom)
# - Node Exporter / Nodes
# - ... 40+ total

# Generate traffic to Lumen API
for i in {1..100}; do
curl -k https://lumen-api.airgap.local/hello
sleep 0.1
done
# Check in Grafana → Lumen API Dashboard
# - HTTP Requests Total should increment
# - Request Rate should show spike
# - Total /hello Requests should increase

kubectl get applications -n argocd
# Expected:
# NAME SYNC STATUS HEALTH STATUS
# kube-prometheus-stack Synced Healthy
# lumen-app Synced Healthy
# lumen-network-policies Synced Healthy
# traefik                  Synced        Healthy

Symptom:
kubectl get pods -n monitoring
# prometheus-xxx   0/2   ImagePullBackOff

Diagnosis:
kubectl describe pod <pod-name> -n monitoring | grep "Failed to pull image"
# Error: Failed to pull image "quay.io/prometheus/prometheus:v3.5.1"

Solution: The values file is not properly overriding the image registry.
# Fix values-airgap-override.yaml
prometheus:
prometheusSpec:
image:
registry: localhost:5000 # ADD THIS
repository: prometheus/prometheus
      tag: v3.5.1

Re-deploy:
helm upgrade kube-prometheus-stack ./manifests/kube-prometheus-stack-helm \
-n monitoring \
  -f manifests/kube-prometheus-stack-helm/values-airgap-override.yaml

Symptom: Prometheus UI → Service Discovery shows 0 ServiceMonitors.
Diagnosis:
kubectl get servicemonitor -n lumen -o yaml
# Check if label "release: kube-prometheus-stack" existsSolution: Add label to ServiceMonitor:
metadata:
labels:
    release: kube-prometheus-stack  # CRITICAL

Also check the Prometheus selector:
prometheus:
prometheusSpec:
    serviceMonitorSelector: {}  # Empty = match ALL

Symptom: Dashboard panels show "No data" despite metrics existing.
Diagnosis:
# Check Grafana can reach Prometheus
kubectl exec -n monitoring -it deploy/kube-prometheus-stack-grafana -- \
  wget -qO- http://kube-prometheus-stack-prometheus:9090/api/v1/query?query=up

Solution: Check that a NetworkPolicy allows Grafana → Prometheus:
# In monitoring namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-grafana-to-prometheus
namespace: monitoring
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: prometheus
policyTypes: [Ingress]
ingress:
- from:
- podSelector:
matchLabels:
app.kubernetes.io/name: grafana
ports:
- protocol: TCP
          port: 9090

Symptom: Browser shows "ERR_TOO_MANY_REDIRECTS" when accessing https://argocd.airgap.local
Root Cause: ArgoCD v3+ enforces TLS by default. When Traefik terminates TLS, ArgoCD tries to redirect to HTTPS, causing a loop.
Solution: Enable insecure mode:
kubectl patch configmap argocd-cmd-params-cm -n argocd \
--type merge \
-p '{"data":{"server.insecure":"true"}}'
kubectl rollout restart deployment argocd-server -n argocd

Symptom:
Error: failed to create typed patch object: field not declared in schema
Root Cause: Helm chart v69 requires updated CRDs for Prometheus 3.x.
Solution: Apply CRDs before Helm upgrade:
kubectl apply --server-side \
-f manifests/kube-prometheus-stack-helm/charts/crds/crds/ \
--force-conflicts
# Then run helm upgrade
helm upgrade kube-prometheus-stack ./manifests/kube-prometheus-stack-helm \
-n monitoring \
  -f manifests/kube-prometheus-stack-helm/values-airgap-override.yaml

| Service | URL | Credentials |
|---|---|---|
| Prometheus | https://prometheus.airgap.local | None |
| Grafana | https://grafana.airgap.local | admin/admin |
| AlertManager | https://alertmanager.airgap.local | None |
| ArgoCD | https://argocd.airgap.local | admin/[see secret] |
Get ArgoCD password:
kubectl -n argocd get secret argocd-initial-admin-secret \
-o jsonpath="{.data.password}" | base64 -dscripts/08-pull-kube-prometheus-stack.sh- Download monitoring images + Helm chartscripts/09-pull-argocd.sh- Download ArgoCD v3.2.0 images
push-kube-prometheus-stack.sh- Push monitoring images to registrypush-argocd.sh- Push ArgoCD images to registry
manifests/kube-prometheus-stack-helm/values-airgap-override.yaml- Airgap configuration
manifests/kube-prometheus-stack/servicemonitors/- Custom ServiceMonitors (4 files)manifests/kube-prometheus-stack/dashboards/- Custom Grafana dashboards (1 file)
manifests/argocd/02-install-airgap.yaml- ArgoCD v3.2.0 installationmanifests/argocd/08-application-kube-prometheus.yaml- ArgoCD Application for Helm chart
- ✅ Downloading Helm charts in connected zone
- ✅ Extracting and customizing charts
- ✅ Overriding image registries for airgap deployment
- ✅ Helm release management (install, upgrade, rollback)

- ✅ Declarative monitoring: ServiceMonitor/PrometheusRule CRDs vs manual ConfigMaps
- ✅ Dynamic discovery: add ServiceMonitor → auto-scraped by Prometheus
- ✅ Operator reconciliation loop: Operator watches CRDs, updates Prometheus config
- ✅ Production pattern: separation of concerns (Operator manages Prometheus instances)

- ✅ Node Exporter: hardware metrics (CPU, RAM, disk, network per node)
- ✅ kube-state-metrics: K8s object metrics (pod count, deployment status, PVC usage)
- ✅ Pre-built dashboards: 40+ Grafana dashboards for K8s monitoring
- ✅ Production alerts: 100+ alert rules (pod crash loops, high memory, etc.)

- ✅ Prometheus 2.x → 3.x: breaking changes, TSDB migration, new features
- ✅ Grafana 10 → 12: dashboard schema updates, UI improvements
- ✅ ArgoCD 2.x → 3.x: TLS enforcement, insecure mode configuration
- ✅ CRD management: server-side apply, force-conflicts resolution

- ✅ ArgoCD managing Helm charts from Git repository
- ✅ Auto-sync and self-heal capabilities
- ✅ Declarative application management
- ✅ Version control for infrastructure
Completed the Logs pillar of observability and added Go profiling to lumen-api.
Why not loki-stack? The grafana/loki-stack chart is officially deprecated (no more updates). The standalone grafana/loki chart v6.53.0 is the replacement.
Why not S3/MinIO? Single-node airgap setup — filesystem storage is simpler, sufficient, and avoids the deprecated MinIO subchart.
Deployment mode: SingleBinary — all Loki components (ingester, querier, compactor, etc.) run in one pod. Correct for single-node K3s.
Key config:
deploymentMode: SingleBinary
loki:
auth_enabled: false
storage:
type: filesystem
schemaConfig:
configs:
- from: "2024-04-01"
store: tsdb # TSDB replaces deprecated boltdb-shipper
object_store: filesystem
        schema: v13            # Current schema version

Images used:
| Image | Tag | Role |
|---|---|---|
| localhost:5000/grafana/loki | 3.6.5 | Log storage + query engine |
| localhost:5000/nginxinc/nginx-unprivileged | 1.29-alpine | Loki gateway |
| localhost:5000/kiwigrid/k8s-sidecar | 1.30.9 | Config sidecar |
Why not Promtail? Promtail is EOL March 2026 — no more updates or security patches. Grafana Alloy is the official replacement.
Alloy runs as a DaemonSet (one pod per node) and:
- Discovers all pods via Kubernetes API
- Tails pod logs from node filesystem
- Parses JSON logs from lumen-api (via stage.json)
- Extracts a level label for filtering in Grafana
- Ships to the Loki gateway (see the configuration sketch below)
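A condensed sketch of that pipeline in Alloy's configuration language (the chart-rendered DaemonSet config is longer, with node-level file discovery and relabeling; component names are Alloy's standard ones, the label mapping is an assumption):

```alloy
discovery.kubernetes "pods" {
  role = "pod"
}

loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.process.json.receiver]
}

loki.process "json" {
  // Parse lumen-api JSON logs and promote "level" to a Loki label
  stage.json {
    expressions = { "level" = "level" }
  }
  stage.labels {
    values = { "level" = "" }
  }
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki-gateway.monitoring.svc.cluster.local/loki/api/v1/push"
  }
}
```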
Images used:
| Image | Tag | Role |
|---|---|---|
| localhost:5000/grafana/alloy | v1.13.1 | Log collector DaemonSet |
| localhost:5000/prometheus-operator/prometheus-config-reloader | v0.81.0 | Config reloader sidecar |
pprof endpoints added to app.go:
- /debug/pprof/ — index
- /debug/pprof/profile — CPU profile (30s by default)
- /debug/pprof/trace — execution trace
- /debug/pprof/symbol — symbol lookup
- /debug/pprof/cmdline — command-line args
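A sketch of how those routes can be wired onto the existing mux, using the standard library's net/http/pprof handlers (the exact wiring in app.go may differ):

```go
package main

import (
	"log"
	"net/http"
	"net/http/pprof"
)

// registerPprof exposes the profiling endpoints listed above on a custom mux.
func registerPprof(mux *http.ServeMux) {
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile) // CPU profile, 30s by default
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)     // execution trace
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
}

func main() {
	mux := http.NewServeMux()
	registerPprof(mux)
	log.Fatal(http.ListenAndServe(":8080", mux)) // port is illustrative
}
```

A CPU profile can then be captured with, for example, go tool pprof https://lumen-api.airgap.local/debug/pprof/profile.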
Structured JSON logging via log/slog (stdlib, no external deps):
{"time":"2026-02-17T10:00:00Z","level":"INFO","msg":"request","method":"GET","path":"/hello","status":200,"duration_ms":1,"remote_addr":"10.0.0.1:12345"}Alloy parses these JSON fields and indexes level as a Loki label — enables filtering by {level="ERROR"} in Grafana Explore.
Added to kube-prometheus-stack values (additionalDataSources):
additionalDataSources:
- name: Loki
type: loki
uid: loki
url: http://loki-gateway.monitoring.svc.cluster.local
access: proxy
jsonData:
      maxLines: 1000

Grafana → Explore → Loki:
# All lumen namespace logs
{namespace="lumen"}
# Only lumen-api errors
{namespace="lumen", app="lumen-api", level="ERROR"}
# Search for specific text
{namespace="lumen"} |= "Redis"
# Parse JSON and filter by HTTP status
{namespace="lumen", app="lumen-api"} | json | status >= 500
| Pillar | Status | Stack |
|---|---|---|
| Metrics | ✅ Complete | Prometheus 3.5.1 + Grafana 12.4.0 |
| Logs | ✅ Complete | Loki 3.6.5 + Alloy v1.13.1 |
| Traces | ⏳ Pending | Tempo (next) |
Completed the Traces pillar of observability. lumen-api v1.2.0 now emits OpenTelemetry spans to Grafana Tempo, and trace_id appears in every structured log — enabling one-click navigation from a Loki log line to the corresponding Tempo trace in Grafana.
Mode: Monolithic (single binary, all components in one pod).
Storage: Local filesystem — no S3/MinIO needed for single-node airgap.
Deployment: Helm chart grafana/tempo:1.24.4, extracted to 03-airgap-zone/manifests/tempo/.
Key config (values-airgap.yaml):
tempo:
registry: localhost:5000
repository: grafana/tempo
tag: "2.10.0"
storage:
trace:
backend: local
local:
path: /var/tempo/traces
wal:
path: /var/tempo/wal
receivers:
otlp:
protocols:
grpc:
endpoint: "0.0.0.0:4317"
http:
endpoint: "0.0.0.0:4318" # lumen-api sends traces here
retention: 336h # 14 days
persistence:
enabled: true
storageClassName: local-path
    size: 5Gi

ArgoCD Application: 03-airgap-zone/manifests/argocd/12-application-tempo.yaml
- sync-wave: "6" (after Loki wave 4, Alloy wave 5)
- namespace: monitoring
- Helm releaseName: tempo
Access: https://tempo.airgap.local (Traefik IngressRoute 17-tempo-ingressroute.yaml)
Go SDK versions used:
| Package | Version | Note |
|---|---|---|
| go.opentelemetry.io/otel | v1.37.0 | Core SDK |
| go.opentelemetry.io/otel/sdk | v1.37.0 | TracerProvider |
| go.opentelemetry.io/otel/trace | v1.37.0 | Span API |
| go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp | v1.37.0 | OTLP HTTP exporter |
| go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp | v0.62.0 | HTTP middleware |
Version note: otelhttp v0.65.0 requires otel@v1.40.0 and is incompatible with v1.37.0. Use v0.62.0, which is fully compatible.
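A go.mod require block consistent with the table above would look roughly like this (module path and surrounding entries omitted):

```go
require (
	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.62.0
	go.opentelemetry.io/otel v1.37.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.37.0
	go.opentelemetry.io/otel/sdk v1.37.0
	go.opentelemetry.io/otel/trace v1.37.0
)
```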
New files:
internal/tracing/tracing.go — TracerProvider initialization:
func Init(ctx context.Context) (func(context.Context) error, error) {
endpoint := os.Getenv("TEMPO_ENDPOINT")
// default: http://tempo.monitoring.svc.cluster.local:4318
	exporter, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpointURL(endpoint),
		otlptracehttp.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
res := resource.NewWithAttributes(semconv.SchemaURL,
semconv.ServiceName("lumen-api"),
semconv.ServiceVersion("v1.2.0"),
)
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exporter),
sdktrace.WithResource(res),
sdktrace.WithSampler(sdktrace.AlwaysSample()),
)
otel.SetTracerProvider(tp)
return tp.Shutdown, nil
}

internal/middleware/tracing.go — OTel HTTP middleware:
func Tracing(serviceName string) func(http.Handler) http.Handler {
return func(next http.Handler) http.Handler {
return otelhttp.NewHandler(next, serviceName,
otelhttp.WithMessageEvents(otelhttp.ReadEvents, otelhttp.WriteEvents),
)
}
}

Middleware chain in app.go:
handler := middleware.Recovery(
middleware.Tracing("lumen-api")( // OTel: creates root span per request
middleware.Logging( // slog: injects trace_id from context
middleware.Metrics(m)(mux), // Prometheus: records HTTP metrics
),
),
)

Child spans in handlers (handlers.go):
var tracer = otel.Tracer("lumen-api")
func (h *Handler) Health(w http.ResponseWriter, r *http.Request) {
ctx, span := tracer.Start(r.Context(), "health.check")
defer span.End()
_, redisSpan := tracer.Start(ctx, "redis.ping")
// ... redis ping
redisSpan.End()
span.SetAttributes(attribute.String("health.status", status))
}
func (h *Handler) Hello(w http.ResponseWriter, r *http.Request) {
ctx, span := tracer.Start(r.Context(), "hello.handler")
defer span.End()
_, redisSpan := tracer.Start(ctx, "redis.increment")
// ... redis incr
redisSpan.SetAttributes(attribute.Int64("counter.value", counter))
redisSpan.End()
}

trace_id in logs (middleware/logging.go):
span := trace.SpanFromContext(r.Context())
slog.Info("request",
"method", r.Method,
"path", r.URL.Path,
"status", wrapped.statusCode,
"duration_ms", time.Since(start).Milliseconds(),
"trace_id", span.SpanContext().TraceID().String(), // NEW
)

Every request log now contains trace_id, e.g.:
{"time":"2026-02-18T10:00:00Z","level":"INFO","msg":"request","method":"GET","path":"/hello","status":200,"duration_ms":1,"trace_id":"173db01794a77852a12..."}Updated kube-prometheus-stack-helm/values.yaml (additionalDataSources):
additionalDataSources:
- name: Loki
type: loki
uid: loki
url: http://loki-gateway.monitoring.svc.cluster.local
jsonData:
maxLines: 1000
derivedFields:
- name: TraceID
matcherRegex: '"trace_id":"(\w+)"' # parse trace_id from JSON log
url: '${__value.raw}'
datasourceUid: tempo # → open in Tempo
- name: Tempo
type: tempo
uid: tempo
url: http://tempo.monitoring.svc.cluster.local:3200
jsonData:
httpMethod: GET
tracesToLogsV2: # Tempo → Loki drilldown
datasourceUid: loki
spanStartTimeShift: '-1m'
spanEndTimeShift: '1m'
tags:
          - key: app

File: 03-airgap-zone/manifests/network-policies/14-allow-tempo.yaml
Three policies (a sketch of the first one follows the list):
- allow-tempo-otlp-egress (lumen ns): lumen-api → monitoring, port 4318 (OTLP HTTP)
- tempo-otlp-ingress (monitoring ns): accept from lumen on 4318/4317, from grafana on 3200, from traefik on 3200
- grafana-tempo-egress (monitoring ns): grafana → tempo:3200, grafana → loki:3100
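The shape of the first policy, as a sketch (selectors are assumptions; the actual manifest in 14-allow-tempo.yaml is authoritative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-tempo-otlp-egress
  namespace: lumen
spec:
  podSelector:
    matchLabels:
      app: lumen-api
  policyTypes: [Egress]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 4318   # OTLP HTTP
```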
lumen-api deployment (excerpt):

    image: localhost:5000/lumen-api:v1.2.0
env:
- name: TEMPO_ENDPOINT
value: "http://tempo.monitoring.svc.cluster.local:4318"Grafana → Explore → Tempo → Search → Service Name: lumen-api
Shows all traces with:
- Trace ID (e.g., 173db01794a77852...)
- Root span name (/hello, /health)
- Duration
- Child spans: hello.handler → redis.increment
Grafana → Explore → Loki → query {namespace="lumen"} → click trace_id value → opens Tempo
| Pillar | Status | Stack |
|---|---|---|
| Metrics | ✅ Complete | Prometheus 3.5.1 + Grafana 12.4.0 |
| Logs | ✅ Complete | Loki 3.6.5 + Alloy v1.13.1 |
| Traces | ✅ Complete | Grafana Tempo 2.10.0 + OpenTelemetry Go SDK v1.37.0 |
- kube-prometheus-stack Helm Chart
- Prometheus 3.0 Announcement
- Grafana v12 Release Notes
- ArgoCD v3.0 Release
- Prometheus Operator Documentation
- Loki Deployment Modes
- Grafana Alloy — Promtail Migration
- Promtail EOL Announcement
- VERSION-COMPARISON.md - Detailed version comparison
- TESTING-MONITORING.md - Testing procedures
- Grafana Tempo Documentation
- OpenTelemetry Go SDK
- OTel contrib otelhttp
Last Updated: February 18, 2026
Project: Lumen Airgap Kubernetes
Phases Covered: Phase 10 (kube-prometheus-stack), Phase 11/12 (Upgrades), Phase 15 (Loki + Alloy + Tempo + OpenTelemetry)