This document covers the two stateful data stores deployed in the lumen airgap cluster.
```
lumen namespace
├── redis-master-0 (node-1, PVC 1Gi) ← writes
├── redis-replica-0 (node-2) ← reads
└── redis-sentinel-{0,1,2} ← leader election (quorum=2)
```
- 1 master + 1 replica + 3 sentinels — tolerates 1 node failure
- Sentinels use quorum=2 to elect a new master if the current one is unreachable
- lumen-api connects via `go-redis/v9` `NewFailoverClient`, which discovers the master automatically through the sentinels:

```
REDIS_MODE=sentinel
REDIS_SENTINEL_ADDRS=redis-sentinel.lumen.svc.cluster.local:26379
REDIS_MASTER_NAME=mymaster
```
```
master pod dies
→ sentinels detect unavailability (within ~5s)
→ quorum reached (2/3 sentinels agree)
→ replica promoted to master
→ lumen-api reconnects automatically via sentinel
→ downtime < 10 seconds
```
CNPG is a Kubernetes-native operator (CNCF sandbox) that manages PostgreSQL clusters as CRDs. Key advantages over a plain StatefulSet:
- Automatic master election and failover (Raft-based, quorum voting)
- Two distinct services: `-rw` (master only) and `-ro` (replicas) for read/write splitting
- Auto-generated credentials in a Secret (`lumen-db-app`)
- Prometheus metrics via a manually managed PodMonitor (port 9187)
- WAL archiving support for backups
```
lumen namespace
├── lumen-db-N (master — lumen-db-rw → port 5432)
├── lumen-db-N (replica — lumen-db-ro → port 5432)
└── lumen-db-N (witness — vote only, no data)

cnpg-system namespace
└── cnpg-controller-manager (operator)
```
Pod names use dynamic numbering (e.g. `lumen-db-4`, `lumen-db-5`). After a failover or PVC recreation the index increments; use `kubectl get pods -n lumen -l cnpg.io/cluster=lumen-db` to find the current primary.
Quorum: 3 instances → quorum = 2 → tolerates 1 failure.

```
master fails → replica + witness vote → replica promoted → downtime < 30s
```
Why 1 witness instead of 3 full replicas?
With only 2 nodes (master + replica), if the master becomes unreachable, the replica can't know if it's truly dead or just partitioned — risk of split-brain (both claim to be master → data corruption). The witness is a lightweight 3rd voter (~50MB RAM, no data) that breaks the tie safely.
| Service | Target | Usage |
|---|---|---|
| `lumen-db-rw` | Current master | Writes (INSERT, UPDATE, DELETE) |
| `lumen-db-ro` | Replicas | Reads (SELECT) |
| `lumen-db-r` | Any instance | Internal CNPG use |
lumen-api maintains two separate connection pools:
```go
// store/postgres.go
type PostgresStore struct {
	rw *pgxpool.Pool // → lumen-db-rw:5432 (writes)
	ro *pgxpool.Pool // → lumen-db-ro:5432 (reads)
}
```

Routes:
- `POST /items`, `DELETE /items/{id}` → `rw` pool
- `GET /items`, `GET /items/{id}` → `ro` pool
Credentials are auto-generated by CNPG in the Secret `lumen-db-app`:
```yaml
# Deployment env vars
- name: PG_RW_DSN
  value: "postgresql://$(PG_USER):$(PG_PASSWORD)@lumen-db-rw.lumen.svc.cluster.local:5432/$(PG_DBNAME)"
- name: PG_RO_DSN
  value: "postgresql://$(PG_USER):$(PG_PASSWORD)@lumen-db-ro.lumen.svc.cluster.local:5432/$(PG_DBNAME)"
```

Failover sequence:

```
lumen-db-N (master) pod deleted
→ CNPG operator detects loss
→ replica + witness vote (quorum=2 ✅)
→ replica promoted to master
→ lumen-db-rw service endpoint updated automatically
→ lumen-api reconnects (pgxpool retries)
→ downtime < 30 seconds
```
Tested: deleting the primary pod triggers automatic promotion in ~30s.
Images used (from internal registry 192.168.2.2:5000):
- `192.168.2.2:5000/cloudnative-pg:1.25.1` — operator
- `192.168.2.2:5000/postgresql:16.6` — PostgreSQL instances
Install script: `install-cnpg.sh`
Known constraint with kube-router (k3s built-in NetworkPolicy controller):
The CNPG instance manager must reach the Kubernetes API server (ClusterIP `10.43.0.1:443`). kube-router does not evaluate `ipBlock` NetworkPolicy rules for ClusterIP destinations, because traffic is rewritten by iptables DNAT before the NetworkPolicy is evaluated. The workaround is to allow all TCP egress for CNPG pods in the `allow-cnpg-intracluster` NetworkPolicy.
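A sketch of that workaround rule (policy name and pod selector label taken from this doc; adjust to the real manifest). Omitting `port` in the egress rule allows all TCP ports:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-cnpg-intracluster
  namespace: lumen
spec:
  podSelector:
    matchLabels:
      cnpg.io/cluster: lumen-db
  policyTypes: ["Egress"]
  egress:
    # Allow all TCP egress so the instance manager can reach the API
    # server ClusterIP despite kube-router's ipBlock limitation.
    - ports:
        - protocol: TCP
```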
namespaceSelector with kube-router:
Use `matchLabels` rather than `matchExpressions` in `namespaceSelector` rules. kube-router (embedded in k3s) has known issues with `matchExpressions` + `operator: In`, producing unexpected iptables ipsets. `matchLabels` with `kubernetes.io/metadata.name` is reliable and sufficient for single-namespace selection.
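For illustration, a rule fragment in the recommended form (the `monitoring` namespace is an example target, not from a specific manifest in this doc):

```yaml
ingress:
  - from:
      # Select the namespace by its well-known metadata.name label with
      # matchLabels — reliable under kube-router, unlike matchExpressions.
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: monitoring
```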
CNPG exposes Prometheus metrics on port 9187 (/metrics) via the built-in pg_exporter. Key metrics:
| Metric | Description |
|---|---|
| `cnpg_pg_database_size_bytes` | Database size per instance |
| `cnpg_backends_total` | Active connections |
| `cnpg_pg_replication_in_recovery` | Whether the instance is a replica |
| `cnpg_pg_replication_lag` | Replication lag in seconds |
| `cnpg_pg_database_xid_age` | XID age (wraparound risk) |
PodMonitor: `cnpg-lumen-db` in the `lumen` namespace with label `release: kube-prometheus-stack`. The `enablePodMonitor` field in the Cluster spec is disabled: it is deprecated in CNPG v1.25+ and the operator-created PodMonitor lacks the label required by the Prometheus CR selector.
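A sketch of what that manually managed PodMonitor looks like (names and labels from this doc; the endpoint port name `metrics` is an assumption):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: cnpg-lumen-db
  namespace: lumen
  labels:
    # Required by the kube-prometheus-stack Prometheus CR selector; the
    # operator-created PodMonitor does not carry this label.
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      cnpg.io/cluster: lumen-db
  podMetricsEndpoints:
    - port: metrics # scrapes port 9187 (/metrics)
```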
Grafana dashboard: "CloudNativePG" (ID 20417) — loaded automatically by the Grafana sidecar from the `cnpg-grafana-dashboard` ConfigMap.
Applying the dashboard ConfigMap (too large for standard kubectl apply):
```shell
kubectl apply --server-side -f 03-airgap-zone/manifests/cnpg/05-grafana-dashboard.yaml
```

Useful commands:

```shell
# Cluster status
kubectl get cluster -n lumen

# Pod distribution
kubectl get pods -n lumen -l cnpg.io/cluster=lumen-db -o wide

# Services
kubectl get svc -n lumen | grep lumen-db

# Credentials
kubectl get secret lumen-db-app -n lumen -o jsonpath='{.data.uri}' | base64 -d

# Connect directly (debug) — replace N with current primary index
kubectl exec -it lumen-db-N -n lumen -- psql -U app app

# Find current primary
kubectl get cluster lumen-db -n lumen -o jsonpath='{.status.currentPrimary}'

# Simulate failover (watch with: kubectl get cluster -n lumen -w)
kubectl delete pod $(kubectl get cluster lumen-db -n lumen -o jsonpath='{.status.currentPrimary}') -n lumen

# Check Prometheus targets (port 9187)
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- \
  wget -qO- "http://localhost:9090/api/v1/targets" | python3 -c "
import sys,json; d=json.load(sys.stdin)
[print(t['scrapeUrl'], t['health']) for t in d['data']['activeTargets'] if '9187' in t.get('scrapeUrl','')]
"
```