Databases — Redis HA Sentinel + CloudNativePG PostgreSQL

This document covers the two stateful data stores deployed in the lumen airgap cluster.


Redis HA Sentinel

Architecture

lumen namespace
  ├── redis-master-0      (node-1, PVC 1Gi)   ← writes
  ├── redis-replica-0     (node-2)            ← reads
  └── redis-sentinel-{0,1,2}                  ← leader election (quorum=2)

  • 1 master + 1 replica + 3 sentinels — tolerates 1 node failure
  • Sentinels use quorum=2 to elect a new master if the current one is unreachable
  • lumen-api connects via go-redis/v9 NewFailoverClient — discovers the master automatically via the sentinels

Connection (lumen-api)

REDIS_MODE=sentinel
REDIS_SENTINEL_ADDRS=redis-sentinel.lumen.svc.cluster.local:26379
REDIS_MASTER_NAME=mymaster

Failover behaviour

master pod dies
  → sentinels detect unavailability (within ~5s)
  → quorum reached (2/3 sentinels agree)
  → replica promoted to master
  → lumen-api reconnects automatically via sentinel
  → downtime < 10 seconds

Manifests


CloudNativePG — PostgreSQL Cluster

Why CloudNativePG?

CNPG is a Kubernetes-native operator (CNCF sandbox) that manages PostgreSQL clusters as CRDs. Key advantages over a plain StatefulSet:

  • Automatic master election and failover (operator-driven quorum voting)
  • Two distinct services: -rw (master only) and -ro (replicas) for read/write splitting
  • Auto-generated credentials in a Secret (lumen-db-app)
  • Prometheus metrics via a manually managed PodMonitor (port 9187)
  • WAL archiving support for backups

Architecture

lumen namespace
  ├── lumen-db-N   (master  — lumen-db-rw → port 5432)
  ├── lumen-db-N   (replica — lumen-db-ro → port 5432)
  └── lumen-db-N   (witness — vote only, no data)

cnpg-system namespace
  └── cnpg-controller-manager   (operator)

Pod names use dynamic numbering (e.g. lumen-db-4, lumen-db-5). After a failover or PVC recreation the index increments — use kubectl get pods -n lumen -l cnpg.io/cluster=lumen-db to find the current primary.

Quorum:

3 instances → quorum = 2 → tolerates 1 failure
master fails → replica + witness vote → replica promoted → downtime < 30s

Why 1 witness instead of 3 full replicas?

With only 2 nodes (master + replica), if the master becomes unreachable, the replica can't know if it's truly dead or just partitioned — risk of split-brain (both claim to be master → data corruption). The witness is a lightweight 3rd voter (~50MB RAM, no data) that breaks the tie safely.

Services

Service      Target          Usage
lumen-db-rw  Current master  Writes (INSERT, UPDATE, DELETE)
lumen-db-ro  Replicas        Reads (SELECT)
lumen-db-r   Any instance    Internal CNPG use

Read/Write Splitting (lumen-api)

lumen-api maintains two separate connection pools:

// store/postgres.go
package store

import "github.com/jackc/pgx/v5/pgxpool" // pgx v5 import path assumed

// PostgresStore keeps one pool per CNPG service.
type PostgresStore struct {
    rw *pgxpool.Pool // → lumen-db-rw:5432 (writes)
    ro *pgxpool.Pool // → lumen-db-ro:5432 (reads)
}

Routes:

  • POST /items, DELETE /items/{id} → rw pool
  • GET /items, GET /items/{id} → ro pool
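One way to express that routing rule, assuming only GETs may hit the replicas (illustrative helper; the real handler wiring in lumen-api may differ):

```go
package main

import "fmt"

// poolFor maps an HTTP method to the pool that should serve it:
// reads go to the replicas, everything else to the current master.
// Hypothetical helper for illustration.
func poolFor(method string) string {
	if method == "GET" {
		return "ro" // lumen-db-ro (replicas)
	}
	return "rw" // lumen-db-rw (current master)
}

func main() {
	for _, m := range []string{"GET", "POST", "DELETE"} {
		fmt.Println(m, "→", poolFor(m))
	}
}
```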

Connection (lumen-api)

Credentials are auto-generated by CNPG in Secret lumen-db-app:

# Deployment env vars
- name: PG_RW_DSN
  value: "postgresql://$(PG_USER):$(PG_PASSWORD)@lumen-db-rw.lumen.svc.cluster.local:5432/$(PG_DBNAME)"
- name: PG_RO_DSN
  value: "postgresql://$(PG_USER):$(PG_PASSWORD)@lumen-db-ro.lumen.svc.cluster.local:5432/$(PG_DBNAME)"

Failover behaviour

lumen-db-N (master) pod deleted
  → CNPG operator detects loss
  → replica + witness vote (quorum=2 ✅)
  → replica promoted to master
  → lumen-db-rw service endpoint updated automatically
  → lumen-api reconnects (pgxpool retries)
  → downtime < 30 seconds

Tested: deleting the primary pod triggers automatic promotion in ~30s.

Airgap deployment

Images used (from internal registry 192.168.2.2:5000):

  • 192.168.2.2:5000/cloudnative-pg:1.25.1 — operator
  • 192.168.2.2:5000/postgresql:16.6 — PostgreSQL instances

Install script: install-cnpg.sh

Known constraint with kube-router (k3s built-in NetworkPolicy controller): CNPG instance manager must reach the K8s API server (10.43.0.1:443 ClusterIP). kube-router does not evaluate ipBlock NetworkPolicy rules for ClusterIP destinations (traffic is rewritten by iptables DNAT before NetworkPolicy is evaluated). The workaround is to allow all TCP egress for CNPG pods in the allow-cnpg-intracluster NetworkPolicy.
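The workaround rule might look like the following (field values are assumptions reconstructed from the policy name and pod labels mentioned in this document):

```yaml
# allow-cnpg-intracluster (sketch): allow all TCP egress for CNPG pods,
# since kube-router cannot match the API server ClusterIP via ipBlock.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-cnpg-intracluster
  namespace: lumen
spec:
  podSelector:
    matchLabels:
      cnpg.io/cluster: lumen-db
  policyTypes:
    - Egress
  egress:
    - ports:
        - protocol: TCP   # all TCP ports, all destinations
```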

namespaceSelector with kube-router: Use matchLabels rather than matchExpressions in namespaceSelector rules. kube-router (embedded in k3s) has known issues where matchExpressions with operator: In produces unexpected iptables ipsets. matchLabels with kubernetes.io/metadata.name is reliable and sufficient for single-namespace selection.
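For example, a rule admitting traffic from a monitoring namespace would be written as (illustrative snippet):

```yaml
# Reliable under kube-router: matchLabels on the automatic
# kubernetes.io/metadata.name label, not matchExpressions.
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: monitoring
```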

Monitoring

CNPG exposes Prometheus metrics on port 9187 (/metrics) via the built-in pg_exporter. Key metrics:

Metric                           Description
cnpg_pg_database_size_bytes      Database size per instance
cnpg_backends_total              Active connections
cnpg_pg_replication_in_recovery  Whether instance is a replica
cnpg_pg_replication_lag          Replication lag in seconds
cnpg_pg_database_xid_age         XID age (wraparound risk)

PodMonitor: cnpg-lumen-db in the lumen namespace with label release: kube-prometheus-stack. The enablePodMonitor field in the Cluster spec is disabled — it is deprecated in CNPG v1.25+ and the operator-created PodMonitor lacks the label required by the Prometheus CR selector.
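The manually managed PodMonitor therefore looks roughly like this (selector and port name reconstructed from the description above, not copied from the manifest):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: cnpg-lumen-db
  namespace: lumen
  labels:
    release: kube-prometheus-stack   # required by the Prometheus CR selector
spec:
  selector:
    matchLabels:
      cnpg.io/cluster: lumen-db
  podMetricsEndpoints:
    - port: metrics                  # port 9187, scraped at /metrics
```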

Grafana dashboard: "CloudNativePG" (ID 20417) — loaded automatically by the Grafana sidecar from the cnpg-grafana-dashboard ConfigMap.

Applying the dashboard ConfigMap (too large for client-side kubectl apply, whose last-applied-configuration annotation would exceed the 256 KB annotation size limit):

kubectl apply --server-side -f 03-airgap-zone/manifests/cnpg/05-grafana-dashboard.yaml

Manifests

Useful commands

# Cluster status
kubectl get cluster -n lumen

# Pod distribution
kubectl get pods -n lumen -l cnpg.io/cluster=lumen-db -o wide

# Services
kubectl get svc -n lumen | grep lumen-db

# Credentials
kubectl get secret lumen-db-app -n lumen -o jsonpath='{.data.uri}' | base64 -d

# Connect directly (debug) — replace N with current primary index
kubectl exec -it lumen-db-N -n lumen -- psql -U app app

# Find current primary
kubectl get cluster lumen-db -n lumen -o jsonpath='{.status.currentPrimary}'

# Simulate failover
kubectl delete pod $(kubectl get cluster lumen-db -n lumen -o jsonpath='{.status.currentPrimary}') -n lumen
# Watch: kubectl get cluster -n lumen -w

# Check Prometheus targets (port 9187)
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- \
  wget -qO- "http://localhost:9090/api/v1/targets" | python3 -c "
import sys,json; d=json.load(sys.stdin)
[print(t['scrapeUrl'], t['health']) for t in d['data']['activeTargets'] if '9187' in t.get('scrapeUrl','')]
"