This document covers the full CI/CD pipeline: automated build/push via Gitea Actions, progressive canary delivery via Argo Rollouts, and automatic promotion/rollback via Prometheus AnalysisTemplate.
- Architecture Overview
- Gitea Actions CI Pipeline
- Argo Rollouts — Canary Deployments
- AnalysisTemplate — Automatic Promotion
- Full Deploy Workflow (end-to-end)
- Operational Commands
- ArgoCD + Argo Rollouts Integration
- Troubleshooting
```
Developer
  └── git tag v1.x.x && git push gitea v1.x.x
        │
        ▼
Gitea Actions (act_runner v0.2.11, node-1)
  ├── go test ./...
  ├── docker build → 192.168.2.2:5000/lumen-api:v1.x.x
  ├── trivy scan (HIGH/CRITICAL, offline)
  ├── docker push (:semver + :sha + :latest)
  └── sed manifest + git push → Gitea
        │
        ▼
ArgoCD (detects diff in manifest)
        │
        ▼
Argo Rollouts (canary strategy)
  ├── Step 1: 20% → new version
  ├── Step 2: AnalysisRun ──→ Prometheus (5×1min, success rate ≥ 95%)
  │     ├── ✅ pass → continue
  │     └── ✗ fail (×3) → auto rollback
  ├── Step 3: 80% → new version
  ├── Step 4: AnalysisRun ──→ Prometheus (5×1min)
  │     ├── ✅ pass → continue
  │     └── ✗ fail (×3) → auto rollback
  └── Step 5: 100% → full promotion (automatic, no human needed)
```
Key principle: Git is the single source of truth. ArgoCD watches the repo, Argo Rollouts controls traffic, Prometheus decides promotion.
| Event | Job triggered |
|---|---|
| Push to main (app/Dockerfile paths) | test + build-push |
| Pull request to main | test only |
| Git tag v* | test + release |
| Manual dispatch | all |
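The trigger matrix above corresponds roughly to a workflow `on:` block like the following sketch (Gitea Actions uses GitHub Actions-compatible syntax; the exact filter values here are assumptions, not copied from the repo):

```yaml
on:
  push:
    branches: [ main ]     # test + build-push (the real workflow also filters on app/Dockerfile paths)
    tags: [ 'v*' ]         # test + release
  pull_request:
    branches: [ main ]     # test only
  workflow_dispatch:       # manual dispatch → all jobs
```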
- `go test ./... -v -count=1` runs against `01-connected-zone/app/`, fails the pipeline on any test failure.
- Compute `SHORT_SHA` (7 chars) from `github.sha`
- `docker build` → tags `:$SHORT_SHA` + `:latest`
- Trivy scan (offline DB from `/var/cache/trivy`, exit-code 0 — non-blocking)
- `docker push` both tags to `192.168.2.2:5000`
- Build → tags `:$SEMVER` + `:$SHORT_SHA` + `:latest`
- Push all 3 tags
- Update manifest: `sed` replaces the image tag in `03-airgap-zone/manifests/app/03-lumen-api.yaml`
- `git commit` + `git push` back to Gitea → triggers ArgoCD sync
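The short-SHA and manifest-bump steps can be sketched in plain shell (a sketch with made-up values; the real job receives `GITHUB_SHA` and the tag name from the runner environment):

```shell
# Hypothetical values — the runner injects the real ones
GITHUB_SHA="0123456789abcdef0123456789abcdef01234567"
SEMVER="v1.5.4"

# 7-char short SHA, as used for the :$SHORT_SHA image tag
SHORT_SHA=$(printf '%s' "$GITHUB_SHA" | cut -c1-7)
echo "tags: :${SEMVER} :${SHORT_SHA} :latest"

# sed-style image bump, same idea as the pipeline's manifest update step
manifest=$(mktemp)
echo "    image: 192.168.2.2:5000/lumen-api:v1.5.3" > "$manifest"
sed -i "s|lumen-api:v[0-9.]*|lumen-api:${SEMVER}|" "$manifest"
cat "$manifest"   # image tag is now v1.5.4
```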
| Component | Details |
|---|---|
| Runner | act_runner v0.2.11, K8s Deployment in gitea namespace, pinned to node-1 |
| Job container | Custom image 192.168.2.2:5000/lumen-api:builder (golang:1.26-alpine + docker-cli + git) |
| Docker socket | /var/run/docker.sock mounted (builds directly on node-1) |
| Trivy DB | Offline at /var/cache/trivy (pre-populated in connected zone) |
| Registry | 192.168.2.2:5000 (node-1 Docker registry, port 5000) |
| Gitea token | CI_TOKEN secret in Gitea repo settings |
The runner was registered with --instance http://192.168.2.2:30300 (Gitea NodePort), not the K8s internal DNS. This URL becomes GITHUB_SERVER_URL in job containers. Checkout steps use with: github-server-url: http://192.168.2.2:30300 as explicit override.
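A checkout step using that override might look like this sketch (the surrounding step shape is assumed; only the `github-server-url` value comes from the registration described above):

```yaml
- name: Checkout
  uses: actions/checkout@v4
  with:
    github-server-url: http://192.168.2.2:30300
```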
Standard Deployment does a rolling update: all pods switched at once, no traffic control. Argo Rollouts adds:
- Canary: send X% traffic to new version, pause, validate, continue
- Blue/Green: switch instantly between two full environments
- Analysis: auto-promote/rollback based on Prometheus metrics
```
lumen namespace
├── Rollout lumen-api          ← replaces Deployment
│   ├── ReplicaSet stable      ← current version (e.g. v1.5.3)
│   └── ReplicaSet canary      ← new version during rollout
├── Service lumen-api          ← used by Traefik IngressRoute + Prometheus (unchanged)
├── Service lumen-api-stable   ← receives main traffic (managed by Argo Rollouts)
└── Service lumen-api-canary   ← receives canary traffic (managed by Argo Rollouts)

argo-rollouts namespace
├── controller Deployment      ← patches Services + watches Rollouts
└── dashboard Deployment       ← UI at localhost:3100 (port-forward)
```
```yaml
strategy:
  canary:
    stableService: lumen-api-stable
    canaryService: lumen-api-canary
    steps:
      - setWeight: 20    # 20% canary, 80% stable
      - analysis:        # auto-rollback if success rate < 95% for 3 consecutive checks
          templates:
            - templateName: success-rate
      - setWeight: 80    # 80% canary, 20% stable
      - analysis:        # second check before full promotion
          templates:
            - templateName: success-rate
      - setWeight: 100   # full promotion
```

The AnalysisTemplate `success-rate` (05-analysis-template.yaml) queries Prometheus every minute. If the HTTP success rate drops below 95% for 3 consecutive checks, the Rollout is automatically aborted and traffic falls back to the stable version.
Argo Rollouts achieves traffic weighting by scaling the ReplicaSets proportionally:
- At 20%: 1 canary pod / 2 stable pods = ~33% actual (rounded up)
- At 80%: 2 canary pods / 1 stable pod = ~67% actual
- At 100%: 2 canary pods / 0 stable pods, stable ReplicaSet scaled to 0
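The rounding above can be re-derived with integer arithmetic (a sketch assuming `spec.replicas: 2` and ceil scaling on both ReplicaSets; this is not the controller's actual code):

```shell
# Reproduce the ~33% / ~67% / 100% actual weights for spec.replicas=2
replicas=2
for weight in 20 80 100; do
  canary=$(( (replicas * weight + 99) / 100 ))           # ceil(replicas * w%)
  stable=$(( (replicas * (100 - weight) + 99) / 100 ))   # ceil(replicas * (100-w)%)
  actual=$(( (100 * canary + (canary + stable) / 2) / (canary + stable) ))  # rounded %
  echo "setWeight=${weight} -> canary=${canary} stable=${stable} actual=~${actual}%"
done
```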
Argo Rollouts CRDs must exist before lumen-app tries to create a Rollout resource:
Wave 2: argo-rollouts Application (installs CRDs + controller)
Wave 3: lumen-app Application (creates Rollout resource)
lumen-app also has SkipDryRunOnMissingResource=true as a safety net in case wave ordering isn't respected.
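As a sketch, that safety net would sit under the Application's `syncPolicy` (placement assumed from the standard ArgoCD Application layout):

```yaml
syncPolicy:
  syncOptions:
    - SkipDryRunOnMissingResource=true
```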
At each analysis step, Argo Rollouts creates an AnalysisRun that queries Prometheus on a schedule. Based on the result, it either advances the Rollout or triggers an automatic rollback — no human intervention needed.
```
setWeight: 20
└── AnalysisRun starts
    ├── check 1/5 : success rate = 99% ✅
    ├── check 2/5 : success rate = 98% ✅
    ├── check 3/5 : success rate = 97% ✅
    ├── check 4/5 : success rate = 99% ✅
    └── check 5/5 : success rate = 98% ✅ → Successful → setWeight: 80

If 3 consecutive checks fail (rate < 95%):
    ├── check 1/5 : success rate = 40% ✗
    ├── check 2/5 : success rate = 35% ✗
    └── check 3/5 : success rate = 20% ✗ → consecutiveErrors=3 → RolloutAborted
        → traffic back to 100% stable immediately
```
File: `05-analysis-template.yaml`

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: lumen
spec:
  metrics:
    - name: success-rate
      interval: 1m       # evaluate every 1 minute
      count: 5           # 5 checks total (~5 min per analysis step)
      successCondition: "len(result) == 0 || result[0] >= 0.95"
      failureLimit: 3    # 3 consecutive failures → rollback
      provider:
        prometheus:
          address: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(http_requests_total{app="lumen-api",status!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{app="lumen-api"}[2m]))
```

Key design decisions:

- `count: 5` is required when using `interval`. Without it, Argo Rollouts rejects the template as "runs indefinitely".
- `len(result) == 0 || ...` guards against an empty Prometheus result (no traffic yet → NaN → would be treated as failure without this guard).
- The `[2m]` window is short enough to detect issues quickly, long enough to have meaningful data.
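The empty-result guard can be restated as a tiny shell helper (hypothetical, purely to illustrate the successCondition semantics; this is not Argo Rollouts' real expression engine):

```shell
# check [rate] — mimics: len(result) == 0 || result[0] >= 0.95
check() {
  if [ "$#" -eq 0 ]; then echo pass; return; fi   # empty result vector → pass (no traffic)
  awk -v v="$1" 'BEGIN { exit !(v >= 0.95) }' && echo pass || echo fail
}

check          # no traffic yet → pass
check 0.98     # healthy canary → pass
check 0.40     # failing canary → fail
```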
```shell
# List all AnalysisRuns for the current rollout revision
kubectl get analysisrun -n lumen
# Detailed view (metrics, error messages)
kubectl describe analysisrun <name> -n lumen
# Quick status
kubectl argo rollouts get rollout lumen-api -n lumen
# Look for: ✔ Successful / ◌ Running / ✖ Error
```

The Argo Rollouts controller (in the `argo-rollouts` namespace) must reach Prometheus (in the `monitoring` namespace). Added to `19-allow-argo-rollouts.yaml`:
```yaml
# argo-rollouts → monitoring:9090
egress:
  - to:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: monitoring
    ports:
      - protocol: TCP
        port: 9090
```

```shell
# 1. Make code changes in 01-connected-zone/app/

# 2. Commit and push to Gitea
git add .
git commit -m "feat: my change"
git push gitea main   # or: git push-all

# 3. Create and push a semver tag
git tag v1.5.4
git push gitea v1.5.4
# → CI triggers: test → build → push image → update manifest → git push
# → ArgoCD detects manifest diff → syncs → Argo Rollouts starts canary at 20%
```

```shell
# Static view
kubectl argo rollouts get rollout lumen-api -n lumen
# Live watch
kubectl argo rollouts get rollout lumen-api -n lumen --watch
```

Example output at 20% during analysis:
```
Name:            lumen-api
Status:          ◌ Progressing
Message:         AnalysisRunRunning
Strategy:        Canary
  Step:          2/5
  SetWeight:     20
  ActualWeight:  33
Images:          192.168.2.2:5000/lumen-api:v1.5.3 (stable)
                 192.168.2.2:5000/lumen-api:v1.5.4 (canary)
Replicas:
  Desired:  2  Current:  3  Updated:  1  Ready:  3
```
If the analysis passes, the Rollout automatically continues to 80% then 100% — no manual action needed.
```shell
# Skip the current analysis step and force-promote
kubectl argo rollouts promote lumen-api -n lumen
```

```shell
kubectl argo rollouts abort lumen-api -n lumen
# → traffic immediately back to 100% stable
# → canary ReplicaSet scaled down
```

```shell
# Current rollout state
kubectl argo rollouts get rollout lumen-api -n lumen
# Watch in real time
kubectl argo rollouts get rollout lumen-api -n lumen --watch
# List all rollouts
kubectl argo rollouts list rollouts -n lumen
# Dashboard UI (port-forward)
kubectl argo rollouts dashboard
# → http://localhost:3100
```

```shell
# Promote one step (20% → 80%, or 80% → 100%)
kubectl argo rollouts promote lumen-api -n lumen
# Skip all pauses and go straight to 100%
kubectl argo rollouts promote lumen-api -n lumen --full
# Abort → immediate rollback to stable
kubectl argo rollouts abort lumen-api -n lumen
# Retry after abort
kubectl argo rollouts retry rollout lumen-api -n lumen
```

Warning: do NOT use `kubectl argo rollouts set image`. ArgoCD's `selfHeal: true` will immediately revert the change back to Git state.
Instead, edit the manifest in Git and push:
```shell
# Edit 03-airgap-zone/manifests/app/03-lumen-api.yaml
# Change: image: 192.168.2.2:5000/lumen-api:v1.5.3
# To:     image: 192.168.2.2:5000/lumen-api:v1.5.4
git add 03-airgap-zone/manifests/app/03-lumen-api.yaml
git commit -m "chore: bump lumen-api to v1.5.4"
git push gitea main
# → ArgoCD syncs → Argo Rollouts starts canary
```

```shell
# Stable service endpoints (should point to stable pods)
kubectl get endpoints lumen-api-stable -n lumen
# Canary service endpoints (should point to canary pods)
kubectl get endpoints lumen-api-canary -n lumen
# Check which pods are in which ReplicaSet
kubectl get pods -n lumen -l app=lumen-api -o wide
```

| Application | Wave | Namespace | Source path |
|---|---|---|---|
| argo-rollouts | 2 | argo-rollouts | manifests/argo-rollouts-helm |
| lumen-app | 3 | lumen | manifests/app |
The manifests/argocd/ folder is not watched by any ArgoCD Application — these files must be applied manually:
```shell
kubectl apply -f 03-airgap-zone/manifests/argocd/17-application-argo-rollouts.yaml
```

Argo Rollouts installs CRDs that K8s auto-populates with fields (`/status`, `/spec/preserveUnknownFields`). Without `ignoreDifferences`, ArgoCD shows the app as OutOfSync permanently:

```yaml
# 17-application-argo-rollouts.yaml
ignoreDifferences:
  - group: apiextensions.k8s.io
    kind: CustomResourceDefinition
    jsonPointers:
      - /spec/preserveUnknownFields
      - /status
syncOptions:
  - RespectIgnoreDifferences=true
```

`lumen-app` has `selfHeal: true`. This means:
- Any `kubectl` change to a Rollout spec is reverted within ~30s
- The only way to trigger a real canary is by changing the manifest in Git
- This is the correct GitOps behavior: Git is the source of truth
This is expected and correct: ArgoCD reflects the Rollout's state. When a canary is in progress and waiting at a pause: {} step, ArgoCD shows the app as "Paused". It becomes "Healthy" once the Rollout fully promotes to 100%.
Check if ArgoCD reverted a manual image change:
```shell
kubectl argo rollouts get rollout lumen-api -n lumen
# If images show only one version → no canary in progress, just paused at a step
# → promote to complete
kubectl argo rollouts promote lumen-api -n lumen
```

The image doesn't exist in the airgap registry:

```shell
# Check the image exists
ssh ubuntu@192.168.2.2 "curl -s http://localhost:5000/v2/lumen-api/tags/list"
# If missing: the push script must run FROM node-1, not from macOS
# The registry at 192.168.2.2:5000 is node-1's local Docker registry
# From macOS, localhost:5000 is OrbStack → a different registry!
ssh ubuntu@192.168.2.2 "docker push 192.168.2.2:5000/lumen-api:<tag>"
```
```shell
# Check runner is registered and online
kubectl get pods -n gitea -l app=act-runner
kubectl logs -n gitea -l app=act-runner --tail=50
# Check Gitea received the tag
# → https://gitea.airgap.local/lumen/lumen/releases
# Re-push the tag (delete + recreate)
git tag -d v1.x.x
git push gitea :refs/tags/v1.x.x
git tag v1.x.x
git push gitea v1.x.x
```

Expected: caused by K8s auto-populating CRD fields. Verify `ignoreDifferences` is present:

```shell
kubectl get application argo-rollouts -n argocd -o yaml | grep -A 10 ignoreDifferences
```

If missing, re-apply the Application manifest:

```shell
kubectl apply -f 03-airgap-zone/manifests/argocd/17-application-argo-rollouts.yaml
```

If `SkipDryRunOnMissingResource=true` is not set and the Argo Rollouts CRDs don't exist yet:

```shell
kubectl get application lumen-app -n argocd -o yaml | grep -A5 syncOptions
# Should include: SkipDryRunOnMissingResource=true
```
```shell
# See why the analysis failed
kubectl get analysisrun -n lumen
kubectl describe analysisrun <name> -n lumen
# Look for: "Message: Metric 'success-rate' assessed Failed"

# Check the Prometheus query manually
kubectl exec -n argo-rollouts deploy/argo-rollouts -- \
  wget -qO- "http://kube-prometheus-stack-prometheus.monitoring:9090/api/v1/query?query=sum(rate(http_requests_total{app=\"lumen-api\"}[2m]))"
# If empty result → no traffic yet (avoid triggering a canary with zero requests)
```

If the canary pods get no traffic yet (0 requests), the query returns NaN, which Argo Rollouts treats as a failure. Generate some traffic first:

```shell
for i in $(seq 1 20); do curl -sk https://lumen-api.airgap.local/health > /dev/null; done
```

After an abort, the Rollout sits in a Degraded state. Run `retry` to restore it:

```shell
kubectl argo rollouts abort lumen-api -n lumen
kubectl argo rollouts retry rollout lumen-api -n lumen
```

```yaml
steps:
  - setWeight: 20
  - pause: {}        # waits forever until `promote` command
  - setWeight: 100
```

Promotes manually. Best for learning and validation.
```yaml
steps:
  - setWeight: 20
  - pause: {duration: 10m}   # auto-promotes after 10 minutes
  - setWeight: 100
```

Promotes automatically after the timeout. You can still abort during the wait.
```yaml
steps:
  - setWeight: 20
  - analysis:
      templates:
        - templateName: success-rate
  - setWeight: 80
  - analysis:
      templates:
        - templateName: success-rate
  - setWeight: 100
```

The AnalysisTemplate (`05-analysis-template.yaml`) queries Prometheus every minute:

```yaml
successCondition: result[0] >= 0.95   # 95% HTTP success rate
failureLimit: 3                       # 3 consecutive failures → auto rollback
query: |
  sum(rate(http_requests_total{app="lumen-api",status!~"5.."}[2m]))
  /
  sum(rate(http_requests_total{app="lumen-api"}[2m]))
```

If the success rate drops below 95% for 3 checks in a row, Argo Rollouts rolls back to stable automatically. No human intervention needed.
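As a worked instance of the query's arithmetic (the request rates are made up): 9.7 non-5xx requests/s out of 10.0 total requests/s gives a success rate of 0.97, which clears the 0.95 bar:

```shell
# Hypothetical rates: 9.7 non-5xx req/s, 10.0 total req/s
awk 'BEGIN { printf "success rate = %.2f\n", 9.7 / 10.0 }'
# → success rate = 0.97
```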