Plan for migrating cfp-live-cluster off ingress-nginx onto Envoy Gateway
ingress-nginx is EOL (March 2026). Sister sandbox cluster (cfp-sandbox-cluster) just completed the equivalent migration; this issue captures the staged process for live, incorporating what worked and what we'd do differently.
Updated 2026-05-18 after sandbox migration fully landed. Target civic-cloud is v1.9.2 which includes hairpin-proxy removal — no separate decommission needed (was a phase 2.5 in earlier draft). Per-app HTTPRoute parentRefs simplified. New phase 3.5 for HTTP→HTTPS redirect. New phase 5.5 for projection-driven decommission of ingress-nginx.
Updated 2026-05-18 (post-deploy) after live's phases 1–3.5 deployed in #145. Added required manual coredns cleanup to phase 1 (hairpin-proxy-controller patches kube-system coredns ConfigMap at runtime — deletion of the controller doesn't undo what it already wrote). Sharpened the post-deploy DNS verification expectation. Bumped hologit-flake retry expectation for repos with many upstream sources.
Phase 1: Update cluster-template and prereqs
Goal: get the Gateway API foundation in place without affecting current traffic.
- Bump civic-cloud source ref in this repo from v1.7.7 → v1.9.2 (latest). This brings:
  - cert-manager 1.13.3 → 1.20.2 (Gateway API integration enabled, ListenerSets feature gate enabled)
  - Gateway API v1.5.1 CRDs (standard channel)
  - Envoy Gateway v1.7.3 controller (installs to envoy-gateway-system)
  - hairpin-proxy removed from the LKE blueprint (see note below)
  - Server-side apply for CRDs in the deploy workflow (necessary: gateway-api CRDs are too large for client-side apply)
- Add _infra/envoy-gateway/ directory with three files (copy from cfp-sandbox-cluster):
  - gatewayclass.yaml: GatewayClass eg
  - envoyproxy.yaml: EnvoyProxy shared, mergeGateways: true (essential: one LoadBalancer for all Gateways)
  - main-gateway.yaml: Gateway main-gateway with HTTP catchall listener, allowedRoutes.namespaces.from: All (cert-manager solver target + HTTP path for HTTP→HTTPS redirect)
Note on hairpin-proxy
The v1.9.2 bump drops hairpin-proxy from the GitOps projection — the deploy workflow's "Apply manifests: deleted resources" step will remove the controller Deployment, haproxy Deployment, RBAC, namespace, and the coredns-custom ConfigMap.
Why it goes away: Linode LKE now supports LoadBalancer hairpin natively (in-cluster pods can reach the cluster's own LB external IP). hairpin-proxy was the workaround for that limitation. Even more important: with ingress-nginx phasing out, hairpin-proxy's haproxy backend is hardcoded to ingress-nginx — so it was actively misrouting in-cluster traffic away from Envoy and breaking cert-manager HTTP-01 self-checks (this bit us on sandbox before phase 3).
Verification after phase 1 deploys: confirm in-cluster DNS now resolves to a public LB IP, not a ClusterIP:
KUBECONFIG=~/.kube/cfp-live-cluster-kubeconfig.yaml kubectl run -i --rm test-dns \
--image=curlimages/curl --restart=Never --quiet --timeout=15s \
-- sh -c 'nslookup any-live-hostname.live.k8s.phl.io 2>&1 | tail -4'
Expect the Address line to be a public LB IP: the existing ingress-nginx LB IP if you haven't cut DNS over yet, or the new Envoy LB IP (from kubectl get svc -n envoy-gateway-system) once a hostname's DNS has moved. The key thing being verified is that it's NOT a 10.x.x.x ClusterIP (which would mean hairpin-proxy was still intercepting) and NOT NXDOMAIN (which would mean stale rewrites; see required cleanup below).
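The pass/fail rule for that Address line can be written down as a small classifier. This is a minimal sketch: the function name classify_dns_answer is hypothetical, and treating every 10.x.x.x address as a ClusterIP is an assumption taken from the expectation stated here, not something read from the cluster's actual service CIDR.

```shell
# classify_dns_answer ADDRESS
# Hypothetical helper: buckets the answer address from the nslookup test.
# Assumption: any 10.x.x.x address is an in-cluster ClusterIP.
classify_dns_answer() {
  case "$1" in
    "" | NXDOMAIN) echo "nxdomain" ;;    # stale rewrite still present
    10.*)          echo "cluster-ip" ;;  # hairpin-proxy still intercepting
    *)             echo "public" ;;      # ingress-nginx or Envoy LB IP
  esac
}
```

Only a "public" result passes the check; "cluster-ip" and "nxdomain" map to the two failure modes named in the expectation.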
Required cleanup after phase 1 deploys — strip stale rewrites from kube-system coredns ⚠️
The hairpin-proxy-controller continuously patches the kube-system/coredns ConfigMap at runtime, inserting rewrite name <host> hairpin-proxy.hairpin-proxy.svc.cluster.local # Added by hairpin-proxy lines for each public hostname into the main Corefile. Deleting the controller via GitOps does NOT remove what it already wrote — those patches survive in the live ConfigMap and now point at a deleted Service.
On live this left 14 dead rewrites after the phase-1 deploy. Symptom: in-cluster DNS for those hostnames briefly returns NXDOMAIN; even after it settles, the Corefile is full of stale dead-end rewrites that should be cleaned up.
The coredns-custom ConfigMap that gets deleted by the projection is a separate file (loaded via import custom/*.include) — that one is correctly removed by the GitOps deletion step. The runtime patches into the main Corefile are what need manual cleanup.
export KUBECONFIG=~/.kube/<cluster>-kubeconfig.yaml  # replace <cluster>; export so the kubectl calls below pick it up
# 1. Confirm how many stale rewrites are present
kubectl get cm -n kube-system coredns -o jsonpath='{.data.Corefile}' | grep -c 'Added by hairpin-proxy'
# 2. Back up first
kubectl get cm -n kube-system coredns -o yaml > coredns-cm-backup.yaml
# 3. Patch out the rewrites
CLEANED=$(kubectl get cm -n kube-system coredns -o jsonpath='{.data.Corefile}' | sed '/# Added by hairpin-proxy/d')
kubectl patch cm -n kube-system coredns --type=merge -p "$(jq -nc --arg c "$CLEANED" '{data:{Corefile:$c}}')"
# 4. Wait ~30s for coredns reload plugin to pick it up (no pod restart needed)
sleep 30 && kubectl get cm -n kube-system coredns -o jsonpath='{.data.Corefile}' | grep -c 'Added by hairpin-proxy' # should print 0
The reload plugin in the standard LKE Corefile hot-swaps the config without restarting coredns pods. Verify by re-running the DNS test from the previous step.
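Before patching the live ConfigMap, it can help to dry-run the same filter against a local copy of the Corefile. The sketch below reuses the sed expression from step 3; the helper names and the sample Corefile are illustrative assumptions, not content copied from the live cluster.

```shell
# Hypothetical helpers; both read Corefile text on stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the status clean.
count_hairpin_rewrites() { grep -c '# Added by hairpin-proxy' || true; }
strip_hairpin_rewrites() { sed '/# Added by hairpin-proxy/d'; }

# Illustrative Corefile excerpt (made up for this example).
sample_corefile='.:53 {
    rewrite name echo-http.live.k8s.phl.io hairpin-proxy.hairpin-proxy.svc.cluster.local # Added by hairpin-proxy
    rewrite name metrics.live.k8s.phl.io hairpin-proxy.hairpin-proxy.svc.cluster.local # Added by hairpin-proxy
    forward . /etc/resolv.conf
    reload
}'

# Count the stale lines, then show what the Corefile looks like without them.
printf '%s\n' "$sample_corefile" | count_hairpin_rewrites
printf '%s\n' "$sample_corefile" | strip_hairpin_rewrites
```

Piping the jsonpath output from step 1 through strip_hairpin_rewrites and diffing it against the original shows exactly which lines the patch in step 3 would drop.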
Heads-up on cert-manager 1.18 changes:
Certificate.spec.privateKey.rotationPolicy default changes from Never → Always. Every Cert without an explicit rotationPolicy: Never will rotate keys on next renewal. Standard HTTPS clients don't care; pinned-key clients (none expected here) would break.
Certificate.spec.revisionHistoryLimit default changes from unset → 1. Old CertificateRequest resources get garbage-collected.
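To see which Certs will pick up the new rotationPolicy default, one option is to list every Certificate that has no explicit value set. This is a sketch under assumptions: the helper names are hypothetical, and it relies on kubectl's custom-columns output printing <none> for an unset field.

```shell
# Hypothetical helpers for auditing the rotationPolicy default change.
# list_cert_policies prints one "namespace name policy" row per Certificate;
# kubectl prints <none> when spec.privateKey.rotationPolicy is unset.
list_cert_policies() {
  kubectl get certificates -A --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,POLICY:.spec.privateKey.rotationPolicy'
}

# certs_without_policy reads those rows on stdin and keeps the unset ones,
# printed as namespace/name.
certs_without_policy() {
  awk '$3 == "<none>" { print $1 "/" $2 }'
}
```

Running `list_cert_policies | certs_without_policy` against the cluster gives the set of Certs that will rotate keys on next renewal unless rotationPolicy: Never is added to them.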
Heads-up on hologit flaky CI: the Build k8s-manifests workflow intermittently fails with fatal: shallow file has changed since we read it — concurrent git fetch --depth=1 race in hologit's source-fetching code. Tracked at JarvusInnovations/hologit#450. Workaround: rerun the workflow. Retry count scales with the number of upstream sources — sandbox typically hit 1-2 retries; live needed 4 in #145 because it pulls more chart pins (each concurrent fetch has a small race window, so cumulative probability of at least one failure rises with source count).
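The retry scaling can be made concrete with a little arithmetic. If each of n fetches independently hits the race with probability p, the chance that a build fails at least once is 1 - (1 - p)^n. The sketch below computes that; the function name is hypothetical, and p = 0.05 is a made-up illustrative value, not a measured failure rate.

```shell
# p_any_failure N P
# Chance that at least one of N independent fetches hits the race,
# assuming each one fails with probability P (P here is illustrative only).
p_any_failure() {
  awk -v n="$1" -v p="$2" 'BEGIN { printf "%.3f\n", 1 - (1 - p) ^ n }'
}

p_any_failure 10 0.05   # prints 0.401
p_any_failure 30 0.05   # prints 0.785
```

Under those assumed numbers, tripling the source count roughly doubles the chance that any single build needs a rerun, which is consistent with live needing more retries than sandbox.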
At the end of phase 1: cluster has new infrastructure (Envoy Gateway + new CRDs + cert-manager 1.20), hairpin-proxy is gone, traffic still flows through ingress-nginx, nothing visibly changes for users.
Phase 2: Set up a parallel cluster issuer
Goal: a new ClusterIssuer using the Gateway-native solver, leaving the existing letsencrypt-prod (with the ingress-nginx solver) untouched.
Add new ClusterIssuers, e.g. letsencrypt-prod-gateway and letsencrypt-staging-gateway, in _infra/cert-manager/issuers.yaml:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod-gateway
spec:
  acme:
    email: services@codeforphilly.org
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod-gateway
    solvers:
      - http01:
          gatewayHTTPRoute:
            parentRefs:
              - group: gateway.networking.k8s.io
                kind: Gateway
                name: main-gateway
                namespace: envoy-gateway-system
Why parallel and not mutate (this is the main lesson from the sandbox): mutating the existing letsencrypt-prod solver would change renewal behavior for all existing Certs, including the Ingress-managed ones. They'd still renew (via the new HTTPRoute solver if DNS reaches Envoy), but it couples paths in a way that's harder to reason about and harder to revert.
Existing Certs continue using letsencrypt-prod (the old solver). New Gateway-issued Certs use letsencrypt-prod-gateway. Clean separation.
Phase 3: Provision new Gateway + HTTPRoute resources for every domain
Goal: every public hostname has an Envoy Gateway path ready to receive traffic. No DNS changes yet — everything still flows through ingress-nginx in production.
Use _gateways/ central pile (one file per app). Each file:
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: <app>
  namespace: <app>
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod-gateway
spec:
  gatewayClassName: eg
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: <host>
      tls:
        mode: Terminate
        certificateRefs:
          - name: <app>-gw-tls
      allowedRoutes:
        namespaces:
          from: Same
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: <app>
  namespace: <app>
spec:
  parentRefs:
    - name: <app>  # per-app HTTPS gateway only; HTTP handled by global redirect (phase 3.5)
  hostnames:
    - <host>
  rules:
    - backendRefs:
        - name: <backend-service>
          port: <port>
Important parentRefs detail: per-app HTTPRoutes attach only to the per-app HTTPS Gateway. They do NOT reference main-gateway. HTTP traffic on port 80 is handled by a single global redirect HTTPRoute (phase 3.5).
cert-manager auto-creates its own short-lived HTTPRoute per ACME challenge, with pathType: Exact on /.well-known/acme-challenge/<token>. That route attaches to both main-gateway and the per-app Gateway (the latter rejects because it has no HTTP listener — fine, the main-gateway attachment carries the validation traffic).
Domain inventory (from current Ingresses)
| Namespace | Hostname(s) | Notes |
| --- | --- | --- |
| balancer | balancerproject.org | apex domain, no wildcard |
| browserless-chrome | browserless-chrome.live.k8s.phl.io | wildcard-subdomain (if a wildcard exists) |
| chime | penn-chime.phl.io, penn-chime.live.k8s.phl.io | apex + subdomain |
| choose-native-plants | choose-native-plants.live.k8s.phl.io, choosenativeplants.com, www.choosenativeplants.com | apex + www + subdomain |
| code-for-philly | codeforphilly.org, www.codeforphilly.org, codeforphilly.live.k8s.phl.io | apex + www + subdomain |
| echo-http | echo-http.live.k8s.phl.io | subdomain |
| grafana | metrics.live.k8s.phl.io | subdomain |
| sealed-secrets | sealed-secrets.live.k8s.phl.io | subdomain |
| third-places | third-places.live.k8s.phl.io | subdomain |
| vaultwarden | vaultwarden.phl.io, bitwarden.phl.io | apex + alias |
For multi-hostname apps: one Gateway with multiple HTTPS listeners (one per hostname, each with its own cert), OR one listener with a multi-SAN cert. Per-hostname listener is the clean Gateway-API pattern.
Cert Secret naming: use <app>-gw-tls suffix to avoid collision with existing Ingress-managed Certs (<app>-tls). Both can coexist until each app's Ingress is removed.
Test with staging issuer first per app — issue via letsencrypt-staging-gateway to validate the whole path (Gateway provisioning, cert-manager solver HTTPRoute creation, ACME challenge routing). Avoids Let's Encrypt rate limits and gives a safe smoke test before producing real certs.
Note on apex domain ACME challenges: cert-manager's HTTP-01 solver hits the apex hostname over port 80. For this to reach Envoy, the apex A record needs to point at Envoy's LB OR DNS needs a CNAME flattening that resolves through Envoy. If the apex still points at ingress-nginx (haven't migrated yet), the challenge will fail. So provisioning the new cert for an apex domain blocks on the DNS cutover for that domain. Plan order: migrate DNS for apex domains AS the cert is provisioned (DNS cutover and cert issuance happen together for apex domains, separately for wildcard-resolved subdomains).
At the end of phase 3: new Gateway resources exist, all <app>-gw-tls Certs are in Issuing / pending state with failed to perform self check ... EOF because cert-manager's in-cluster self-check resolves each hostname to whatever the current wildcard or apex A record points at — and at this point that's still ingress-nginx (no Ingress for the solver path → EOF). Per-hostname certs unblock as DNS cuts over in phase 4. (Note: on sandbox the wildcard was flipped to Envoy mid-migration so non-apex certs issued without per-hostname DNS work, but the safer pattern documented in phase 4 below is per-hostname cutover with the wildcard staying on ingress-nginx until phase 5.) No traffic moved.
Phase 3.5: Add HTTP→HTTPS redirect
Goal: force HTTPS for all traffic reaching Envoy's HTTP listener, without breaking ACME validation.
Add _infra/envoy-gateway/http-redirect.yaml:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: http-redirect
  namespace: envoy-gateway-system
spec:
  parentRefs:
    - name: main-gateway
  # No hostnames → matches anything reaching main-gateway's HTTP listener.
  # cert-manager's per-challenge HTTPRoute uses pathType: Exact on
  # /.well-known/acme-challenge/<token> and a hostname filter. Both are more
  # specific than this rule, so ACME validation traffic bypasses the
  # redirect and reaches the solver Pod.
  rules:
    - filters:
        - type: RequestRedirect
          requestRedirect:
            scheme: https
            statusCode: 301
How this stays compatible with ACME: Gateway API conflict resolution prefers more-specific HTTPRoutes. cert-manager's solver HTTPRoute has both a hostname filter AND an Exact path match (path.type: Exact in HTTPRoute terms) on the challenge URL. Both are more specific than this rule's "no hostname, default / prefix." So validation traffic always wins.
Safe to add anytime after phase 2 (parallel issuer exists). Doesn't depend on per-app Gateways being ready.
Phase 4: Async DNS migration per hostname
Goal: move each hostname to Envoy's LB IP independently, on each project team's timeline.
For each hostname:
- Verify the corresponding Gateway/HTTPRoute is healthy in the cluster: kubectl get gateway -n <app>, listener Programmed=True, HTTPRoute Accepted=True.
- Verify cert is issued (kubectl get cert <app>-gw-tls, Ready: True). For apex domains, this may not be possible until the DNS cutover happens; flip them simultaneously.
- Update the A record (or CNAME): point hostname at Envoy LB IP.
- Wait DNS TTL (typically 5 min).
- Test: curl https://<host>/ resolves through Envoy; cert is valid; backend responds. curl http://<host>/ returns 301 → https://<host>/.
- Don't delete the old Ingress yet; leave it as a fallback. The DNS now bypasses it, but if something breaks, reverting DNS routes back to ingress-nginx + old Ingress.
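The redirect half of the Test step can be scripted. The sketch below checks a set of HTTP response headers for a 301 whose Location starts with https://<host>/. The function name is hypothetical, and it assumes the headers arrive in the usual `curl -sI` layout (status line first, one header per line).

```shell
# is_https_redirect HOST
# Hypothetical helper: reads HTTP response headers on stdin and succeeds only
# if the status is 301 and the Location header starts with https://HOST/.
is_https_redirect() {
  local host="$1"
  # Strip carriage returns so header values compare cleanly.
  tr -d '\r' | awk -v want="https://${host}/" '
    NR == 1 { status = $2 }                           # e.g. "HTTP/1.1 301 ..."
    tolower($1) == "location:" { location = $2 }      # header names are case-insensitive
    END { exit !(status == "301" && index(location, want) == 1) }
  '
}
```

Usage after a cutover would look like `curl -sI "http://<host>/" | is_https_redirect "<host>" && echo ok`, with the real hostname substituted.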
For wildcard-covered subdomains (*.live.k8s.phl.io): the wildcard A record stays pointed at ingress-nginx. Per-hostname migration uses specific A records that override the wildcard for that hostname. (Standard DNS behavior — a specific record always wins over wildcard for the same name.)
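The override behavior can be illustrated with a toy resolver. This is a deliberately simplified model, not real DNS resolution: it only handles one wildcard level, the function name is hypothetical, and the 192.0.2.x addresses are documentation placeholders standing in for the ingress-nginx and Envoy LB IPs.

```shell
# resolve_with_wildcard NAME
# Toy model: reads "record address" pairs on stdin, prints the address for
# NAME. An exact record wins; otherwise a matching "*." wildcard is used.
resolve_with_wildcard() {
  awk -v name="$1" '
    $1 == name { exact = $2 }
    substr($1, 1, 2) == "*." {
      suffix = substr($1, 2)                          # e.g. ".live.k8s.phl.io"
      if (length(name) > length(suffix) &&
          substr(name, length(name) - length(suffix) + 1) == suffix)
        wild = $2
    }
    END {
      if (exact != "") print exact
      else if (wild != "") print wild
      else exit 1                                     # no record covers NAME
    }
  '
}

# Placeholder zone: wildcard still on ingress-nginx, one hostname moved to Envoy.
records='*.live.k8s.phl.io 192.0.2.10
echo-http.live.k8s.phl.io 192.0.2.20'

printf '%s\n' "$records" | resolve_with_wildcard echo-http.live.k8s.phl.io     # prints 192.0.2.20
printf '%s\n' "$records" | resolve_with_wildcard third-places.live.k8s.phl.io  # prints 192.0.2.10
```

The migrated hostname gets its own record's address while every other subdomain keeps following the wildcard, which is what lets hostnames move one at a time.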
Suggested order (low risk → high risk):
- echo-http, sealed-secrets: internal services, low-stakes
- metrics.live.k8s.phl.io (grafana), third-places, browserless-chrome: moderate
- choose-native-plants.live.k8s.phl.io, chime (subdomain first, then apex)
- code-for-philly.live.k8s.phl.io (subdomain) before codeforphilly.org + www
- Production apex domains last: balancerproject.org, vaultwarden.phl.io, bitwarden.phl.io, choosenativeplants.com, codeforphilly.org
For each apex domain, the cert can only issue after DNS points at Envoy (HTTP-01 challenge needs to reach Envoy). For these you may want a backout plan ready in case the new path has issues — keep DNS TTLs short during the cutover.
At the end of phase 4: every hostname's DNS points at Envoy; ingress-nginx still serves stale traffic from clients with cached DNS, but no fresh traffic.
Phase 5: Disable per-app Ingresses
Goal: turn off Ingress generation at the source so the next phase removes them via GitOps in one shot.
For each app whose Helm chart/manifests live in this repo, set ingress.enabled: false in its release-values.yaml. For kustomize-managed apps (e.g. balancer), drop the Ingress resource from kustomization.yaml. For raw-YAML apps (e.g. echo-http), remove the Ingress doc.
Wait at least 1 hour after the last DNS cutover (longest plausible client DNS cache window) before doing this. The Ingresses stop being useful the moment DNS cuts over, but leaving them as fallback during that hour is cheap insurance.
Also identify any Ingresses managed by external CIs (e.g. preview-deploy workflows, app-side deploy pipelines). Those need coordination with each project team — they should ship a release that stops creating the Ingress. Once their CI is fixed, kubectl delete ingress the orphans.
On sandbox, the external-CI Ingresses were:
- code-for-philly/latest, laddr/latest (laddr's emergence-site Helm chart)
- codeforphilly-rewrite-sandbox/codeforphilly (rewrite project's own kustomize)
Live likely has similar setups. Tracked separately in #159 for sandbox.
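One way to survey live for leftover Ingresses is to list them all with their class and filter. This is a sketch under assumptions: the helper names are hypothetical, the class name "nginx" in the usage line is a guess at what live uses, and Ingresses that set the class through the legacy kubernetes.io/ingress.class annotation will show <none> in this column and need a separate look.

```shell
# Hypothetical helpers for surveying remaining Ingresses.
# list_ingresses prints one "namespace name class" row per Ingress.
list_ingresses() {
  kubectl get ingress -A --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName'
}

# filter_by_class CLASS reads those rows on stdin and prints namespace/name
# for every row whose class column equals CLASS.
filter_by_class() {
  awk -v c="$1" '$3 == c { print $1 "/" $2 }'
}
```

Running `list_ingresses | filter_by_class nginx` and comparing the result against what this repo renders shows which Ingresses must be coming from external CI.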
Side effect for some charts: disabling the Ingress may also strip auxiliary config that read from ingress.hosts[0]. On sandbox this hit:
- grafana: server.domain config (used for absolute URLs in emails etc.); restore via grafana.ini.server.domain
- metabase: MB_SITE_URL env var (auth callbacks, embeds); restore via configs.metabase.MB_SITE_URL
Spot-check each chart's render diff after setting ingress.enabled: false and add direct hostname config if the chart was inferring it from the Ingress.
Phase 5.5: Decommission ingress-nginx via projection
Goal: remove ingress-nginx itself from the cluster.
Add an exclusion to .holo/branches/k8s-manifests/_civic-cloud.toml so the ingress-nginx files stop projecting:
[holomapping]
holosource = "=>k8s-blueprint-lke"
files = [
"**",
"!ingress-nginx/**",
"!.holo/lenses/ingress-nginx.toml",
]
before = "*"
Why per-cluster, not upstream: cluster-template still ships ingress-nginx for other clusters that haven't migrated. The exclusion lives in each downstream cluster repo at decommission time.
Both excludes are needed: the first drops the helm chart input. The second drops the lens config — without it, the helm3 lens errors trying to read ingress-nginx/ from the (now empty) input tree.
On commit + deploy, 19 resources drop out:
- Cluster-scoped: Namespace, 2× ClusterRole, 2× ClusterRoleBinding, ValidatingWebhookConfiguration, IngressClass
- Namespace-scoped: Deployment, 2× Service (incl. the LoadBalancer — Linode frees the LB), 2× ServiceAccount, 2× Role, 2× RoleBinding, ConfigMap, 2× Job
If phase 5 was done in the same PR, the Ingress resources also disappear at the same time. Otherwise stale Ingresses sit as orphans referencing a non-existent IngressClass — inert, cleanup later.
Phase 6: Final cleanup
- Delete the old Ingress-managed Certs (<app>-tls) once nothing references them. They'll stop renewing once cert-manager's letsencrypt-prod ClusterIssuer is gone, so they'd expire eventually anyway, but cleaning them up sooner is tidier.
- Delete the old letsencrypt-prod ClusterIssuer (or leave for posterity).
- Remove server-side-apply special handling for the gateway-api CRDs (if any); they continue working via standard apply now.
- Audit for stale <app>-tls Secrets, orphan Certs from canceled Ingresses, etc.
Lessons from cfp-sandbox-cluster (do/don't list)
Do
- Use parallel ClusterIssuer instead of mutating the existing one.
- Use new Secret names (<app>-gw-tls) to avoid collision with existing Ingress-managed Certs.
- Pre-populate central _gateways/ pile before any DNS cuts over. Per-app files delete cleanly when each project ships their own.
- mergeGateways: true on EnvoyProxy is mandatory. Without it, every Gateway gets its own LB. Cost explodes.
- Single global HTTP→HTTPS redirect HTTPRoute on main-gateway (phase 3.5). One resource, no per-app duplication, ACME still works via Gateway API conflict resolution.
- Verify in-cluster DNS resolves externally after the phase 1 deploy.
- Audit helm chart values that read from ingress.hosts (grafana, metabase on sandbox) and set the host directly before disabling the Ingress.
- Test via kubectl apply -f <workspace-file> before merging: same content as GitOps, no drift, fast feedback. Don't kubectl apply -f - from heredocs.
- Per-hostname HTTPS listeners, one per app's Gateway. cert-manager + the annotation does the rest.
Don't
- Don't add main-gateway to per-app HTTPRoute parentRefs. The HTTP→HTTPS redirect on main-gateway handles all HTTP traffic. Per-app HTTPRoutes only attach to their per-app HTTPS Gateway.
- Don't flip wildcard DNS in one shot. On sandbox this broke every hostname that didn't yet have an HTTPRoute. On live, per-hostname DNS cutover is much safer.
- Don't delete HTTPRoutes manually after a kubectl apply; the apply already updated them. Deletion after means traffic interruption + cert-manager retries that hit transient routing windows. Trust the apply.
- Don't issue gateway certs before hairpin-proxy is gone. This is automatic if you follow phases in order (phase 1 deploys → hairpin-proxy goes away → then phase 3 issues certs). The hazard is only if you go out of order and try to issue gateway certs while hairpin-proxy is still routing in-cluster DNS through ingress-nginx: cert-manager's HTTP-01 self-check uses cluster DNS and will fail with "wrong status code '404'" or "got: <!doctype html>" (the app backend's HTML instead of the token).
- Don't use --depth=1 plus high source counts without expecting the hologit fetch race. Reruns are part of life until JarvusInnovations/hologit#450 lands.
- Don't expect ListenerSet to be useful yet. Envoy Gateway v1.7.3 doesn't reconcile ListenerSet resources (lands in v1.8). Stick with per-project Gateways + mergeGateways.
- Don't refresh hologit sources with raw git fetch <url> <refspec>: it auto-pulls upstream tags into local refs/tags/, polluting your tag namespace. Use git holo source fetch <name> instead.
Open questions
- Apex domain ACME: confirm Let's Encrypt can validate against apex domains via Envoy. (Sandbox didn't have apex domains; only subdomain hosts. Live has many apex domains.) If something goes wrong, fallback is DNS-01 (one wildcard cert per zone, or per-name DNS-01).
- Stuck-pod inventory for live: sandbox had two zombies we couldn't address (paws-data-pipeline missing paws-salesforce Secret, codeforphilly-rewrite-sandbox in ImagePullBackOff). Audit live for similar conditions before cutting over so they don't get blamed on the migration.
- External-CI-managed Ingresses on live — survey upfront which apps' Ingresses come from external CI pipelines (likely laddr, possibly others). Each needs source-side coordination, not just a values change in this repo.
References
Plan for migrating cfp-live-cluster off ingress-nginx onto Envoy Gateway
ingress-nginx is EOL (March 2026). Sister sandbox cluster (cfp-sandbox-cluster) just completed the equivalent migration; this issue captures the staged process for live, incorporating what worked and what we'd do differently.
Phase 1: Update cluster-template and prereqs
Goal: get the Gateway API foundation in place without affecting current traffic.
v1.7.7→v1.9.2(latest). This brings:ListenerSetsfeature gate enabled)envoy-gateway-system)_infra/envoy-gateway/directory with three files (copy from cfp-sandbox-cluster):gatewayclass.yaml—GatewayClass egenvoyproxy.yaml—EnvoyProxy shared,mergeGateways: true(essential: one LoadBalancer for all Gateways)main-gateway.yaml—Gateway main-gatewaywith HTTP catchall listener,allowedRoutes.namespaces.from: All(cert-manager solver target + HTTP path for HTTP→HTTPS redirect)Note on hairpin-proxy
The v1.9.2 bump drops hairpin-proxy from the GitOps projection — the deploy workflow's "Apply manifests: deleted resources" step will remove the controller Deployment, haproxy Deployment, RBAC, namespace, and the
coredns-customConfigMap.Why it goes away: Linode LKE now supports LoadBalancer hairpin natively (in-cluster pods can reach the cluster's own LB external IP). hairpin-proxy was the workaround for that limitation. Even more important: with ingress-nginx phasing out, hairpin-proxy's haproxy backend is hardcoded to ingress-nginx — so it was actively misrouting in-cluster traffic away from Envoy and breaking cert-manager HTTP-01 self-checks (this bit us on sandbox before phase 3).
Verification after phase 1 deploys: confirm in-cluster DNS now resolves to the external Envoy LB IP, not a ClusterIP:
Expect the Address line to be a public LB IP (the existing ingress-nginx LB IP if you haven't cut DNS over yet, or the new Envoy LB IP
kubectl get svc -n envoy-gateway-systemonce a hostname's DNS has moved). The key thing being verified is that it's NOT a10.x.x.xClusterIP (which would mean hairpin-proxy was still intercepting) and NOT NXDOMAIN (which would mean stale rewrites — see required cleanup below).Required cleanup after phase 1 deploys — strip stale rewrites from kube-system coredns⚠️
The hairpin-proxy-controller continuously patches the
kube-system/corednsConfigMap at runtime, insertingrewrite name <host> hairpin-proxy.hairpin-proxy.svc.cluster.local # Added by hairpin-proxylines for each public hostname into the main Corefile. Deleting the controller via GitOps does NOT remove what it already wrote — those patches survive in the live ConfigMap and now point at a deleted Service.On live this left 14 dead rewrites after the phase-1 deploy. Symptom: in-cluster DNS for those hostnames briefly returns NXDOMAIN; even after it settles, the Corefile is full of stale dead-end rewrites that should be cleaned up.
The
coredns-customConfigMap that gets deleted by the projection is a separate file (loaded viaimport custom/*.include) — that one is correctly removed by the GitOps deletion step. The runtime patches into the main Corefile are what need manual cleanup.The
reloadplugin in the standard LKE Corefile hot-swaps the config without restarting coredns pods. Verify by re-running the DNS test from the previous step.Heads-up on cert-manager 1.18 changes:
Certificate.spec.privateKey.rotationPolicydefault changes fromNever→Always. Every Cert without an explicitrotationPolicy: Neverwill rotate keys on next renewal. Standard HTTPS clients don't care; pinned-key clients (none expected here) would break.Certificate.spec.revisionHistoryLimitdefault changes from unset →1. OldCertificateRequestresources get garbage-collected.Heads-up on hologit flaky CI: the
Build k8s-manifestsworkflow intermittently fails withfatal: shallow file has changed since we read it— concurrentgit fetch --depth=1race in hologit's source-fetching code. Tracked at JarvusInnovations/hologit#450. Workaround: rerun the workflow. Retry count scales with the number of upstream sources — sandbox typically hit 1-2 retries; live needed 4 in #145 because it pulls more chart pins (each concurrent fetch has a small race window, so cumulative probability of at least one failure rises with source count).At the end of phase 1: cluster has new infrastructure (Envoy Gateway + new CRDs + cert-manager 1.20), hairpin-proxy is gone, traffic still flows through ingress-nginx, nothing visibly changes for users.
Phase 2: Set up a parallel cluster issuer
Goal: a new ClusterIssuer using the Gateway-native solver, leaving the existing
letsencrypt-prod(with the ingress-nginx solver) untouched.Add new ClusterIssuers, e.g.
letsencrypt-prod-gatewayandletsencrypt-staging-gateway, in_infra/cert-manager/issuers.yaml:Why parallel and not mutate (this is the main lesson from the sandbox): mutating the existing
letsencrypt-prodsolver would change renewal behavior for all existing Certs, including the Ingress-managed ones. They'd still renew (via the new HTTPRoute solver if DNS reaches Envoy), but it couples paths in a way that's harder to reason about and harder to revert.Existing Certs continue using
letsencrypt-prod(the old solver). New Gateway-issued Certs useletsencrypt-prod-gateway. Clean separation.Phase 3: Provision new Gateway + HTTPRoute resources for every domain
Goal: every public hostname has an Envoy Gateway path ready to receive traffic. No DNS changes yet — everything still flows through ingress-nginx in production.
Use
_gateways/central pile (one file per app). Each file:Important parentRefs detail: per-app HTTPRoutes attach only to the per-app HTTPS Gateway. They do NOT reference
main-gateway. HTTP traffic on port 80 is handled by a single global redirect HTTPRoute (phase 3.5).cert-manager auto-creates its own short-lived HTTPRoute per ACME challenge, with
pathType: Exacton/.well-known/acme-challenge/<token>. That route attaches to both main-gateway and the per-app Gateway (the latter rejects because it has no HTTP listener — fine, the main-gateway attachment carries the validation traffic).Domain inventory (from current Ingresses)
For multi-hostname apps: one Gateway with multiple HTTPS listeners (one per hostname, each with its own cert), OR one listener with a multi-SAN cert. Per-hostname listener is the clean Gateway-API pattern.
Cert Secret naming: use
<app>-gw-tlssuffix to avoid collision with existing Ingress-managed Certs (<app>-tls). Both can coexist until each app's Ingress is removed.Test with staging issuer first per app — issue via
letsencrypt-staging-gatewayto validate the whole path (Gateway provisioning, cert-manager solver HTTPRoute creation, ACME challenge routing). Avoids Let's Encrypt rate limits and gives a safe smoke test before producing real certs.Note on apex domain ACME challenges: cert-manager's HTTP-01 solver hits the apex hostname over port 80. For this to reach Envoy, the apex A record needs to point at Envoy's LB OR DNS needs a CNAME flattening that resolves through Envoy. If the apex still points at ingress-nginx (haven't migrated yet), the challenge will fail. So provisioning the new cert for an apex domain blocks on the DNS cutover for that domain. Plan order: migrate DNS for apex domains AS the cert is provisioned (DNS cutover and cert issuance happen together for apex domains, separately for wildcard-resolved subdomains).
At the end of phase 3: new Gateway resources exist, all
<app>-gw-tlsCerts are inIssuing/ pending state withfailed to perform self check ... EOFbecause cert-manager's in-cluster self-check resolves each hostname to whatever the current wildcard or apex A record points at — and at this point that's still ingress-nginx (no Ingress for the solver path → EOF). Per-hostname certs unblock as DNS cuts over in phase 4. (Note: on sandbox the wildcard was flipped to Envoy mid-migration so non-apex certs issued without per-hostname DNS work, but the safer pattern documented in phase 4 below is per-hostname cutover with the wildcard staying on ingress-nginx until phase 5.) No traffic moved.Phase 3.5: Add HTTP→HTTPS redirect
Goal: force HTTPS for all traffic reaching Envoy's HTTP listener, without breaking ACME validation.
Add
_infra/envoy-gateway/http-redirect.yaml:How this stays compatible with ACME: Gateway API conflict resolution prefers more-specific HTTPRoutes. cert-manager's solver HTTPRoute has both a hostname filter AND
pathType: Exacton the challenge URL — both more specific than this rule's "no hostname, default/prefix." So validation traffic always wins.Safe to add anytime after phase 2 (parallel issuer exists). Doesn't depend on per-app Gateways being ready.
Phase 4: Async DNS migration per hostname
Goal: move each hostname to Envoy's LB IP independently, on each project team's timeline.
For each hostname:
kubectl get gateway -n <app>, listenerProgrammed=True, HTTPRouteAccepted=True.kubectl get cert <app>-gw-tls,Ready: True). For apex domains, this may not be possible until the DNS cutover happens — flip them simultaneously.curl https://<host>/resolves through Envoy; cert is valid; backend responds.curl http://<host>/returns301 → https://<host>/.For wildcard-covered subdomains (
*.live.k8s.phl.io): the wildcard A record stays pointed at ingress-nginx. Per-hostname migration uses specific A records that override the wildcard for that hostname. (Standard DNS behavior — a specific record always wins over wildcard for the same name.)Suggested order (low risk → high risk):
echo-http,sealed-secrets— internal services, low-stakesmetrics.live.k8s.phl.io(grafana),third-places,browserless-chrome— moderatechoose-native-plants.live.k8s.phl.io,chime(subdomain first, then apex)code-for-philly.live.k8s.phl.io(subdomain) beforecodeforphilly.org+wwwbalancerproject.org,vaultwarden.phl.io,bitwarden.phl.io,choosenativeplants.com,codeforphilly.orgFor each apex domain, the cert can only issue after DNS points at Envoy (HTTP-01 challenge needs to reach Envoy). For these you may want a backout plan ready in case the new path has issues — keep DNS TTLs short during the cutover.
At the end of phase 4: every hostname's DNS points at Envoy; ingress-nginx still serves stale traffic from clients with cached DNS, but no fresh traffic.
Phase 5: Disable per-app Ingresses
Goal: turn off Ingress generation at the source so the next phase removes them via GitOps in one shot.
For each app whose Helm chart/manifests live in this repo, set ingress.enabled: false in its release-values.yaml. For kustomize-managed apps (e.g. balancer), drop the Ingress resource from kustomization.yaml. For raw-YAML apps (e.g. echo-http), remove the Ingress doc.
Wait at least 1 hour after the last DNS cutover (longest plausible client DNS cache window) before doing this. The Ingresses stop being useful the moment DNS cuts over, but leaving them as fallback during that hour is cheap insurance.
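
For a Helm-managed app, the values change might look like the fragment below. This is a sketch assuming a Grafana-style chart that exposes ingress.enabled and grafana.ini at the top level of its values; key paths vary per chart and may be nested under a subchart key. The explicit domain line anticipates the side effect described later in this phase.

```yaml
# release-values.yaml (sketch; key paths are chart-specific assumptions)
ingress:
  enabled: false            # stop rendering the Ingress
grafana.ini:
  server:
    # Set explicitly, since the chart can no longer infer it
    # from ingress.hosts[0] once the Ingress is disabled.
    domain: metrics.live.k8s.phl.io
```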
Also identify any Ingresses managed by external CIs (e.g. preview-deploy workflows, app-side deploy pipelines). Those need coordination with each project team: they should ship a release that stops creating the Ingress. Once their CI is fixed, kubectl delete ingress the orphans.
On sandbox, the external-CI Ingresses were:
- code-for-philly/latest, laddr/latest (laddr's emergence-site Helm chart)
- codeforphilly-rewrite-sandbox/codeforphilly (rewrite project's own kustomize)
Live likely has similar setups. Tracked separately in #159 for sandbox.
Side effect for some charts: disabling the Ingress may also strip auxiliary config that read from ingress.hosts[0]. On sandbox this hit:
- grafana: server.domain config (used for absolute URLs in emails etc.); restore via grafana.ini.server.domain
- metabase: MB_SITE_URL env var (auth callbacks, embeds); restore via configs.metabase.MB_SITE_URL
Spot-check each chart's render diff after setting ingress.enabled: false and add direct hostname config if the chart was inferring it from the Ingress.
Phase 5.5: Decommission ingress-nginx via projection
Goal: remove ingress-nginx itself from the cluster.
Add an exclusion to .holo/branches/k8s-manifests/_civic-cloud.toml so the ingress-nginx files stop projecting:
Why per-cluster, not upstream: cluster-template still ships ingress-nginx for other clusters that haven't migrated. The exclusion lives in each downstream cluster repo at decommission time.
Both excludes are needed: the first drops the helm chart input. The second drops the lens config; without it, the helm3 lens errors trying to read ingress-nginx/ from the (now empty) input tree.
On commit + deploy, 19 resources drop out:
If phase 5 was done in the same PR, the Ingress resources also disappear at the same time. Otherwise stale Ingresses sit as orphans referencing a non-existent IngressClass — inert, cleanup later.
Phase 6: Final cleanup
- [...] <app>-tls) once nothing references them. They'll stop renewing once cert-manager's letsencrypt-prod ClusterIssuer is gone, so they'd expire eventually anyway, but cleaning them up sooner is tidier.
- [...] letsencrypt-prod ClusterIssuer (or leave for posterity).
- [...] <app>-tls Secrets, orphan Certs from canceled Ingresses, etc.
Lessons from cfp-sandbox-cluster (do/don't list)
Do
- [...] <app>-gw-tls) to avoid collision with existing Ingress-managed Certs.
- [...] _gateways/ pile before any DNS cuts over. Per-app files delete cleanly when each project ships their own.
- mergeGateways: true on EnvoyProxy is mandatory. Without it, every Gateway gets its own LB. Cost explodes.
- [...] ingress.hosts (grafana, metabase on sandbox) and set the host directly before disabling the Ingress.
- kubectl apply -f <workspace-file> before merging: same content as GitOps, no drift, fast feedback. Don't kubectl apply -f - from heredocs.
Don't
- [...] main-gateway to per-app HTTPRoute parentRefs. The HTTP→HTTPS redirect on main-gateway handles all HTTP traffic. Per-app HTTPRoutes only attach to their per-app HTTPS Gateway.
- [...] kubectl apply: the apply already updated them. Deletion after means traffic interruption + cert-manager retries that hit transient routing windows. Trust the apply.
- [...] --depth=1 plus high source counts without expecting the hologit fetch race. Reruns are part of life until JarvusInnovations/hologit#450 lands.
- [...] mergeGateways.
- git fetch <url> <refspec>: it auto-pulls upstream tags into local refs/tags/, polluting your tag namespace. Use git holo source fetch <name> instead.
Open questions
- [...] paws-salesforce Secret, codeforphilly-rewrite-sandbox in ImagePullBackOff). Audit live for similar conditions before cutting over so they don't get blamed on the migration.
References