Migrate to Envoy Gateway: ingress-nginx → Gateway API + cert-manager gatewayHTTPRoute #144

@themightychris

Plan for migrating cfp-live-cluster off ingress-nginx onto Envoy Gateway

ingress-nginx is EOL (March 2026). Sister sandbox cluster (cfp-sandbox-cluster) just completed the equivalent migration; this issue captures the staged process for live, incorporating what worked and what we'd do differently.

Updated 2026-05-18 after sandbox migration fully landed. Target civic-cloud is v1.9.2 which includes hairpin-proxy removal — no separate decommission needed (was a phase 2.5 in earlier draft). Per-app HTTPRoute parentRefs simplified. New phase 3.5 for HTTP→HTTPS redirect. New phase 5.5 for projection-driven decommission of ingress-nginx.

Updated 2026-05-18 (post-deploy) after live's phases 1–3.5 deployed in #145. Added required manual coredns cleanup to phase 1 (hairpin-proxy-controller patches kube-system coredns ConfigMap at runtime — deletion of the controller doesn't undo what it already wrote). Sharpened the post-deploy DNS verification expectation. Bumped hologit-flake retry expectation for repos with many upstream sources.

Phase 1: Update cluster-template and prereqs

Goal: get the Gateway API foundation in place without affecting current traffic.

  1. Bump civic-cloud source ref in this repo from v1.7.7 → v1.9.2 (latest). This brings:
    • cert-manager 1.13.3 → 1.20.2 (Gateway API integration enabled, ListenerSets feature gate enabled)
    • Gateway API v1.5.1 CRDs (standard channel)
    • Envoy Gateway v1.7.3 controller (installs to envoy-gateway-system)
    • hairpin-proxy removed from the LKE blueprint (see note below)
    • Server-side apply for CRDs in the deploy workflow (necessary — gateway-api CRDs are too large for client-side apply)
  2. Add _infra/envoy-gateway/ directory with three files (copy from cfp-sandbox-cluster):
    • gatewayclass.yaml → GatewayClass eg
    • envoyproxy.yaml → EnvoyProxy shared, mergeGateways: true (essential: one LoadBalancer for all Gateways)
    • main-gateway.yaml → Gateway main-gateway with HTTP catchall listener, allowedRoutes.namespaces.from: All (cert-manager solver target + HTTP path for HTTP→HTTPS redirect)

Note on hairpin-proxy

The v1.9.2 bump drops hairpin-proxy from the GitOps projection — the deploy workflow's "Apply manifests: deleted resources" step will remove the controller Deployment, haproxy Deployment, RBAC, namespace, and the coredns-custom ConfigMap.

Why it goes away: Linode LKE now supports LoadBalancer hairpin natively (in-cluster pods can reach the cluster's own LB external IP). hairpin-proxy was the workaround for that limitation. Even more important: with ingress-nginx phasing out, hairpin-proxy's haproxy backend is hardcoded to ingress-nginx — so it was actively misrouting in-cluster traffic away from Envoy and breaking cert-manager HTTP-01 self-checks (this bit us on sandbox before phase 3).

Verification after phase 1 deploys: confirm in-cluster DNS now resolves to the external Envoy LB IP, not a ClusterIP:

KUBECONFIG=~/.kube/cfp-live-cluster-kubeconfig.yaml kubectl run -i --rm test-dns \
  --image=curlimages/curl --restart=Never --quiet --timeout=15s \
  -- sh -c 'nslookup any-live-hostname.live.k8s.phl.io 2>&1 | tail -4'

Expect the Address line to be a public LB IP: the existing ingress-nginx LB IP if you haven't cut DNS over yet, or the new Envoy LB IP (shown by kubectl get svc -n envoy-gateway-system) once a hostname's DNS has moved. The key thing being verified is that it's NOT a 10.x.x.x ClusterIP (which would mean hairpin-proxy was still intercepting) and NOT NXDOMAIN (which would mean stale rewrites; see required cleanup below).
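A minimal sketch of how that reading of the Address line could be scripted. The function name classify_addr is hypothetical, and it assumes cluster-internal Service IPs fall in 10.0.0.0/8 as described above; adjust the pattern if the cluster uses a different range.

```shell
# classify_addr ADDRESS
# Prints how the phase-1 DNS check should interpret one resolved address.
# Assumption: ClusterIPs live in 10.0.0.0/8 (taken from the text above).
classify_addr() {
  case "$1" in
    "")   echo "nxdomain" ;;          # nothing resolved: stale rewrites likely
    10.*) echo "cluster-internal" ;;  # hairpin-proxy still intercepting
    *)    echo "public" ;;            # ingress-nginx or Envoy LB IP: expected
  esac
}
```

Pass it the last Address value from the nslookup output by hand, e.g. classify_addr 203.0.113.10 (a documentation-range IP used purely as an illustration).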

Required cleanup after phase 1 deploys — strip stale rewrites from kube-system coredns ⚠️

The hairpin-proxy-controller continuously patches the kube-system/coredns ConfigMap at runtime, inserting rewrite name <host> hairpin-proxy.hairpin-proxy.svc.cluster.local # Added by hairpin-proxy lines for each public hostname into the main Corefile. Deleting the controller via GitOps does NOT remove what it already wrote — those patches survive in the live ConfigMap and now point at a deleted Service.

On live this left 14 dead rewrites after the phase-1 deploy. Symptom: in-cluster DNS for those hostnames briefly returns NXDOMAIN; even after it settles, the Corefile is full of stale dead-end rewrites that should be cleaned up.

The coredns-custom ConfigMap that gets deleted by the projection is a separate file (loaded via import custom/*.include) — that one is correctly removed by the GitOps deletion step. The runtime patches into the main Corefile are what need manual cleanup.

# export so the kubectl calls below actually pick it up
export KUBECONFIG=~/.kube/<cluster>-kubeconfig.yaml
# 1. Confirm how many stale rewrites are present
kubectl get cm -n kube-system coredns -o jsonpath='{.data.Corefile}' | grep -c 'Added by hairpin-proxy'

# 2. Back up first
kubectl get cm -n kube-system coredns -o yaml > coredns-cm-backup.yaml

# 3. Patch out the rewrites
CLEANED=$(kubectl get cm -n kube-system coredns -o jsonpath='{.data.Corefile}' | sed '/# Added by hairpin-proxy/d')
kubectl patch cm -n kube-system coredns --type=merge -p "$(jq -nc --arg c "$CLEANED" '{data:{Corefile:$c}}')"

# 4. Wait ~30s for coredns reload plugin to pick it up (no pod restart needed)
sleep 30 && kubectl get cm -n kube-system coredns -o jsonpath='{.data.Corefile}' | grep -c 'Added by hairpin-proxy'  # should print 0

The reload plugin in the standard LKE Corefile hot-swaps the config without restarting coredns pods. Verify by re-running the DNS test from the previous step.
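Before patching, it can help to preview exactly which hostnames the cleanup will touch. The sketch below assumes the rewrite lines have the shape quoted above (rewrite name <host> ... # Added by hairpin-proxy). The function names hairpin_hosts and strip_hairpin_rewrites are hypothetical helpers, not part of any tool.

```shell
# hairpin_hosts: read a Corefile on stdin, print each hostname that has a
# hairpin-proxy rewrite (third field of "rewrite name <host> ...").
hairpin_hosts() {
  awk '/# Added by hairpin-proxy/ && $1 == "rewrite" && $2 == "name" { print $3 }'
}

# strip_hairpin_rewrites: read a Corefile on stdin, print it without the
# hairpin-proxy lines (same sed expression as step 3 above).
strip_hairpin_rewrites() {
  sed '/# Added by hairpin-proxy/d'
}
```

For example, kubectl get cm -n kube-system coredns -o jsonpath='{.data.Corefile}' | hairpin_hosts lists the affected hostnames, and piping the same output through strip_hairpin_rewrites shows the Corefile as it would look after the patch.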

Heads-up on cert-manager 1.18 changes:

  • Certificate.spec.privateKey.rotationPolicy default changes from Never → Always. Every Cert without an explicit rotationPolicy: Never will rotate keys on next renewal. Standard HTTPS clients don't care; pinned-key clients (none expected here) would break.
  • Certificate.spec.revisionHistoryLimit default changes from unset → 1. Old CertificateRequest resources get garbage-collected.

Heads-up on hologit flaky CI: the Build k8s-manifests workflow intermittently fails with fatal: shallow file has changed since we read it — concurrent git fetch --depth=1 race in hologit's source-fetching code. Tracked at JarvusInnovations/hologit#450. Workaround: rerun the workflow. Retry count scales with the number of upstream sources — sandbox typically hit 1-2 retries; live needed 4 in #145 because it pulls more chart pins (each concurrent fetch has a small race window, so cumulative probability of at least one failure rises with source count).
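As a rough model of why the retry count scales, assume each of n source fetches independently hits the race with the same small probability p. Both symbols and the independence assumption are illustrative, not measured:

```latex
P(\text{at least one failure per run}) = 1 - (1 - p)^{n}
```

With a purely hypothetical p = 0.05, a run with n = 10 sources fails about 40% of the time (1 - 0.95^10 ≈ 0.40), while n = 30 fails about 79% of the time (1 - 0.95^30 ≈ 0.79), so several consecutive reruns become unsurprising as the source count grows.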

At the end of phase 1: cluster has new infrastructure (Envoy Gateway + new CRDs + cert-manager 1.20), hairpin-proxy is gone, traffic still flows through ingress-nginx, nothing visibly changes for users.

Phase 2: Set up a parallel cluster issuer

Goal: a new ClusterIssuer using the Gateway-native solver, leaving the existing letsencrypt-prod (with the ingress-nginx solver) untouched.

Add new ClusterIssuers, e.g. letsencrypt-prod-gateway and letsencrypt-staging-gateway, in _infra/cert-manager/issuers.yaml:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod-gateway
spec:
  acme:
    email: services@codeforphilly.org
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod-gateway
    solvers:
      - http01:
          gatewayHTTPRoute:
            parentRefs:
              - group: gateway.networking.k8s.io
                kind: Gateway
                name: main-gateway
                namespace: envoy-gateway-system

Why parallel and not mutate (this is the main lesson from the sandbox): mutating the existing letsencrypt-prod solver would change renewal behavior for all existing Certs, including the Ingress-managed ones. They'd still renew (via the new HTTPRoute solver if DNS reaches Envoy), but it couples paths in a way that's harder to reason about and harder to revert.

Existing Certs continue using letsencrypt-prod (the old solver). New Gateway-issued Certs use letsencrypt-prod-gateway. Clean separation.

Phase 3: Provision new Gateway + HTTPRoute resources for every domain

Goal: every public hostname has an Envoy Gateway path ready to receive traffic. No DNS changes yet — everything still flows through ingress-nginx in production.

Use _gateways/ central pile (one file per app). Each file:

---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: <app>
  namespace: <app>
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod-gateway
spec:
  gatewayClassName: eg
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: <host>
      tls:
        mode: Terminate
        certificateRefs:
          - name: <app>-gw-tls
      allowedRoutes:
        namespaces:
          from: Same
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: <app>
  namespace: <app>
spec:
  parentRefs:
    - name: <app>          # per-app HTTPS gateway only — HTTP handled by global redirect (phase 3.5)
  hostnames:
    - <host>
  rules:
    - backendRefs:
        - name: <backend-service>
          port: <port>

Important parentRefs detail: per-app HTTPRoutes attach only to the per-app HTTPS Gateway. They do NOT reference main-gateway. HTTP traffic on port 80 is handled by a single global redirect HTTPRoute (phase 3.5).

cert-manager auto-creates its own short-lived HTTPRoute per ACME challenge, with pathType: Exact on /.well-known/acme-challenge/<token>. That route attaches to both main-gateway and the per-app Gateway (the latter rejects because it has no HTTP listener — fine, the main-gateway attachment carries the validation traffic).

Domain inventory (from current Ingresses)

| Namespace | Hostname(s) | Notes |
| --- | --- | --- |
| balancer | balancerproject.org | apex domain, no wildcard |
| browserless-chrome | browserless-chrome.live.k8s.phl.io | wildcard-subdomain (if a wildcard exists) |
| chime | penn-chime.phl.io, penn-chime.live.k8s.phl.io | apex + subdomain |
| choose-native-plants | choose-native-plants.live.k8s.phl.io, choosenativeplants.com, www.choosenativeplants.com | apex + www + subdomain |
| code-for-philly | codeforphilly.org, www.codeforphilly.org, codeforphilly.live.k8s.phl.io | apex + www + subdomain |
| echo-http | echo-http.live.k8s.phl.io | subdomain |
| grafana | metrics.live.k8s.phl.io | subdomain |
| sealed-secrets | sealed-secrets.live.k8s.phl.io | subdomain |
| third-places | third-places.live.k8s.phl.io | subdomain |
| vaultwarden | vaultwarden.phl.io, bitwarden.phl.io | apex + alias |

For multi-hostname apps: one Gateway with multiple HTTPS listeners (one per hostname, each with its own cert), OR one listener with a multi-SAN cert. Per-hostname listener is the clean Gateway-API pattern.

Cert Secret naming: use <app>-gw-tls suffix to avoid collision with existing Ingress-managed Certs (<app>-tls). Both can coexist until each app's Ingress is removed.

Test with staging issuer first per app — issue via letsencrypt-staging-gateway to validate the whole path (Gateway provisioning, cert-manager solver HTTPRoute creation, ACME challenge routing). Avoids Let's Encrypt rate limits and gives a safe smoke test before producing real certs.
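One way to stamp out the per-app files without hand-editing placeholders is a small generator that fills in the template above and defaults to the staging issuer. This is a sketch only: render_gateway is a hypothetical helper, it covers the single-hostname case, and the backend Service name and port in the usage example are made up.

```shell
# render_gateway APP HOST BACKEND PORT [ISSUER]
# Prints the Gateway + HTTPRoute pair from the phase-3 template to stdout.
# ISSUER defaults to the staging issuer so the first pass is a smoke test.
render_gateway() {
  app=$1 host=$2 backend=$3 port=$4
  issuer=${5:-letsencrypt-staging-gateway}
  cat <<EOF
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: ${app}
  namespace: ${app}
  annotations:
    cert-manager.io/cluster-issuer: ${issuer}
spec:
  gatewayClassName: eg
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: ${host}
      tls:
        mode: Terminate
        certificateRefs:
          - name: ${app}-gw-tls
      allowedRoutes:
        namespaces:
          from: Same
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: ${app}
  namespace: ${app}
spec:
  parentRefs:
    - name: ${app}
  hostnames:
    - ${host}
  rules:
    - backendRefs:
        - name: ${backend}
          port: ${port}
EOF
}
```

Usage, writing to a workspace file rather than piping into kubectl: render_gateway echo-http echo-http.live.k8s.phl.io echo-http 80 > _gateways/echo-http.yaml. Once the staging cert validates, re-render with letsencrypt-prod-gateway as the fifth argument.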

Note on apex domain ACME challenges: cert-manager's HTTP-01 solver hits the apex hostname over port 80. For this to reach Envoy, the apex A record needs to point at Envoy's LB OR DNS needs a CNAME flattening that resolves through Envoy. If the apex still points at ingress-nginx (haven't migrated yet), the challenge will fail. So provisioning the new cert for an apex domain blocks on the DNS cutover for that domain. Plan order: migrate DNS for apex domains AS the cert is provisioned (DNS cutover and cert issuance happen together for apex domains, separately for wildcard-resolved subdomains).

At the end of phase 3: new Gateway resources exist, all <app>-gw-tls Certs are in Issuing / pending state with failed to perform self check ... EOF because cert-manager's in-cluster self-check resolves each hostname to whatever the current wildcard or apex A record points at — and at this point that's still ingress-nginx (no Ingress for the solver path → EOF). Per-hostname certs unblock as DNS cuts over in phase 4. (Note: on sandbox the wildcard was flipped to Envoy mid-migration so non-apex certs issued without per-hostname DNS work, but the safer pattern documented in phase 4 below is per-hostname cutover with the wildcard staying on ingress-nginx until phase 5.) No traffic moved.

Phase 3.5: Add HTTP→HTTPS redirect

Goal: force HTTPS for all traffic reaching Envoy's HTTP listener, without breaking ACME validation.

Add _infra/envoy-gateway/http-redirect.yaml:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: http-redirect
  namespace: envoy-gateway-system
spec:
  parentRefs:
    - name: main-gateway
  # No hostnames → matches anything reaching main-gateway's HTTP listener.
  # cert-manager's per-challenge HTTPRoute uses pathType: Exact on
  # /.well-known/acme-challenge/<token> and a hostname filter — both more
  # specific than this rule — so ACME validation traffic bypasses the
  # redirect and reaches the solver Pod.
  rules:
    - filters:
        - type: RequestRedirect
          requestRedirect:
            scheme: https
            statusCode: 301

How this stays compatible with ACME: Gateway API conflict resolution prefers more-specific HTTPRoutes. cert-manager's solver HTTPRoute has both a hostname filter AND pathType: Exact on the challenge URL — both more specific than this rule's "no hostname, default / prefix." So validation traffic always wins.

Safe to add anytime after phase 2 (parallel issuer exists). Doesn't depend on per-app Gateways being ready.

Phase 4: Async DNS migration per hostname

Goal: move each hostname to Envoy's LB IP independently, on each project team's timeline.

For each hostname:

  1. Verify the corresponding Gateway/HTTPRoute is healthy in the cluster: kubectl get gateway -n <app>, listener Programmed=True, HTTPRoute Accepted=True.
  2. Verify cert is issued (kubectl get cert <app>-gw-tls, Ready: True). For apex domains, this may not be possible until the DNS cutover happens — flip them simultaneously.
  3. Update the A record (or CNAME): point hostname at Envoy LB IP.
  4. Wait out the DNS TTL (typically 5 min).
  5. Test: curl https://<host>/ resolves through Envoy; cert is valid; backend responds. curl http://<host>/ returns 301 → https://<host>/.
  6. Don't delete the old Ingress yet — leave it as a fallback. The DNS now bypasses it, but if something breaks, reverting DNS routes back to ingress-nginx + old Ingress.
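The HTTP half of the step-5 test can be sketched as a check on response headers. check_redirect is a hypothetical helper; it assumes the redirect for a request to / carries a Location of https://<host>/, as described in step 5.

```shell
# check_redirect HOST
# Reads HTTP response headers on stdin (e.g. from curl -sI) and succeeds only
# if the status is 301 and the Location header starts with https://HOST/.
check_redirect() {
  host=$1
  tr -d '\r' | awk -v want="https://$host/" '
    NR == 1                    { code = $2 }   # status line: HTTP/x 301 ...
    tolower($1) == "location:" { loc = $2 }    # header names are case-insensitive
    END { exit !(code == "301" && index(loc, want) == 1) }
  '
}
```

For example: curl -sI http://<host>/ | check_redirect <host> && echo "redirect OK". The HTTPS half (valid cert, backend responding) still needs the plain curl https://<host>/ from step 5.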

For wildcard-covered subdomains (*.live.k8s.phl.io): the wildcard A record stays pointed at ingress-nginx. Per-hostname migration uses specific A records that override the wildcard for that hostname. (Standard DNS behavior — a specific record always wins over wildcard for the same name.)

Suggested order (low risk → high risk):

  1. echo-http, sealed-secrets — internal services, low-stakes
  2. metrics.live.k8s.phl.io (grafana), third-places, browserless-chrome — moderate
  3. choose-native-plants.live.k8s.phl.io, chime (subdomain first, then apex)
  4. codeforphilly.live.k8s.phl.io (subdomain) before codeforphilly.org + www
  5. Production apex domains last: balancerproject.org, vaultwarden.phl.io, bitwarden.phl.io, choosenativeplants.com, codeforphilly.org

For each apex domain, the cert can only issue after DNS points at Envoy (HTTP-01 challenge needs to reach Envoy). For these you may want a backout plan ready in case the new path has issues — keep DNS TTLs short during the cutover.

At the end of phase 4: every hostname's DNS points at Envoy; ingress-nginx still serves stale traffic from clients with cached DNS, but no fresh traffic.
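To confirm that end state, one option is to list every hostname whose resolved address is not yet the Envoy LB IP. pending_cutover is a hypothetical helper that works on "hostname address" pairs, so the DNS lookups stay separate from the comparison.

```shell
# pending_cutover ENVOY_IP
# Reads "hostname address" pairs on stdin and prints each hostname whose
# address differs from ENVOY_IP. A hostname with no address (lookup failed)
# is printed too, since it has not verifiably moved.
pending_cutover() {
  awk -v envoy="$1" 'NF >= 1 && $2 != envoy { print $1 }'
}
```

A possible way to feed it, assuming dig is available and ENVOY_IP holds the LB address: for h in <host>...; do printf '%s %s\n' "$h" "$(dig +short "$h" | tail -1)"; done | pending_cutover "$ENVOY_IP". Empty output means every listed hostname resolves to Envoy from that resolver's point of view.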

Phase 5: Disable per-app Ingresses

Goal: turn off Ingress generation at the source so the next phase removes them via GitOps in one shot.

For each app whose Helm chart/manifests live in this repo, set ingress.enabled: false in its release-values.yaml. For kustomize-managed apps (e.g. balancer), drop the Ingress resource from kustomization.yaml. For raw-YAML apps (e.g. echo-http), remove the Ingress doc.

Wait at least 1 hour after the last DNS cutover (longest plausible client DNS cache window) before doing this. The Ingresses stop being useful the moment DNS cuts over, but leaving them as fallback during that hour is cheap insurance.

Also identify any Ingresses managed by external CIs (e.g. preview-deploy workflows, app-side deploy pipelines). Those need coordination with each project team — they should ship a release that stops creating the Ingress. Once their CI is fixed, kubectl delete ingress the orphans.

On sandbox, the external-CI Ingresses were:

  • code-for-philly/latest, laddr/latest (laddr's emergence-site Helm chart)
  • codeforphilly-rewrite-sandbox/codeforphilly (rewrite project's own kustomize)

Live likely has similar setups. Tracked separately in #159 for sandbox.

Side effect for some charts: disabling the Ingress may also strip auxiliary config that read from ingress.hosts[0]. On sandbox this hit:

  • grafana: server.domain config (used for absolute URLs in emails etc.) — restore via grafana.ini.server.domain
  • metabase: MB_SITE_URL env var (auth callbacks, embeds) — restore via configs.metabase.MB_SITE_URL

Spot-check each chart's render diff after setting ingress.enabled: false and add direct hostname config if the chart was inferring it from the Ingress.
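A sketch of that spot-check, assuming you have two rendered manifests for the chart (one before and one after setting ingress.enabled: false, produced however this repo normally renders it). dropped_host_lines is a hypothetical helper built on diff and grep.

```shell
# dropped_host_lines BEFORE AFTER HOST
# Prints lines that mention HOST in BEFORE but no longer appear in AFTER.
# Lines from the Ingress itself will show up too; the ones to look for are
# config values (site URLs, domains) that were inferred from the Ingress.
dropped_host_lines() {
  diff "$1" "$2" | grep '^< ' | grep -F -- "$3" | sed 's/^< //'
}
```

For example, dropped_host_lines before.yaml after.yaml metrics.live.k8s.phl.io would surface a lost grafana server.domain value alongside the expected Ingress host lines.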

Phase 5.5: Decommission ingress-nginx via projection

Goal: remove ingress-nginx itself from the cluster.

Add an exclusion to .holo/branches/k8s-manifests/_civic-cloud.toml so the ingress-nginx files stop projecting:

[holomapping]
holosource = "=>k8s-blueprint-lke"
files = [
    "**",
    "!ingress-nginx/**",
    "!.holo/lenses/ingress-nginx.toml",
]
before = "*"

Why per-cluster, not upstream: cluster-template still ships ingress-nginx for other clusters that haven't migrated. The exclusion lives in each downstream cluster repo at decommission time.

Both excludes are needed: the first drops the helm chart input. The second drops the lens config — without it, the helm3 lens errors trying to read ingress-nginx/ from the (now empty) input tree.

On commit + deploy, 19 resources drop out:

  • Cluster-scoped: Namespace, 2× ClusterRole, 2× ClusterRoleBinding, ValidatingWebhookConfiguration, IngressClass
  • Namespace-scoped: Deployment, 2× Service (incl. the LoadBalancer — Linode frees the LB), 2× ServiceAccount, 2× Role, 2× RoleBinding, ConfigMap, 2× Job

If phase 5 was done in the same PR, the Ingress resources also disappear at the same time. Otherwise stale Ingresses sit as orphans referencing a non-existent IngressClass: inert, and safe to clean up later.

Phase 6: Final cleanup

  1. Delete the old Ingress-managed Certs (<app>-tls) once nothing references them. They'll stop renewing once cert-manager's letsencrypt-prod ClusterIssuer is gone, so they'd expire eventually anyway — but cleaning them up sooner is tidier.
  2. Delete the old letsencrypt-prod ClusterIssuer (or leave for posterity).
  3. Remove server-side-apply special handling for the gateway-api CRDs (if any) — they continue working via standard apply now.
  4. Audit for stale <app>-tls Secrets, orphan Certs from canceled Ingresses, etc.

Lessons from cfp-sandbox-cluster (do/don't list)

Do

  • Use parallel ClusterIssuer instead of mutating the existing one.
  • Use new Secret names (<app>-gw-tls) to avoid collision with existing Ingress-managed Certs.
  • Pre-populate central _gateways/ pile before any DNS cuts over. Per-app files delete cleanly when each project ships their own.
  • mergeGateways: true on EnvoyProxy is mandatory. Without it, every Gateway gets its own LB. Cost explodes.
  • Single global HTTP→HTTPS redirect HTTPRoute on main-gateway (phase 3.5). One resource, no per-app duplication, ACME still works via Gateway API conflict resolution.
  • Verify in-cluster DNS resolves externally after the phase 1 deploy.
  • Audit helm chart values that read from ingress.hosts (grafana, metabase on sandbox) and set the host directly before disabling the Ingress.
  • Test via kubectl apply -f <workspace-file> before merging — same content as GitOps, no drift, fast feedback. Don't kubectl apply -f - from heredocs.
  • Per-hostname HTTPS listeners, one per app's Gateway. cert-manager + the annotation does the rest.

Don't

  • Don't add main-gateway to per-app HTTPRoute parentRefs. The HTTP→HTTPS redirect on main-gateway handles all HTTP traffic. Per-app HTTPRoutes only attach to their per-app HTTPS Gateway.
  • Don't flip wildcard DNS in one shot. On sandbox this broke every hostname that didn't yet have an HTTPRoute. On live, per-hostname DNS cutover is much safer.
  • Don't delete HTTPRoutes manually after a kubectl apply — the apply already updated them. Deletion after means traffic interruption + cert-manager retries that hit transient routing windows. Trust the apply.
  • Don't issue gateway certs before hairpin-proxy is gone. This is automatic if you follow phases in order (phase 1 deploys → hairpin-proxy goes away → then phase 3 issues certs). The hazard is only if you go out of order and try to issue gateway certs while hairpin-proxy is still routing in-cluster DNS through ingress-nginx — cert-manager's HTTP-01 self-check uses cluster DNS and will fail with "wrong status code '404'" or "got: <!doctype html>" (the app backend's HTML instead of the token).
  • Don't use --depth=1 plus high source counts without expecting the hologit fetch race. Reruns are part of life until JarvusInnovations/hologit#450 lands.
  • Don't expect ListenerSet to be useful yet. Envoy Gateway v1.7.3 doesn't reconcile ListenerSet resources (lands in v1.8). Stick with per-project Gateways + mergeGateways.
  • Don't refresh hologit sources with raw git fetch <url> <refspec> — it auto-pulls upstream tags into local refs/tags/, polluting your tag namespace. Use git holo source fetch <name> instead.

Open questions

  • Apex domain ACME: confirm Let's Encrypt can validate against apex domains via Envoy. (Sandbox didn't have apex domains; only subdomain hosts. Live has many apex domains.) If something goes wrong, fallback is DNS-01 (one wildcard cert per zone, or per-name DNS-01).
  • Stuck-pod inventory for live — sandbox had two zombies we couldn't address (paws-data-pipeline missing paws-salesforce Secret, codeforphilly-rewrite-sandbox in ImagePullBackOff). Audit live for similar conditions before cutting over so they don't get blamed on the migration.
  • External-CI-managed Ingresses on live — survey upfront which apps' Ingresses come from external CI pipelines (likely laddr, possibly others). Each needs source-side coordination, not just a values change in this repo.
