Migrate to Envoy Gateway: ingress-nginx → Gateway API + cert-manager gatewayHTTPRoute

# Plan for migrating cfp-live-cluster off ingress-nginx onto Envoy Gateway

ingress-nginx is EOL (March 2026). The sister sandbox cluster (cfp-sandbox-cluster) just completed the equivalent migration; this issue captures the staged process for live, incorporating what worked and what we'd do differently.

> **Updated 2026-05-18** after sandbox migration fully landed. Target civic-cloud is **v1.9.2** which includes hairpin-proxy removal — no separate decommission needed (was a phase 2.5 in earlier draft). Per-app HTTPRoute parentRefs simplified. New phase 3.5 for HTTP→HTTPS redirect. New phase 5.5 for projection-driven decommission of ingress-nginx.
>
> **Updated 2026-05-18 (post-deploy)** after live's phases 1–3.5 deployed in [#145](https://github.com/CodeForPhilly/cfp-live-cluster/pull/145). Added required manual coredns cleanup to phase 1 (hairpin-proxy-controller patches kube-system coredns ConfigMap at runtime — deletion of the controller doesn't undo what it already wrote). Sharpened the post-deploy DNS verification expectation. Bumped hologit-flake retry expectation for repos with many upstream sources.

## Phase 1: Update cluster-template and prereqs

**Goal**: get the Gateway API foundation in place without affecting current traffic.

1. Bump civic-cloud source ref in this repo from `v1.7.7` → **`v1.9.2`** (latest). This brings:
   - cert-manager **1.13.3 → 1.20.2** (Gateway API integration enabled, `ListenerSets` feature gate enabled)
   - Gateway API **v1.5.1** CRDs (standard channel)
   - Envoy Gateway **v1.7.3** controller (installs to `envoy-gateway-system`)
   - **hairpin-proxy removed** from the LKE blueprint (see note below)
   - Server-side apply for CRDs in the deploy workflow (necessary — gateway-api CRDs are too large for client-side apply)
2. Add `_infra/envoy-gateway/` directory with three files (copy from cfp-sandbox-cluster):
   - `gatewayclass.yaml` — `GatewayClass eg`
   - `envoyproxy.yaml` — `EnvoyProxy shared`, `mergeGateways: true` (essential: one LoadBalancer for all Gateways)
   - `main-gateway.yaml` — `Gateway main-gateway` with HTTP catchall listener, `allowedRoutes.namespaces.from: All` (cert-manager solver target + HTTP path for HTTP→HTTPS redirect)

### Note on hairpin-proxy

The v1.9.2 bump drops hairpin-proxy from the GitOps projection — the deploy workflow's "Apply manifests: deleted resources" step will remove the controller Deployment, haproxy Deployment, RBAC, namespace, and the `coredns-custom` ConfigMap.

**Why it goes away**: Linode LKE now supports LoadBalancer hairpin natively (in-cluster pods can reach the cluster's own LB external IP). hairpin-proxy was the workaround for that limitation. Even more important: with ingress-nginx phasing out, hairpin-proxy's haproxy backend is hardcoded to ingress-nginx — so it was actively misrouting in-cluster traffic away from Envoy and breaking cert-manager HTTP-01 self-checks (this bit us on sandbox before phase 3).

**Verification after phase 1 deploys**: confirm in-cluster DNS now resolves to the external Envoy LB IP, not a ClusterIP:

```bash
KUBECONFIG=~/.kube/cfp-live-cluster-kubeconfig.yaml kubectl run -i --rm test-dns \
  --image=curlimages/curl --restart=Never --quiet --timeout=15s \
  -- sh -c 'nslookup any-live-hostname.live.k8s.phl.io 2>&1 | tail -4'
```

Expect the Address line to be a **public LB IP**: the existing ingress-nginx LB IP if you haven't cut DNS over yet, or the new Envoy LB IP (listed by `kubectl get svc -n envoy-gateway-system`) once a hostname's DNS has moved. The check confirms the answer is NOT a `10.x.x.x` ClusterIP (which would mean hairpin-proxy is still intercepting) and NOT NXDOMAIN (which would mean stale rewrites; see the required cleanup below).
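The expectation above can be turned into a pass/fail check instead of eyeballing the output. A minimal sketch, assuming the resolver answer has already been reduced to a bare IPv4 address and that anything in RFC 1918 space should be treated as a suspected ClusterIP; `classify_address` is a hypothetical helper name, not part of kubectl or any other tool.

```bash
# Classify one resolved IPv4 address (or an empty string for "no answer").
# Hypothetical helper: RFC 1918 ranges stand in for "looks like a ClusterIP".
classify_address() {
  case "$1" in
    "")                                     echo "nxdomain" ;;  # no answer: stale rewrites suspected
    10.*|192.168.*)                         echo "private" ;;   # hairpin-proxy may still be intercepting
    172.1[6-9].*|172.2[0-9].*|172.3[01].*)  echo "private" ;;
    *)                                      echo "public" ;;    # expected: an LB IP
  esac
}

# Usage, with $addr taken from the Address line of the nslookup output:
#   classify_address "$addr"
```

Only `public` is the healthy result; either of the other two points back at hairpin-proxy leftovers.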

**Required cleanup after phase 1 deploys — strip stale rewrites from kube-system coredns** ⚠️

The hairpin-proxy-controller continuously patches the `kube-system/coredns` ConfigMap at runtime, inserting `rewrite name <host> hairpin-proxy.hairpin-proxy.svc.cluster.local # Added by hairpin-proxy` lines for each public hostname into the main Corefile. **Deleting the controller via GitOps does NOT remove what it already wrote** — those patches survive in the live ConfigMap and now point at a deleted Service.

On live this left 14 dead rewrites after the phase-1 deploy. Symptom: in-cluster DNS for those hostnames briefly returns NXDOMAIN; even after it settles, the Corefile is full of stale dead-end rewrites that should be cleaned up.

The `coredns-custom` ConfigMap that gets deleted by the projection is a **separate** file (loaded via `import custom/*.include`) — that one is correctly removed by the GitOps deletion step. The runtime patches into the main Corefile are what need manual cleanup.

```bash
# Export so every kubectl call below uses this cluster; replace <cluster> first
export KUBECONFIG="$HOME/.kube/<cluster>-kubeconfig.yaml"
# 1. Confirm how many stale rewrites are present
kubectl get cm -n kube-system coredns -o jsonpath='{.data.Corefile}' | grep -c 'Added by hairpin-proxy'

# 2. Back up first
kubectl get cm -n kube-system coredns -o yaml > coredns-cm-backup.yaml

# 3. Patch out the rewrites
CLEANED=$(kubectl get cm -n kube-system coredns -o jsonpath='{.data.Corefile}' | sed '/# Added by hairpin-proxy/d')
kubectl patch cm -n kube-system coredns --type=merge -p "$(jq -nc --arg c "$CLEANED" '{data:{Corefile:$c}}')"

# 4. Wait ~30s for coredns reload plugin to pick it up (no pod restart needed)
sleep 30 && kubectl get cm -n kube-system coredns -o jsonpath='{.data.Corefile}' | grep -c 'Added by hairpin-proxy'  # should print 0
```

The `reload` plugin in the standard LKE Corefile hot-swaps the config without restarting coredns pods. Verify by re-running the DNS test from the previous step.
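Before patching the live ConfigMap, the same `sed` filter can be previewed against a saved copy of the Corefile. A minimal sketch: the two function names and the `Corefile.before` / `Corefile.after` file names are hypothetical, and the filter is the same marker match used in the cleanup commands.

```bash
# Remove every line carrying the hairpin-proxy marker; all other lines pass through.
strip_hairpin_rewrites() {
  sed '/# Added by hairpin-proxy/d'
}

# Count marker lines on stdin. Prints 0 when clean, without grep's non-zero exit.
count_hairpin_rewrites() {
  grep -c 'Added by hairpin-proxy' || true
}

# Preview against the live ConfigMap before patching (not run here):
#   kubectl get cm -n kube-system coredns -o jsonpath='{.data.Corefile}' > Corefile.before
#   strip_hairpin_rewrites < Corefile.before > Corefile.after
#   diff Corefile.before Corefile.after   # should list only the removed rewrite lines
```

If the diff shows anything other than removed `rewrite name ...` lines, stop and inspect the Corefile by hand rather than patching.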

**Heads-up on cert-manager 1.18 changes**:

- `Certificate.spec.privateKey.rotationPolicy` default changes from `Never` → `Always`. Every Cert without an explicit `rotationPolicy: Never` will rotate keys on next renewal. Standard HTTPS clients don't care; pinned-key clients (none expected here) would break.
- `Certificate.spec.revisionHistoryLimit` default changes from unset → `1`. Old `CertificateRequest` resources get garbage-collected.
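To see which Certificates will pick up the new `rotationPolicy` default, the unset ones can be listed ahead of the bump. A minimal sketch, relying on kubectl printing `<none>` in a custom column when the field is absent; `certs_without_rotation_policy` is a hypothetical helper name.

```bash
# Print "namespace/name" for every Certificate row whose POLICY column is unset.
# Input: rows from `kubectl get ... -o custom-columns=... --no-headers` on stdin.
certs_without_rotation_policy() {
  awk '$3 == "<none>" { print $1 "/" $2 }'
}

# Usage against a cluster (not run here):
#   kubectl get certificate -A --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,POLICY:.spec.privateKey.rotationPolicy' \
#     | certs_without_rotation_policy
```

Anything listed will rotate its private key on the next renewal unless `rotationPolicy: Never` is set explicitly.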

**Heads-up on hologit flaky CI**: the `Build k8s-manifests` workflow intermittently fails with `fatal: shallow file has changed since we read it` — concurrent `git fetch --depth=1` race in hologit's source-fetching code. Tracked at [JarvusInnovations/hologit#450](https://github.com/JarvusInnovations/hologit/issues/450). Workaround: rerun the workflow. **Retry count scales with the number of upstream sources** — sandbox typically hit 1-2 retries; live needed 4 in [#145](https://github.com/CodeForPhilly/cfp-live-cluster/pull/145) because it pulls more chart pins (each concurrent fetch has a small race window, so cumulative probability of at least one failure rises with source count).
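The scaling claim can be made concrete with a little arithmetic. If each of `n` source fetches fails independently with probability `p`, a build succeeds with probability `(1 - p)^n`, so the expected number of attempts until one succeeds is `1 / (1 - p)^n`. A minimal sketch; the values of `p` and `n` below are hypothetical, not measured on either cluster.

```bash
# Expected number of workflow attempts until one succeeds, assuming each of
# n source fetches fails independently with probability p (both hypothetical).
expected_attempts() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1 / ((1 - p) ^ n) }'
}

expected_attempts 0.05 5    # fewer upstream sources
expected_attempts 0.05 25   # more upstream sources, same per-fetch risk
```

The independence assumption is a simplification, but it shows why a repo with more chart pins needs noticeably more reruns at the same per-fetch failure rate.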

At the end of phase 1: cluster has new infrastructure (Envoy Gateway + new CRDs + cert-manager 1.20), hairpin-proxy is gone, traffic still flows through ingress-nginx, nothing visibly changes for users.

## Phase 2: Set up a parallel cluster issuer

**Goal**: a new ClusterIssuer using the Gateway-native solver, leaving the existing `letsencrypt-prod` (with the ingress-nginx solver) untouched.

Add new ClusterIssuers, e.g. `letsencrypt-prod-gateway` and `letsencrypt-staging-gateway`, in `_infra/cert-manager/issuers.yaml`:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod-gateway
spec:
  acme:
    email: services@codeforphilly.org
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod-gateway
    solvers:
      - http01:
          gatewayHTTPRoute:
            parentRefs:
              - group: gateway.networking.k8s.io
                kind: Gateway
                name: main-gateway
                namespace: envoy-gateway-system
```

**Why parallel and not mutate** (this is the main lesson from the sandbox): mutating the existing `letsencrypt-prod` solver would change renewal behavior for **all** existing Certs, including the Ingress-managed ones. They'd still renew (via the new HTTPRoute solver if DNS reaches Envoy), but it couples paths in a way that's harder to reason about and harder to revert.

Existing Certs continue using `letsencrypt-prod` (the old solver). New Gateway-issued Certs use `letsencrypt-prod-gateway`. Clean separation.

## Phase 3: Provision new Gateway + HTTPRoute resources for every domain

**Goal**: every public hostname has an Envoy Gateway path ready to receive traffic. No DNS changes yet — everything still flows through ingress-nginx in production.

Use the central `_gateways/` pile (one file per app). Each file contains:

```yaml
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: <app>
  namespace: <app>
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod-gateway
spec:
  gatewayClassName: eg
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: <host>
      tls:
        mode: Terminate
        certificateRefs:
          - name: <app>-gw-tls
      allowedRoutes:
        namespaces:
          from: Same
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: <app>
  namespace: <app>
spec:
  parentRefs:
    - name: <app>          # per-app HTTPS gateway only — HTTP handled by global redirect (phase 3.5)
  hostnames:
    - <host>
  rules:
    - backendRefs:
        - name: <backend-service>
          port: <port>
```

**Important parentRefs detail**: per-app HTTPRoutes attach **only** to the per-app HTTPS Gateway. They do NOT reference `main-gateway`. HTTP traffic on port 80 is handled by a single global redirect HTTPRoute (phase 3.5).

cert-manager auto-creates its own short-lived HTTPRoute per ACME challenge, with an `Exact` path match (`path.type: Exact`) on `/.well-known/acme-challenge/<token>`. That route attaches to both main-gateway and the per-app Gateway. The latter rejects it because it has no HTTP listener, which is fine: the main-gateway attachment carries the validation traffic.

### Domain inventory (from current Ingresses)

| Namespace | Hostname(s) | Notes |
|---|---|---|
| balancer | balancerproject.org | apex domain, no wildcard |
| browserless-chrome | browserless-chrome.live.k8s.phl.io | wildcard-subdomain (if a wildcard exists) |
| chime | penn-chime.phl.io, penn-chime.live.k8s.phl.io | apex + subdomain |
| choose-native-plants | choose-native-plants.live.k8s.phl.io, choosenativeplants.com, www.choosenativeplants.com | apex + www + subdomain |
| code-for-philly | codeforphilly.org, www.codeforphilly.org, codeforphilly.live.k8s.phl.io | apex + www + subdomain |
| echo-http | echo-http.live.k8s.phl.io | subdomain |
| grafana | metrics.live.k8s.phl.io | subdomain |
| sealed-secrets | sealed-secrets.live.k8s.phl.io | subdomain |
| third-places | third-places.live.k8s.phl.io | subdomain |
| vaultwarden | vaultwarden.phl.io, bitwarden.phl.io | apex + alias |

For multi-hostname apps: one Gateway with multiple HTTPS listeners (one per hostname, each with its own cert), OR one listener with a multi-SAN cert. Per-hostname listener is the clean Gateway-API pattern.
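The per-hostname listener pattern can be sketched by generating the `listeners` entries for one Gateway from a list of hostnames. This is an illustration only: `render_listeners`, the `https-<n>` listener names, and the `<app>-gw-tls-<n>` Secret names are hypothetical conventions chosen so that each hostname gets its own cert, not something the sandbox repo is known to use.

```bash
# Emit one HTTPS listener per hostname for an app's Gateway.
# Listener and Secret names are an illustrative convention only.
render_listeners() {
  app="$1"; shift
  i=0
  for host in "$@"; do
    i=$((i + 1))
    cat <<EOF
    - name: https-$i
      protocol: HTTPS
      port: 443
      hostname: $host
      tls:
        mode: Terminate
        certificateRefs:
          - name: ${app}-gw-tls-$i
      allowedRoutes:
        namespaces:
          from: Same
EOF
  done
}

render_listeners code-for-philly codeforphilly.org www.codeforphilly.org
```

The output is meant to be pasted under `spec.listeners` in the app's file in `_gateways/`; the HTTPRoute then lists every hostname under `hostnames`.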

**Cert Secret naming**: use `<app>-gw-tls` suffix to avoid collision with existing Ingress-managed Certs (`<app>-tls`). Both can coexist until each app's Ingress is removed.

**Test with staging issuer first per app** — issue via `letsencrypt-staging-gateway` to validate the whole path (Gateway provisioning, cert-manager solver HTTPRoute creation, ACME challenge routing). Avoids Let's Encrypt rate limits and gives a safe smoke test before producing real certs.

**Note on apex domain ACME challenges**: HTTP-01 validation (cert-manager's self-check and then Let's Encrypt's own request) hits the apex hostname over port 80. For this to reach Envoy, the apex A record needs to point at Envoy's LB, or the DNS provider's CNAME flattening needs to resolve to Envoy. If the apex still points at ingress-nginx (not migrated yet), the challenge will fail. **So provisioning the new cert for an apex domain blocks on the DNS cutover for that domain.** Plan order: migrate DNS for apex domains AS the cert is provisioned (DNS cutover and cert issuance happen together for apex domains, separately for wildcard-resolved subdomains).

At the end of phase 3: new Gateway resources exist, all `<app>-gw-tls` Certs are in `Issuing` / pending state with `failed to perform self check ... EOF` because cert-manager's in-cluster self-check resolves each hostname to **whatever the current wildcard or apex A record points at** — and at this point that's still ingress-nginx (no Ingress for the solver path → EOF). Per-hostname certs unblock as DNS cuts over in phase 4. (Note: on sandbox the wildcard was flipped to Envoy mid-migration so non-apex certs issued without per-hostname DNS work, but the safer pattern documented in phase 4 below is per-hostname cutover with the wildcard staying on ingress-nginx until phase 5.) No traffic moved.

## Phase 3.5: Add HTTP→HTTPS redirect

**Goal**: force HTTPS for all traffic reaching Envoy's HTTP listener, without breaking ACME validation.

Add `_infra/envoy-gateway/http-redirect.yaml`:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: http-redirect
  namespace: envoy-gateway-system
spec:
  parentRefs:
    - name: main-gateway
  # No hostnames → matches anything reaching main-gateway's HTTP listener.
  # cert-manager's per-challenge HTTPRoute uses an Exact path match on
  # /.well-known/acme-challenge/<token> and a hostname filter. Both are
  # more specific than this rule, so ACME validation traffic bypasses the
  # redirect and reaches the solver Pod.
  rules:
    - filters:
        - type: RequestRedirect
          requestRedirect:
            scheme: https
            statusCode: 301
```

**How this stays compatible with ACME**: Gateway API precedence rules prefer the more specific match. cert-manager's solver HTTPRoute has both a hostname filter AND an `Exact` path match on the challenge URL, each more specific than this rule's "no hostname, default `/` prefix." So validation traffic always wins.

Safe to add anytime after phase 2 (parallel issuer exists). Doesn't depend on per-app Gateways being ready.

## Phase 4: Async DNS migration per hostname

**Goal**: move each hostname to Envoy's LB IP independently, on each project team's timeline.

For each hostname:

1. Verify the corresponding Gateway/HTTPRoute is healthy in the cluster: `kubectl get gateway -n <app>`, listener `Programmed=True`, HTTPRoute `Accepted=True`.
2. Verify the cert is issued (`kubectl get cert -n <app> <app>-gw-tls` shows `READY: True`). For apex domains, this may not be possible until the DNS cutover happens, so flip them simultaneously.
3. Update the A record (or CNAME): point the hostname at the Envoy LB IP.
4. Wait out the DNS TTL (typically 5 min).
5. Test: `curl https://<host>/` reaches Envoy; cert is valid; backend responds. `curl http://<host>/` returns `301 → https://<host>/`.
6. **Don't delete the old Ingress yet** — leave it as a fallback. The DNS now bypasses it, but if something breaks, reverting DNS routes back to ingress-nginx + old Ingress.
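Steps 3 to 5 can be checked mechanically after each cutover. A minimal sketch, assuming `dig` and `curl` are available on the operator's machine; `resolves_to`, `is_https_redirect`, `ENVOY_LB_IP`, and `HOST` are hypothetical names introduced here.

```bash
# True when the list of resolver answers includes the Envoy LB IP.
resolves_to() {
  lb_ip="$1"; shift
  for ip in "$@"; do
    [ "$ip" = "$lb_ip" ] && return 0
  done
  return 1
}

# True when "<status> <redirect_url>" is a 301 pointing at https://<host>/...
is_https_redirect() {
  host="$1"; status="$2"; target="$3"
  [ "$status" = "301" ] || return 1
  case "$target" in
    "https://$host"/*) return 0 ;;
    *)                 return 1 ;;
  esac
}

# Usage after a cutover (not run here; word splitting of $(...) is intentional):
#   resolves_to "$ENVOY_LB_IP" $(dig +short "$HOST" A)
#   is_https_redirect "$HOST" $(curl -s -o /dev/null -w '%{http_code} %{redirect_url}' "http://$HOST/")
```

Both functions only inspect values passed to them, so a failing check can be re-run with the raw `dig` or `curl` output to see which part was unexpected.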

For wildcard-covered subdomains (`*.live.k8s.phl.io`): the wildcard A record stays pointed at ingress-nginx. Per-hostname migration uses **specific A records that override the wildcard for that hostname**. (Standard DNS behavior — a specific record always wins over wildcard for the same name.)

**Suggested order** (low risk → high risk):

1. `echo-http`, `sealed-secrets` — internal services, low-stakes
2. `metrics.live.k8s.phl.io` (grafana), `third-places`, `browserless-chrome` — moderate
3. `choose-native-plants.live.k8s.phl.io`, `chime` (subdomain first, then apex)
4. `codeforphilly.live.k8s.phl.io` (subdomain) before `codeforphilly.org` + `www`
5. Production apex domains last: `balancerproject.org`, `vaultwarden.phl.io`, `bitwarden.phl.io`, `choosenativeplants.com`, `codeforphilly.org`

For each apex domain, the cert can only issue **after** DNS points at Envoy (HTTP-01 challenge needs to reach Envoy). For these you may want a backout plan ready in case the new path has issues — keep DNS TTLs short during the cutover.

At the end of phase 4: every hostname's DNS points at Envoy; ingress-nginx still serves stale traffic from clients with cached DNS, but no fresh traffic.

## Phase 5: Disable per-app Ingresses

**Goal**: turn off Ingress generation at the source so the next phase removes them via GitOps in one shot.

For each app whose Helm chart/manifests live in this repo, set `ingress.enabled: false` in its release-values.yaml. For kustomize-managed apps (e.g. balancer), drop the Ingress resource from `kustomization.yaml`. For raw-YAML apps (e.g. echo-http), remove the Ingress doc.

Wait at least 1 hour after the last DNS cutover (longest plausible client DNS cache window) before doing this. The Ingresses stop being useful the moment DNS cuts over, but leaving them as fallback during that hour is cheap insurance.

Also identify any Ingresses managed by **external CIs** (e.g. preview-deploy workflows, app-side deploy pipelines). Those need coordination with each project team — they should ship a release that stops creating the Ingress. Once their CI is fixed, `kubectl delete ingress` the orphans.

On sandbox, the external-CI Ingresses were:

- `code-for-philly/latest`, `laddr/latest` (laddr's emergence-site Helm chart)
- `codeforphilly-rewrite-sandbox/codeforphilly` (rewrite project's own kustomize)

Live likely has similar setups. Tracked separately in [#159](https://github.com/CodeForPhilly/cfp-sandbox-cluster/issues/159) for sandbox.
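One way to survey externally managed Ingresses is to diff what the cluster has against what this repo projects. A minimal sketch, assuming both lists are available as files of `namespace/name` lines; `external_ingresses`, `live.txt`, and `repo.txt` are hypothetical names, and how `repo.txt` is produced depends on how the projection is rendered locally.

```bash
# Print Ingresses present in the cluster but absent from this repo's projection.
# Both arguments are files of "namespace/name" lines; sorting is handled here.
external_ingresses() {
  live_sorted=$(mktemp); repo_sorted=$(mktemp)
  sort -u "$1" > "$live_sorted"
  sort -u "$2" > "$repo_sorted"
  comm -23 "$live_sorted" "$repo_sorted"   # lines found only in the live list
  rm -f "$live_sorted" "$repo_sorted"
}

# Usage (not run here):
#   kubectl get ingress -A --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name' \
#     | awk '{ print $1 "/" $2 }' > live.txt
#   external_ingresses live.txt repo.txt
```

Anything this prints is a candidate for coordination with the owning project team, not for immediate deletion.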

**Side effect for some charts**: disabling the Ingress may also strip auxiliary config that read from `ingress.hosts[0]`. On sandbox this hit:

- grafana: `server.domain` config (used for absolute URLs in emails etc.) — restore via `grafana.ini.server.domain`
- metabase: `MB_SITE_URL` env var (auth callbacks, embeds) — restore via `configs.metabase.MB_SITE_URL`

Spot-check each chart's render diff after setting `ingress.enabled: false` and add direct hostname config if the chart was inferring it from the Ingress.
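The spot-check can be narrowed to the lines that matter: rendered lines mentioning the hostname that disappear once the Ingress is disabled. A minimal sketch, assuming the chart output before and after the values change has been saved to two files; `lost_host_lines` and the file names are hypothetical.

```bash
# Print lines that mention HOST in the "before" render but are gone in the "after" render.
# Relies on the default diff output, where "<" marks lines present only in the first file.
lost_host_lines() {
  host="$1"
  diff "$2" "$3" | grep '^<' | grep -F "$host" || true
}

# Usage (not run here):
#   lost_host_lines metrics.live.k8s.phl.io rendered-before.yaml rendered-after.yaml
```

Lines from the Ingress itself are expected in the output; any other hit (a `domain` setting, a site URL env var) is config the chart was inferring from the Ingress and needs to be set directly.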

## Phase 5.5: Decommission ingress-nginx via projection

**Goal**: remove ingress-nginx itself from the cluster.

Add an exclusion to `.holo/branches/k8s-manifests/_civic-cloud.toml` so the ingress-nginx files stop projecting:

```toml
[holomapping]
holosource = "=>k8s-blueprint-lke"
files = [
    "**",
    "!ingress-nginx/**",
    "!.holo/lenses/ingress-nginx.toml",
]
before = "*"
```

**Why per-cluster, not upstream**: cluster-template still ships ingress-nginx for other clusters that haven't migrated. The exclusion lives in each downstream cluster repo at decommission time.

**Both excludes are needed**: the first drops the helm chart input. The second drops the lens config — without it, the helm3 lens errors trying to read `ingress-nginx/` from the (now empty) input tree.

On commit + deploy, 19 resources drop out:

- Cluster-scoped: Namespace, 2× ClusterRole, 2× ClusterRoleBinding, ValidatingWebhookConfiguration, IngressClass
- Namespace-scoped: Deployment, 2× Service (incl. the LoadBalancer — Linode frees the LB), 2× ServiceAccount, 2× Role, 2× RoleBinding, ConfigMap, 2× Job

If phase 5 was done in the same PR, the Ingress resources also disappear at the same time. Otherwise stale Ingresses sit as orphans referencing a non-existent IngressClass — inert, cleanup later.

## Phase 6: Final cleanup

1. Delete the old Ingress-managed Certs (`<app>-tls`) once nothing references them. They'll stop renewing once cert-manager's `letsencrypt-prod` ClusterIssuer is gone, so they'd expire eventually anyway — but cleaning them up sooner is tidier.
2. Delete the old `letsencrypt-prod` ClusterIssuer (or leave for posterity).
3. Remove server-side-apply special handling for the gateway-api CRDs (if any) — they continue working via standard apply now.
4. Audit for stale `<app>-tls` Secrets, orphan Certs from canceled Ingresses, etc.
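The Secret audit in step 4 can start from a name-based filter. A minimal sketch, assuming the naming convention from phase 3 (`<app>-tls` for Ingress-managed, `<app>-gw-tls` for Gateway-managed) holds across all namespaces; `legacy_tls_secrets` is a hypothetical helper name.

```bash
# From "namespace name" rows on stdin, print Secrets ending in -tls that are
# not the new -gw-tls ones, i.e. candidates left behind by removed Ingresses.
legacy_tls_secrets() {
  awk '$2 ~ /-tls$/ && $2 !~ /-gw-tls$/ { print $1 "/" $2 }'
}

# Usage against a cluster (not run here):
#   kubectl get secret -A --field-selector type=kubernetes.io/tls --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name' \
#     | legacy_tls_secrets
```

The result is a list of candidates only; confirm that no remaining Ingress, Gateway, or workload references a Secret before deleting it.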

## Lessons from cfp-sandbox-cluster (do/don't list)

### Do

- **Use parallel ClusterIssuer** instead of mutating the existing one.
- **Use new Secret names (`<app>-gw-tls`)** to avoid collision with existing Ingress-managed Certs.
- **Pre-populate central `_gateways/` pile** before any DNS cuts over. Per-app files delete cleanly when each project ships their own.
- **`mergeGateways: true`** on EnvoyProxy is mandatory. Without it, every Gateway gets its own LB. Cost explodes.
- **Single global HTTP→HTTPS redirect HTTPRoute** on main-gateway (phase 3.5). One resource, no per-app duplication, ACME still works via Gateway API conflict resolution.
- **Verify in-cluster DNS resolves externally** after the phase 1 deploy.
- **Audit helm chart values that read from `ingress.hosts`** (grafana, metabase on sandbox) and set the host directly before disabling the Ingress.
- **Test via `kubectl apply -f <workspace-file>`** before merging — same content as GitOps, no drift, fast feedback. Don't `kubectl apply -f -` from heredocs.
- **Per-hostname HTTPS listeners**, one per app's Gateway. cert-manager + the annotation does the rest.

### Don't

- **Don't add `main-gateway` to per-app HTTPRoute parentRefs.** The HTTP→HTTPS redirect on main-gateway handles all HTTP traffic. Per-app HTTPRoutes only attach to their per-app HTTPS Gateway.
- **Don't flip wildcard DNS in one shot.** On sandbox this broke every hostname that didn't yet have an HTTPRoute. On live, per-hostname DNS cutover is much safer.
- **Don't delete HTTPRoutes manually after a `kubectl apply`** — the apply already updated them. Deletion after means traffic interruption + cert-manager retries that hit transient routing windows. Trust the apply.
- **Don't issue gateway certs before hairpin-proxy is gone.** This is automatic if you follow phases in order (phase 1 deploys → hairpin-proxy goes away → then phase 3 issues certs). The hazard is only if you go out of order and try to issue gateway certs while hairpin-proxy is still routing in-cluster DNS through ingress-nginx — cert-manager's HTTP-01 self-check uses cluster DNS and will fail with "wrong status code '404'" or "got: <!doctype html>" (the app backend's HTML instead of the token).
- **Don't use `--depth=1` plus high source counts without expecting the hologit fetch race.** Reruns are part of life until [JarvusInnovations/hologit#450](https://github.com/JarvusInnovations/hologit/issues/450) lands.
- **Don't expect ListenerSet to be useful yet.** Envoy Gateway v1.7.3 doesn't reconcile ListenerSet resources (lands in v1.8). Stick with per-project Gateways + `mergeGateways`.
- **Don't refresh hologit sources with raw `git fetch <url> <refspec>`** — it auto-pulls upstream tags into local `refs/tags/`, polluting your tag namespace. Use `git holo source fetch <name>` instead.

## Open questions

- **Apex domain ACME**: confirm Let's Encrypt can validate against apex domains via Envoy. (Sandbox didn't have apex domains; only subdomain hosts. Live has many apex domains.) If something goes wrong, fallback is DNS-01 (one wildcard cert per zone, or per-name DNS-01).
- **Stuck-pod inventory for live** — sandbox had two zombies we couldn't address (paws-data-pipeline missing `paws-salesforce` Secret, codeforphilly-rewrite-sandbox in ImagePullBackOff). Audit live for similar conditions before cutting over so they don't get blamed on the migration.
- **External-CI-managed Ingresses on live** — survey upfront which apps' Ingresses come from external CI pipelines (likely laddr, possibly others). Each needs source-side coordination, not just a values change in this repo.

## References

- cfp-sandbox-cluster GitOps state: <https://github.com/CodeForPhilly/cfp-sandbox-cluster>
- Phase 1+2 PR (gateway bones + central HTTPRoutes): <https://github.com/CodeForPhilly/cfp-sandbox-cluster/pull/150>
- Phase 3 PR (per-app HTTPS + gatewayHTTPRoute solver): <https://github.com/CodeForPhilly/cfp-sandbox-cluster/pull/152>
- HTTP→HTTPS redirect PR: <https://github.com/CodeForPhilly/cfp-sandbox-cluster/pull/154>
- hairpin-proxy removal upstream: <https://github.com/JarvusInnovations/cluster-template/pull/63>
- civic-cloud bump pulling in hairpin-proxy removal: <https://github.com/civic-cloud/cluster-template/pull/22>
- civic-cloud bump in cfp-sandbox-cluster: <https://github.com/CodeForPhilly/cfp-sandbox-cluster/pull/156>
- ingress-nginx decommission + per-app Ingress disable PR: <https://github.com/CodeForPhilly/cfp-sandbox-cluster/pull/157>
- Orphan-Ingresses cleanup tracker (sandbox): <https://github.com/CodeForPhilly/cfp-sandbox-cluster/issues/159>
- Hologit shallow-fetch race: <https://github.com/JarvusInnovations/hologit/issues/450>

