Commit 269dbc6

ci(kubernetes): add HA e2e workflow (#1598)
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
1 parent 28ee296 commit 269dbc6

13 files changed

Lines changed: 162 additions & 14 deletions

.agents/skills/debug-openshell-cluster/SKILL.md

Lines changed: 9 additions & 0 deletions
@@ -138,6 +138,15 @@ kubectl -n openshell rollout status statefulset/openshell

 Look for failed installs, unexpected values, missing namespace, wrong image tag, TLS settings that do not match the registered endpoint, and scheduling failures.

+For HA or PostgreSQL-backed installs, also check the service-binding Secret and
+bundled PostgreSQL workload:
+
+```bash
+kubectl -n openshell get secret -l app.kubernetes.io/instance=openshell
+kubectl -n openshell get statefulset,pod,pvc -l app.kubernetes.io/instance=openshell
+kubectl -n openshell logs statefulset/openshell-postgres --tail=200
+```
+
 Check required Helm deployment secrets:

 ```bash

.agents/skills/helm-dev-environment/SKILL.md

Lines changed: 6 additions & 1 deletion
@@ -1,6 +1,6 @@
 ---
 name: helm-dev-environment
-description: Start up, tear down, and configure the local Kubernetes development environment for OpenShell. Uses k3d (Docker-backed k3s) + Skaffold + Helm. Covers cluster lifecycle, optional add-ons (Keycloak OIDC, Envoy Gateway), and port mappings. Trigger keywords - local k8s, local cluster, k3d, skaffold, helm dev, start cluster, stop cluster, tear down cluster, delete cluster, create cluster, helm:k3s, helm:skaffold, local dev environment, dev cluster, k8s dev, envoy gateway local, keycloak local.
+description: Start up, tear down, and configure the local Kubernetes development environment for OpenShell. Uses k3d (Docker-backed k3s) + Skaffold + Helm. Covers cluster lifecycle, optional add-ons (Keycloak OIDC, Envoy Gateway), HA testing, and port mappings. Trigger keywords - local k8s, local cluster, k3d, skaffold, helm dev, start cluster, stop cluster, tear down cluster, delete cluster, create cluster, helm:k3s, helm:skaffold, local dev environment, dev cluster, k8s dev, envoy gateway local, keycloak local, high availability, HA.
 ---

 # Helm Dev Environment
@@ -65,6 +65,10 @@ generates mTLS secrets on first install. Envoy Gateway opt-in; see the Optional

 The gateway Service uses ClusterIP. Access is via Envoy Gateway (port `8080`) or `kubectl port-forward`.

+**HA test deploy** (two gateway replicas + bundled PostgreSQL): uncomment
+`#- ci/values-high-availability.yaml` in `deploy/helm/openshell/skaffold.yaml`,
+then run `mise run helm:skaffold:run` or `mise run helm:skaffold:dev`.
+
 ### TLS behaviour

 `ci/values-skaffold.yaml` sets `server.disableTls: true`, so Skaffold-based deploys run
@@ -198,6 +202,7 @@ mise run helm:k3s:status
 | `deploy/helm/openshell/ci/values-skaffold.yaml` | Dev overrides (image pull policy, TLS disabled for local Skaffold) |
 | `deploy/helm/openshell/ci/values-cert-manager.yaml` | cert-manager PKI overlay (opt-in; disables pkiInitJob) |
 | `deploy/helm/openshell/ci/values-gateway.yaml` | Envoy Gateway GRPCRoute + Gateway overlay |
+| `deploy/helm/openshell/ci/values-high-availability.yaml` | HA test overlay (`replicaCount: 2` with bundled PostgreSQL) |
 | `deploy/helm/openshell/ci/values-keycloak.yaml` | Keycloak OIDC overlay |
 | `deploy/helm/openshell/ci/values-tls-disabled.yaml` | Lint-only: TLS + auth disabled (reverse-proxy edge termination) |
 | `deploy/kube/manifests/envoy-gateway-openshell.yaml` | GatewayClass for Envoy Gateway (`mise run helm:gateway:apply`) |
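
Reviewer note: the HA test deploy step added to `helm-dev-environment/SKILL.md` asks the developer to uncomment the overlay entry in `skaffold.yaml` by hand. A minimal sketch of scripting that step follows. The helper name `uncomment_ha_overlay` is hypothetical, and the sketch assumes the entry sits on its own line exactly as `#- ci/values-high-availability.yaml` with optional leading indentation; the real layout of `skaffold.yaml` is not shown in this diff.

```bash
# Hypothetical helper: enable the HA overlay entry in a Skaffold config.
# Assumes the commented entry is on its own line; indentation is preserved.
uncomment_ha_overlay() {
  local file="$1" tmp
  tmp="$(mktemp)"
  # Capture the leading whitespace and drop only the '#' before the list dash.
  sed 's|^\([[:space:]]*\)#- ci/values-high-availability\.yaml|\1- ci/values-high-availability.yaml|' \
    "$file" > "$tmp"
  # Write back in place so the original file's permissions are kept.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Usage (path taken from the SKILL.md text):
#   uncomment_ha_overlay deploy/helm/openshell/skaffold.yaml
#   mise run helm:skaffold:run
```

Lines that do not match are left untouched, so running the helper twice is harmless.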

.github/workflows/branch-e2e.yml

Lines changed: 43 additions & 2 deletions
@@ -23,6 +23,7 @@ jobs:
 should_run: ${{ steps.gate.outputs.should_run }}
 run_core_e2e: ${{ steps.labels.outputs.run_core_e2e }}
 run_gpu_e2e: ${{ steps.labels.outputs.run_gpu_e2e }}
+run_kubernetes_ha_e2e: ${{ steps.labels.outputs.run_kubernetes_ha_e2e }}
 run_any_e2e: ${{ steps.labels.outputs.run_any_e2e }}
 steps:
 - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
@@ -39,24 +40,27 @@ jobs:
 if [ "$EVENT_NAME" != "push" ]; then
 run_core_e2e=true
 run_gpu_e2e=true
+run_kubernetes_ha_e2e=true
 else
 run_core_e2e="$(jq -r 'index("test:e2e") != null' <<< "$LABELS_JSON")"
 run_gpu_e2e="$(jq -r 'index("test:e2e-gpu") != null' <<< "$LABELS_JSON")"
+run_kubernetes_ha_e2e="$(jq -r 'index("test:e2e-kubernetes") != null' <<< "$LABELS_JSON")"
 fi
-if [ "$run_core_e2e" = "true" ] || [ "$run_gpu_e2e" = "true" ]; then
+if [ "$run_core_e2e" = "true" ] || [ "$run_gpu_e2e" = "true" ] || [ "$run_kubernetes_ha_e2e" = "true" ]; then
 run_any_e2e=true
 else
 run_any_e2e=false
 fi
 {
 echo "run_core_e2e=$run_core_e2e"
 echo "run_gpu_e2e=$run_gpu_e2e"
+echo "run_kubernetes_ha_e2e=$run_kubernetes_ha_e2e"
 echo "run_any_e2e=$run_any_e2e"
 } >> "$GITHUB_OUTPUT"

 build-gateway:
 needs: [pr_metadata]
-if: needs.pr_metadata.outputs.should_run == 'true' && needs.pr_metadata.outputs.run_core_e2e == 'true'
+if: needs.pr_metadata.outputs.should_run == 'true' && (needs.pr_metadata.outputs.run_core_e2e == 'true' || needs.pr_metadata.outputs.run_kubernetes_ha_e2e == 'true')
 permissions:
 contents: read
 packages: write
@@ -107,6 +111,16 @@ jobs:
 with:
 image-tag: ${{ github.sha }}

+kubernetes-ha-e2e:
+needs: [pr_metadata, build-gateway, build-supervisor]
+if: needs.pr_metadata.outputs.should_run == 'true' && needs.pr_metadata.outputs.run_kubernetes_ha_e2e == 'true'
+permissions:
+contents: read
+packages: read
+uses: ./.github/workflows/e2e-kubernetes-ha-test.yml
+with:
+image-tag: ${{ github.sha }}
+
 core-e2e-result:
 name: Core E2E result
 needs: [pr_metadata, build-gateway, build-supervisor, e2e, kubernetes-e2e]
@@ -160,3 +174,30 @@ jobs:
 fi
 done
 exit "$failed"
+
+kubernetes-ha-e2e-result:
+name: Kubernetes HA E2E result
+needs: [pr_metadata, build-gateway, build-supervisor, kubernetes-ha-e2e]
+if: always() && needs.pr_metadata.outputs.should_run == 'true' && needs.pr_metadata.outputs.run_kubernetes_ha_e2e == 'true'
+runs-on: ubuntu-latest
+steps:
+- name: Verify Kubernetes HA E2E jobs
+env:
+BUILD_GATEWAY_RESULT: ${{ needs.build-gateway.result }}
+BUILD_SUPERVISOR_RESULT: ${{ needs.build-supervisor.result }}
+KUBERNETES_HA_E2E_RESULT: ${{ needs.kubernetes-ha-e2e.result }}
+run: |
+set -euo pipefail
+failed=0
+for item in \
+"build-gateway:$BUILD_GATEWAY_RESULT" \
+"build-supervisor:$BUILD_SUPERVISOR_RESULT" \
+"kubernetes-ha-e2e:$KUBERNETES_HA_E2E_RESULT"; do
+name="${item%%:*}"
+result="${item#*:}"
+if [ "$result" != "success" ]; then
+echo "::error::$name concluded $result"
+failed=1
+fi
+done
+exit "$failed"
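
Reviewer note: the `kubernetes-ha-e2e-result` job above treats anything other than `success` (including `skipped` and `cancelled`) as a failure by splitting each `name:result` pair with shell parameter expansion. The standalone sketch below reproduces that loop so it can be tried locally. The function name `verify_results` is hypothetical; the workflow inlines the logic in its `run:` step.

```bash
# Hypothetical wrapper around the loop from the result job.
# Each argument is "job-name:result"; returns 1 if any result is not "success".
verify_results() {
  local failed=0 item name result
  for item in "$@"; do
    name="${item%%:*}"    # everything before the first colon
    result="${item#*:}"   # everything after the first colon
    if [ "$result" != "success" ]; then
      # Same annotation format the workflow emits for GitHub Actions logs.
      echo "::error::$name concluded $result"
      failed=1
    fi
  done
  return "$failed"
}

# Example: a skipped upstream job fails the gate.
verify_results "build-gateway:success" "kubernetes-ha-e2e:skipped" || echo "gate failed"
```

Because the job runs under `if: always()`, this check is what turns a skipped or cancelled dependency into a visible failure instead of a silently skipped result job.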

.github/workflows/e2e-kubernetes-ha-test.yml

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+name: Kubernetes HA E2E Test
+
+on:
+workflow_call:
+inputs:
+image-tag:
+description: "Image tag to test (typically the commit SHA)"
+required: true
+type: string
+runner:
+description: "GitHub Actions runner label"
+required: false
+type: string
+default: "linux-amd64-cpu8"
+checkout-ref:
+description: "Git ref to check out for test inputs (defaults to the workflow SHA)"
+required: false
+type: string
+default: ""
+
+permissions:
+contents: read
+packages: read
+
+jobs:
+e2e-kubernetes-ha:
+name: Kubernetes HA E2E
+permissions:
+contents: read
+packages: read
+uses: ./.github/workflows/e2e-kubernetes-test.yml
+secrets: inherit
+with:
+image-tag: ${{ inputs.image-tag }}
+runner: ${{ inputs.runner }}
+checkout-ref: ${{ inputs.checkout-ref }}
+extra-helm-values: deploy/helm/openshell/ci/values-high-availability.yaml

.github/workflows/e2e-kubernetes-test.yml

Lines changed: 6 additions & 0 deletions
@@ -17,6 +17,11 @@ on:
 required: false
 type: string
 default: ""
+extra-helm-values:
+description: "Colon-separated Helm values files to layer on the Kubernetes e2e chart install"
+required: false
+type: string
+default: ""

 permissions:
 contents: read
@@ -93,6 +98,7 @@ jobs:
 - name: Run Kubernetes E2E (Rust smoke)
 env:
 OPENSHELL_E2E_KUBE_CONTEXT: kind-${{ env.KIND_CLUSTER_NAME }}
+OPENSHELL_E2E_KUBE_EXTRA_VALUES: ${{ inputs.extra-helm-values }}
 IMAGE_TAG: ${{ inputs.image-tag }}
 OPENSHELL_REGISTRY: ghcr.io/nvidia/openshell
 run: mise run --no-deps --skip-deps e2e:kubernetes
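
Reviewer note: the new `extra-helm-values` input is described as a colon-separated list and reaches the test task as `OPENSHELL_E2E_KUBE_EXTRA_VALUES`. How the `e2e:kubernetes` task consumes it is not shown in this diff. The bash sketch below is only one plausible way to expand such a list into repeated Helm `--values` flags; the function name `helm_values_args` and the output array `HELM_VALUES_ARGS` are hypothetical.

```bash
# Hypothetical helper: split "a.yaml:b.yaml" into (--values a.yaml --values b.yaml).
# The result is stored in the global array HELM_VALUES_ARGS; empty entries are skipped.
helm_values_args() {
  local list="$1" file
  local -a files=()
  HELM_VALUES_ARGS=()
  if [ -z "$list" ]; then
    return 0   # the input's default is "", which adds no overlay
  fi
  # IFS is overridden only for this read, so the caller's IFS is untouched.
  IFS=':' read -r -a files <<< "$list"
  for file in "${files[@]}"; do
    if [ -n "$file" ]; then
      HELM_VALUES_ARGS+=(--values "$file")
    fi
  done
  return 0
}

# Example with the overlay path passed by the HA workflow.
helm_values_args "deploy/helm/openshell/ci/values-high-availability.yaml"
printf '%s\n' "${HELM_VALUES_ARGS[@]}"
```

Later `--values` files take precedence over earlier ones in Helm, so the order of entries in the list matters when overlays set the same key.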

.github/workflows/e2e-label-help.yml

Lines changed: 9 additions & 2 deletions
@@ -19,7 +19,7 @@ permissions: {}
 jobs:
 hint:
 name: Post next-step hint for E2E label
-if: github.event.label.name == 'test:e2e' || github.event.label.name == 'test:e2e-gpu'
+if: github.event.label.name == 'test:e2e' || github.event.label.name == 'test:e2e-gpu' || github.event.label.name == 'test:e2e-kubernetes'
 runs-on: ubuntu-latest
 permissions:
 pull-requests: write
@@ -43,10 +43,17 @@ jobs:
 test:e2e)
 suite_summary="the standard E2E suite"
 build_summary="gateway and supervisor images"
+status_summary="The matching required CI gate status on this PR will flip green automatically once the run finishes."
 ;;
 test:e2e-gpu)
 suite_summary="GPU E2E"
 build_summary="supervisor image"
+status_summary="The matching required CI gate status on this PR will flip green automatically once the run finishes."
+;;
+test:e2e-kubernetes)
+suite_summary="Kubernetes HA E2E"
+build_summary="gateway and supervisor images"
+status_summary="This is an optional proof-of-life suite; failures are visible in the workflow run but do not publish a required CI gate status."
 ;;
 *) echo "Unrecognized label $LABEL_NAME"; exit 1 ;;
 esac
@@ -69,7 +76,7 @@ jobs:
 workflow_link="[$workflow_name](https://github.com/$GH_REPO/actions/workflows/$workflow_file)"
 instructions="Open $workflow_link, find the run for commit \`$short_pr\`, and click **Re-run all jobs** to execute with the label set."
 fi
-body="Label \`$LABEL_NAME\` applied for \`$short_pr\`. $instructions The run will execute $suite_summary after building the required $build_summary once. The matching required CI gate status on this PR will flip green automatically once the run finishes."
+body="Label \`$LABEL_NAME\` applied for \`$short_pr\`. $instructions The run will execute $suite_summary after building the required $build_summary once. $status_summary"
 fi

 gh pr comment "$PR_NUMBER" --body "$body"

CI.md

Lines changed: 7 additions & 5 deletions
@@ -10,13 +10,15 @@ PR CI that runs on NVIDIA self-hosted runners uses NVIDIA's copy-pr-bot. The bot

 `Branch Checks` run automatically after copy-pr-bot mirrors the PR. `Required CI Gates` posts PR-head statuses that verify the mirror exists, is current, and ran the expected push-based workflows. E2E suites are opt-in because they are more expensive and publish temporary images.

-Two opt-in labels enable the long-running E2E suites:
+Three opt-in labels enable the long-running E2E suites:

 - `test:e2e` runs the standard E2E suite in `Branch E2E Checks`
 - `test:e2e-gpu` runs GPU E2E in `Branch E2E Checks`
+- `test:e2e-kubernetes` runs Kubernetes E2E with the HA Helm overlay
+(`replicaCount: 2` and bundled PostgreSQL) in `Branch E2E Checks`

-When both labels are present, `Branch E2E Checks` builds the shared gateway and supervisor images once and fans out all enabled suites in parallel.
-The `OpenShell / E2E` and `OpenShell / GPU E2E` required statuses are evaluated from separate suite result jobs inside that workflow, so the expensive GPU suite stays independently gated.
+When multiple labels are present, `Branch E2E Checks` builds the shared gateway and supervisor images once and fans out all enabled suites in parallel.
+The `OpenShell / E2E` and `OpenShell / GPU E2E` required statuses are evaluated from separate suite result jobs inside that workflow. `test:e2e-kubernetes` is optional while HA behavior is under active iteration: failures are visible in the workflow run but do not publish a required CI gate status.

 The GitHub ruleset should require the `OpenShell / ...` statuses published by `Required CI Gates`, not the push-triggered workflow jobs directly.

@@ -69,7 +71,7 @@ Flow:

 1. Open the PR. copy-pr-bot mirrors it to `pull-request/<N>` automatically.
 2. The mirror push runs `Branch Checks` automatically. `Required CI Gates` keeps the PR blocked until the mirror exists, matches the PR head SHA, and the required push-based workflow succeeds. The first `Branch E2E Checks` run only resolves metadata and skips expensive jobs unless an E2E label is already set.
-3. A maintainer applies `test:e2e` and/or `test:e2e-gpu`. `E2E Label Help` posts a comment with a link to the existing gated workflow run.
+3. A maintainer applies `test:e2e`, `test:e2e-gpu`, and/or `test:e2e-kubernetes`. `E2E Label Help` posts a comment with a link to the existing gated workflow run.
 4. The maintainer opens that link and clicks **Re-run all jobs**. This time `pr_metadata` sees the label and the build/E2E jobs run.
 5. When the run finishes, the matching `OpenShell / ...` gate status flips to green automatically.
 6. New commits push to the mirror automatically and re-trigger `Branch Checks` plus any labeled E2E jobs in `Branch E2E Checks`.
@@ -108,7 +110,7 @@ The bot's full administrator documentation is internal to NVIDIA. The only comma
 | File | Role |
 |---|---|
 | `.github/workflows/branch-checks.yml` | Required non-E2E PR checks. Triggers on `push: pull-request/[0-9]+`. |
-| `.github/workflows/branch-e2e.yml` | Opt-in standard and GPU E2E. Triggers on `push: pull-request/[0-9]+` and runs jobs selected by `test:e2e` / `test:e2e-gpu`. |
+| `.github/workflows/branch-e2e.yml` | Opt-in standard, GPU, and Kubernetes HA E2E. Triggers on `push: pull-request/[0-9]+` and runs jobs selected by `test:e2e`, `test:e2e-gpu`, or `test:e2e-kubernetes`. |
 | `.github/workflows/helm-lint.yml` | Helm chart validation. Triggers on `push: pull-request/[0-9]+` and skips lint jobs unless Helm inputs changed. |
 | `.github/actions/pr-gate/action.yml` | Composite action that resolves PR metadata and verifies the required label is set. |
 | `.github/actions/pr-merge-base/action.yml` | Composite action that resolves and fetches the merge-base commit for `pull-request/<N>` push workflows. |

CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
@@ -302,4 +302,4 @@ DCO sign-off is separate from cryptographic commit signing. CI requires signing

 ## CI

-How PR CI runs, the `test:e2e` / `test:e2e-gpu` labels, copy-pr-bot, and commit-signing setup are documented in [CI.md](CI.md).
+How PR CI runs, the `test:e2e`, `test:e2e-gpu`, and `test:e2e-kubernetes` labels, copy-pr-bot, and commit-signing setup are documented in [CI.md](CI.md).

deploy/helm/openshell/README.md

Lines changed: 1 addition & 0 deletions
@@ -56,6 +56,7 @@ See [`values.yaml`](values.yaml) for source defaults. Selected overlays:
 - [`ci/values-gateway.yaml`](ci/values-gateway.yaml) - gateway-only configuration
 - [`ci/values-cert-manager.yaml`](ci/values-cert-manager.yaml) - cert-manager integration
 - [`ci/values-keycloak.yaml`](ci/values-keycloak.yaml) - Keycloak OIDC integration
+- [`ci/values-high-availability.yaml`](ci/values-high-availability.yaml) - HA gateway test overlay with bundled PostgreSQL

 ### Database backend

deploy/helm/openshell/README.md.gotmpl

Lines changed: 1 addition & 0 deletions
@@ -56,6 +56,7 @@ See [`values.yaml`](values.yaml) for source defaults. Selected overlays:
 - [`ci/values-gateway.yaml`](ci/values-gateway.yaml) - gateway-only configuration
 - [`ci/values-cert-manager.yaml`](ci/values-cert-manager.yaml) - cert-manager integration
 - [`ci/values-keycloak.yaml`](ci/values-keycloak.yaml) - Keycloak OIDC integration
+- [`ci/values-high-availability.yaml`](ci/values-high-availability.yaml) - HA gateway test overlay with bundled PostgreSQL

 ### Database backend
