Commit 7dab612

feat(helm): support Deployment kind in HA gateway workloads (#1867)
* feat(helm): support Deployment kind in HA gateway workloads

  Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

* fix(helm): handle null workload values

  Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

---------

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
1 parent 4b44d62 commit 7dab612

18 files changed

Lines changed: 391 additions & 189 deletions

.agents/skills/debug-openshell-cluster/SKILL.md

Lines changed: 13 additions & 6 deletions
@@ -54,7 +54,7 @@ Use gateway metadata, deployment values, or the user's setup notes to identify t
 |---|---|
 | Docker | Gateway process logs, Docker daemon health, sandbox containers, image pulls. |
 | Podman | Podman socket, rootless networking, sandbox containers, image pulls. |
-| Kubernetes | Helm release, StatefulSet, service, secrets, sandbox pods, events. |
+| Kubernetes | Helm release, gateway workload, service, secrets, sandbox pods, events. |
 | VM | VM driver logs, rootfs availability, host virtualization support. |

 ### Step 3: Check Docker-Backed Gateways
@@ -131,12 +131,17 @@ Common findings:
 ```bash
 helm -n openshell status openshell
 helm -n openshell get values openshell
-kubectl -n openshell get statefulset,pod,svc,pvc
+kubectl -n openshell get deployment,statefulset,pod,svc,pvc
+kubectl -n openshell logs deployment/openshell -c openshell-gateway --tail=200
 kubectl -n openshell logs statefulset/openshell -c openshell-gateway --tail=200
+kubectl -n openshell rollout status deployment/openshell
 kubectl -n openshell rollout status statefulset/openshell
 ```

-Look for failed installs, unexpected values, missing namespace, wrong image tag, TLS settings that do not match the registered endpoint, and scheduling failures.
+Use the log and rollout commands for the workload kind that exists in the
+release. Look for failed installs, unexpected values, missing namespace, wrong
+image tag, TLS settings that do not match the registered endpoint, and
+scheduling failures.

 For HA or PostgreSQL-backed installs, also check the external database Secret
 referenced by `server.externalDbSecret` and the PostgreSQL workload if the test
@@ -169,7 +174,7 @@ Secrets but does not create the sandbox JWT signing Secret.

 If the gateway exits with `failed to read sandbox JWT signing key from
 /etc/openshell-jwt/signing.pem`, verify that `openshell-jwt-keys` contains
-`signing.pem`, `public.pem`, and `kid`, and that the StatefulSet mounts the
+`signing.pem`, `public.pem`, and `kid`, and that the gateway workload mounts the
 `sandbox-jwt` secret at `/etc/openshell-jwt`. The sandbox JWT mount is required
 even when local Helm values disable TLS.

@@ -194,8 +199,9 @@ label, supervisor env vars `OPENSHELL_K8S_SA_TOKEN_FILE` and
 Check the image references currently used by the gateway deployment:

 ```bash
+kubectl -n openshell get deployment openshell -o jsonpath="{.spec.template.spec.containers[*].image}{\"\n\"}{.spec.template.spec.containers[*].env[?(@.name==\"OPENSHELL_SUPERVISOR_IMAGE\")].value}{\"\n\"}"
 kubectl -n openshell get statefulset openshell -o jsonpath="{.spec.template.spec.containers[*].image}{\"\n\"}{.spec.template.spec.containers[*].env[?(@.name==\"OPENSHELL_SUPERVISOR_IMAGE\")].value}{\"\n\"}"
-helm -n openshell get values openshell | grep -E 'repository|tag|supervisorImage'
+helm -n openshell get values openshell | grep -E 'repository|tag|supervisorImage|workload'
 ```

 The gateway image built from `deploy/docker/Dockerfile.gateway` and the scratch supervisor image built from `deploy/docker/Dockerfile.supervisor` should use the same build tag in branch and E2E deploys. A stale supervisor image can make sandbox behavior lag behind gateway policy or proto changes.
@@ -238,6 +244,7 @@ If the gateway is healthy but sandbox creation fails:
 ```bash
 kubectl -n openshell get pods
 kubectl -n openshell get events --sort-by=.lastTimestamp | tail -n 50
+kubectl -n openshell logs deployment/openshell -c openshell-gateway --tail=200
 kubectl -n openshell logs statefulset/openshell -c openshell-gateway --tail=200
 ```

@@ -286,7 +293,7 @@ openshell logs <sandbox-name>
 | Docker or Podman sandbox never registers | Wrong callback endpoint or supervisor startup failure | Gateway logs and sandbox container logs |
 | Docker GPU e2e fails before GPU sandbox comparison | NVIDIA CDI specs are missing or Docker has not discovered them | `docker info --format '{{json .DiscoveredDevices}}'`, `/etc/cdi`, `/var/run/cdi`, `nvidia-cdi-refresh.service` |
 | Kubernetes gateway pod pending | PVC unbound, taint, selector, or insufficient resources | `kubectl -n openshell describe pod <pod>` |
-| Kubernetes gateway pod crash loops | Missing secret, bad DB URL, bad TLS config | `kubectl -n openshell logs statefulset/openshell -c openshell-gateway` |
+| Kubernetes gateway pod crash loops | Missing secret, bad DB URL, bad TLS config | `kubectl -n openshell logs deployment/openshell -c openshell-gateway` or `kubectl -n openshell logs statefulset/openshell -c openshell-gateway` |
 | CLI TLS error | Local mTLS bundle does not match server cert/CA | Check `~/.config/openshell/gateways/<name>/mtls/` |
 | Image pull failure | Gateway or sandbox image cannot be pulled | Runtime events and image pull credentials |
 | `K8s namespace not ready` with `envoy-gateway-openshell.yaml: the server could not find the requested resource` | Optional Gateway API manifest was applied without Envoy Gateway CRDs, or k3s Helm controller startup exceeded the namespace wait | Apply `deploy/kube/manifests/envoy-gateway-openshell.yaml` manually only after Envoy Gateway is installed and `grpcRoute` is enabled |
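
The updated skill text asks the reader to use the log and rollout commands for whichever workload kind exists in the release. A minimal sketch of that selection as a shell helper is shown here. The function name `gateway_workload_ref` and the `KUBECTL` override variable are hypothetical, introduced only for illustration; they are not part of this commit or of the OpenShell CLI.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): print "deployment/NAME" or
# "statefulset/NAME" for whichever gateway workload exists in the namespace.
# KUBECTL is an assumed override hook so the command can be swapped out.
KUBECTL="${KUBECTL:-kubectl}"

gateway_workload_ref() {
  local ns="$1" name="$2" kind
  for kind in deployment statefulset; do
    # "kubectl get <kind> <name>" exits non-zero when the object is absent.
    if "$KUBECTL" -n "$ns" get "$kind" "$name" >/dev/null 2>&1; then
      printf '%s/%s\n' "$kind" "$name"
      return 0
    fi
  done
  return 1
}
```

With the helper sourced, a single log command could be written as `kubectl -n openshell logs "$(gateway_workload_ref openshell openshell)" -c openshell-gateway --tail=200`. The helper checks Deployment first, which is an arbitrary choice; a release is expected to render only one of the two kinds.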

.agents/skills/openshell-cli/SKILL.md

Lines changed: 3 additions & 2 deletions
@@ -495,8 +495,9 @@ openshell gateway remove local # Remove local registrati
 ```bash
 # Inspect a Kubernetes Helm release and gateway pod
 helm -n openshell status openshell
-kubectl -n openshell get pods,svc
-kubectl -n openshell logs statefulset/openshell --tail=100
+kubectl -n openshell get deployment,statefulset,pods,svc
+kubectl -n openshell logs deployment/openshell -c openshell-gateway --tail=100
+kubectl -n openshell logs statefulset/openshell -c openshell-gateway --tail=100
 ```

 For Docker, Podman, and VM-backed gateways, inspect the gateway process or container logs and the selected runtime directly.

architecture/compute-runtimes.md

Lines changed: 5 additions & 2 deletions
@@ -97,8 +97,11 @@ runtime still owns GPU device injection.
 ## Deployment Shape

 Kubernetes deployments use the Helm chart under `deploy/helm/openshell`. The
-chart deploys the gateway and sandbox runtime integration, but HA deployments
-must point `server.externalDbSecret` at an operator-managed PostgreSQL database.
+chart deploys the gateway and sandbox runtime integration. The default gateway
+workload is a StatefulSet for SQLite-backed single-replica installs. External
+database-backed installs can render a Deployment with `workload.kind=deployment`;
+HA deployments must point `server.externalDbSecret` at an operator-managed
+PostgreSQL database.
 Standalone local deployments start the gateway with a selected runtime such as
 Docker, Podman, or VM. The CLI can register multiple gateways and switch between
 them without changing the sandbox architecture.

architecture/gateway.md

Lines changed: 2 additions & 2 deletions
@@ -384,8 +384,8 @@ hook Job using the gateway image itself -- no separate cert-generation image,
 no extra mirror burden in air-gapped environments. In the default built-in PKI
 path the hook creates TLS and sandbox JWT Secrets. When cert-manager is enabled,
 cert-manager owns TLS Secrets and the hook runs with `--jwt-only` so the
-required sandbox JWT Secret still exists before the gateway StatefulSet mounts
-it, even if `pkiInitJob.enabled` remains true. On package-managed local
+required sandbox JWT Secret still exists before the gateway workload mounts it,
+even if `pkiInitJob.enabled` remains true. On package-managed local
 gateways, the same command runs from the systemd
 unit's `ExecStartPre` to bootstrap PKI into the configured local TLS directory
 on first start. The Linux package unit defaults that directory to

deploy/helm/openshell/README.md

Lines changed: 10 additions & 1 deletion
@@ -62,7 +62,8 @@ See [`values.yaml`](values.yaml) for source defaults. Selected overlays:

 ### Database backend

-By default, OpenShell uses SQLite:
+By default, OpenShell uses SQLite and runs the gateway as a StatefulSet so the
+database is backed by a per-pod PVC:

 ```yaml
 server:
@@ -89,9 +90,15 @@ Then install the chart pointing at that Secret:
 ```bash
 helm install openshell oci://ghcr.io/nvidia/openshell/helm-chart --version <version> \
   -n openshell \
+  --set workload.kind=deployment \
   --set server.externalDbSecret=my-pg-credentials
 ```

+Use `workload.kind=deployment` for external database-backed multi-replica
+gateways. `workload.kind=statefulset` is still available for single-replica
+SQLite installs and for operators who explicitly need StatefulSet identity or
+storage semantics.
+
 #### OpenShift

 Append these flags to any of the PostgreSQL commands above for OpenShift:
@@ -229,6 +236,8 @@ add `ci/values-spire.yaml` to the OpenShell release values files.
 | supervisor.image.tag | string | `""` | Supervisor image tag. Defaults to the chart appVersion when empty. |
 | supervisor.sideloadMethod | string | `""` | How the supervisor binary is delivered into sandbox pods. Empty (default) = auto-detect from cluster version: K8s >= v1.35 -> "image-volume" (ImageVolume enabled by default; GA in v1.36) K8s < v1.35 -> "init-container" (copies via init container + emptyDir) On K8s v1.33-v1.34 with the ImageVolume feature gate manually enabled, set this to "image-volume" explicitly. |
 | tolerations | list | `[]` | Tolerations for the gateway pod. |
+| workload.allowMultiReplicaStatefulSet | bool | `false` | Allow replicaCount > 1 while rendering a StatefulSet. Prefer workload.kind=deployment for external database-backed multi-replica gateways; this override exists for operators who explicitly require StatefulSet identity or storage semantics. |
+| workload.kind | string | `"statefulset"` | Gateway workload controller kind. Use `statefulset` for the default SQLite database, or `deployment` when server.externalDbSecret points at an external database. |

 ----------------------------------------------
 Autogenerated from chart metadata using [helm-docs v1.14.2](https://github.com/norwoodj/helm-docs/releases/v1.14.2)

deploy/helm/openshell/README.md.gotmpl

Lines changed: 8 additions & 1 deletion
@@ -62,7 +62,8 @@ See [`values.yaml`](values.yaml) for source defaults. Selected overlays:

 ### Database backend

-By default, OpenShell uses SQLite:
+By default, OpenShell uses SQLite and runs the gateway as a StatefulSet so the
+database is backed by a per-pod PVC:

 ```yaml
 server:
@@ -89,9 +90,15 @@ Then install the chart pointing at that Secret:
 ```bash
 helm install openshell oci://ghcr.io/nvidia/openshell/helm-chart --version <version> \
   -n openshell \
+  --set workload.kind=deployment \
   --set server.externalDbSecret=my-pg-credentials
 ```

+Use `workload.kind=deployment` for external database-backed multi-replica
+gateways. `workload.kind=statefulset` is still available for single-replica
+SQLite installs and for operators who explicitly need StatefulSet identity or
+storage semantics.
+
 #### OpenShift

 Append these flags to any of the PostgreSQL commands above for OpenShift:

deploy/helm/openshell/ci/values-high-availability.yaml

Lines changed: 3 additions & 0 deletions
@@ -6,5 +6,8 @@
 # overlay expects the caller to provide a PostgreSQL Secret named openshell-ha-pg.
 replicaCount: 2

+workload:
+  kind: deployment
+
 server:
   externalDbSecret: openshell-ha-pg
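
One way to confirm that the HA overlay selects the intended controller is to render the chart locally and list the workload kinds in the output. The sketch below assumes the chart renders offline with this overlay; `rendered_workload_kinds` is a hypothetical helper name, not something the commit adds.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): read rendered manifests on
# stdin and print which gateway workload kinds appear, one per line.
rendered_workload_kinds() {
  grep -E '^kind: (Deployment|StatefulSet)$' | sed 's/^kind: //' | sort -u
}
```

A possible invocation is `helm template openshell deploy/helm/openshell -f deploy/helm/openshell/ci/values-high-availability.yaml | rendered_workload_kinds`. If the overlay takes effect as described in this diff, the output would be expected to list `Deployment` and not `StatefulSet`.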
Lines changed: 177 additions & 0 deletions
@@ -0,0 +1,177 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+{{/*
+Gateway pod template shared by the StatefulSet and Deployment workload shapes.
+*/}}
+{{- define "openshell.gatewayPodTemplate" -}}
+metadata:
+  annotations:
+    # Roll the gateway workload when the rendered gateway TOML changes - the
+    # gateway only reads /etc/openshell/gateway.toml at startup, so without
+    # this annotation a `helm upgrade` that only mutates the ConfigMap would
+    # leave pods running with stale config.
+    checksum/gateway-config: {{ include (print $.Template.BasePath "/gateway-config.yaml") . | sha256sum }}
+    {{- with .Values.podAnnotations }}
+    {{- toYaml . | nindent 4 }}
+    {{- end }}
+  labels:
+    {{- include "openshell.labels" . | nindent 4 }}
+    {{- with .Values.podLabels }}
+    {{- toYaml . | nindent 4 }}
+    {{- end }}
+spec:
+  terminationGracePeriodSeconds: {{ .Values.podLifecycle.terminationGracePeriodSeconds }}
+  {{- with .Values.imagePullSecrets }}
+  imagePullSecrets:
+    {{- toYaml . | nindent 4 }}
+  {{- end }}
+  serviceAccountName: {{ include "openshell.serviceAccountName" . }}
+  {{- if .Values.server.hostGatewayIP }}
+  hostAliases:
+    - ip: {{ .Values.server.hostGatewayIP | quote }}
+      hostnames:
+        - host.docker.internal
+        - host.openshell.internal
+  {{- end }}
+  securityContext:
+    {{- toYaml .Values.podSecurityContext | nindent 4 }}
+  containers:
+    - name: openshell-gateway
+      securityContext:
+        {{- toYaml .Values.securityContext | nindent 8 }}
+      image: {{ include "openshell.image" . | quote }}
+      imagePullPolicy: {{ .Values.image.pullPolicy }}
+      args:
+        - --config
+        - /etc/openshell/gateway.toml
+        {{- if not .Values.server.externalDbSecret }}
+        - --db-url
+        - {{ .Values.server.dbUrl | quote }}
+        {{- end }}
+      env:
+        {{- if .Values.server.externalDbSecret }}
+        - name: OPENSHELL_DB_URL
+          valueFrom:
+            secretKeyRef:
+              name: {{ .Values.server.externalDbSecret }}
+              key: uri
+        {{- end }}
+        # All gateway settings live in the ConfigMap-backed TOML file
+        # mounted at /etc/openshell/gateway.toml. The only env var below
+        # is a process-level setting consumed by libraries outside
+        # gateway code (currently just SSL_CERT_FILE for OIDC issuer TLS).
+        {{- if and .Values.server.oidc.issuer .Values.server.oidc.caConfigMapName }}
+        # OIDC issuer custom-CA: rustls/reqwest read SSL_CERT_FILE for
+        # outbound TLS verification. This is a process-level env var
+        # consumed by the TLS stack itself, not by gateway code, so it
+        # cannot be represented in the gateway TOML schema.
+        - name: SSL_CERT_FILE
+          value: /etc/openshell-tls/oidc-ca/ca.crt
+        {{- end }}
+      volumeMounts:
+        {{- if eq (include "openshell.workloadKind" .) "statefulset" }}
+        - name: openshell-data
+          mountPath: /var/openshell
+        {{- end }}
+        - name: gateway-config
+          mountPath: /etc/openshell
+          readOnly: true
+        - name: sandbox-jwt
+          mountPath: /etc/openshell-jwt
+          readOnly: true
+        {{- if not .Values.server.disableTls }}
+        - name: tls-cert
+          mountPath: /etc/openshell-tls/server
+          readOnly: true
+        {{- if or .Values.server.tls.clientCaSecretName (and .Values.pkiInitJob.enabled (not .Values.certManager.enabled)) (and .Values.certManager.enabled .Values.certManager.clientCaFromServerTlsSecret) }}
+        - name: tls-client-ca
+          mountPath: /etc/openshell-tls/client-ca
+          readOnly: true
+        {{- end }}
+        {{- end }}
+        {{- if and .Values.server.oidc.issuer .Values.server.oidc.caConfigMapName }}
+        - name: oidc-ca
+          mountPath: /etc/openshell-tls/oidc-ca
+          readOnly: true
+        {{- end }}
+      ports:
+        - name: grpc
+          containerPort: {{ .Values.service.port }}
+          protocol: TCP
+        - name: health
+          containerPort: {{ .Values.service.healthPort }}
+          protocol: TCP
+        {{- if .Values.service.metricsPort }}
+        - name: metrics
+          containerPort: {{ .Values.service.metricsPort }}
+          protocol: TCP
+        {{- end }}
+      startupProbe:
+        httpGet:
+          path: /healthz
+          port: health
+        periodSeconds: {{ .Values.probes.startup.periodSeconds }}
+        timeoutSeconds: {{ .Values.probes.startup.timeoutSeconds }}
+        failureThreshold: {{ .Values.probes.startup.failureThreshold }}
+      livenessProbe:
+        httpGet:
+          path: /healthz
+          port: health
+        initialDelaySeconds: {{ .Values.probes.liveness.initialDelaySeconds }}
+        periodSeconds: {{ .Values.probes.liveness.periodSeconds }}
+        timeoutSeconds: {{ .Values.probes.liveness.timeoutSeconds }}
+        failureThreshold: {{ .Values.probes.liveness.failureThreshold }}
+      readinessProbe:
+        httpGet:
+          path: /readyz
+          port: health
+        initialDelaySeconds: {{ .Values.probes.readiness.initialDelaySeconds }}
+        periodSeconds: {{ .Values.probes.readiness.periodSeconds }}
+        timeoutSeconds: {{ .Values.probes.readiness.timeoutSeconds }}
+        failureThreshold: {{ .Values.probes.readiness.failureThreshold }}
+      resources:
+        {{- toYaml .Values.resources | nindent 8 }}
+  volumes:
+    - name: gateway-config
+      configMap:
+        name: {{ include "openshell.fullname" . }}-config
+    - name: sandbox-jwt
+      secret:
+        secretName: {{ include "openshell.sandboxJwtSecretName" . }}
+        defaultMode: {{ .Values.server.sandboxJwt.secretDefaultMode | default 0400 }}
+    {{- if not .Values.server.disableTls }}
+    - name: tls-cert
+      secret:
+        secretName: {{ .Values.server.tls.certSecretName }}
+    {{- if or .Values.server.tls.clientCaSecretName (and .Values.pkiInitJob.enabled (not .Values.certManager.enabled)) (and .Values.certManager.enabled .Values.certManager.clientCaFromServerTlsSecret) }}
+    - name: tls-client-ca
+      secret:
+        {{- if or (and .Values.pkiInitJob.enabled (not .Values.certManager.enabled)) (and .Values.certManager.enabled .Values.certManager.clientCaFromServerTlsSecret) }}
+        secretName: {{ .Values.server.tls.certSecretName }}
+        items:
+          - key: ca.crt
+            path: ca.crt
+        {{- else }}
+        secretName: {{ .Values.server.tls.clientCaSecretName }}
+        {{- end }}
+    {{- end }}
+    {{- end }}
+    {{- if and .Values.server.oidc.issuer .Values.server.oidc.caConfigMapName }}
+    - name: oidc-ca
+      configMap:
+        name: {{ .Values.server.oidc.caConfigMapName }}
+    {{- end }}
+  {{- with .Values.nodeSelector }}
+  nodeSelector:
+    {{- toYaml . | nindent 4 }}
+  {{- end }}
+  {{- with .Values.affinity }}
+  affinity:
+    {{- toYaml . | nindent 4 }}
+  {{- end }}
+  {{- with .Values.tolerations }}
+  tolerations:
+    {{- toYaml . | nindent 4 }}
+  {{- end }}
+{{- end }}

deploy/helm/openshell/templates/_helpers.tpl

Lines changed: 25 additions & 1 deletion
@@ -144,14 +144,38 @@ init-container
 {{- end -}}
 {{- end }}

+{{/*
+Gateway workload kind. StatefulSet is the default because the default SQLite
+database requires persistent per-pod storage.
+*/}}
+{{- define "openshell.workloadKind" -}}
+{{- $workload := .Values.workload | default dict -}}
+{{- if not (kindIs "map" $workload) -}}
+{{- fail "workload must be a map with kind and allowMultiReplicaStatefulSet fields." -}}
+{{- end -}}
+{{- default "statefulset" (get $workload "kind") | lower -}}
+{{- end }}
+
 {{/*
 Validate chart values that Helm would otherwise accept silently.
 */}}
 {{- define "openshell.validateValues" -}}
+{{- $workloadKind := include "openshell.workloadKind" . -}}
+{{- $workload := .Values.workload | default dict -}}
+{{- $replicaCount := int (default 1 .Values.replicaCount) -}}
 {{- if and (hasKey .Values "postgres") (kindIs "map" .Values.postgres) (hasKey .Values.postgres "enabled") -}}
 {{- fail "postgres.enabled was removed; the OpenShell chart no longer deploys PostgreSQL. Provision PostgreSQL separately and set server.externalDbSecret to a Secret containing a PostgreSQL URI." -}}
 {{- end -}}
-{{- if and (gt (int (default 1 .Values.replicaCount)) 1) (not .Values.server.externalDbSecret) -}}
+{{- if not (or (eq $workloadKind "statefulset") (eq $workloadKind "deployment")) -}}
+{{- fail "workload.kind must be one of: statefulset, deployment." -}}
+{{- end -}}
+{{- if and (eq $workloadKind "deployment") (not .Values.server.externalDbSecret) -}}
+{{- fail "workload.kind=deployment requires server.externalDbSecret; use workload.kind=statefulset for the default SQLite database." -}}
+{{- end -}}
+{{- if and (gt $replicaCount 1) (not .Values.server.externalDbSecret) -}}
 {{- fail "replicaCount > 1 requires server.externalDbSecret; multiple gateway replicas cannot share the default per-pod SQLite database." -}}
 {{- end -}}
+{{- if and (eq $workloadKind "statefulset") (gt $replicaCount 1) (not (get $workload "allowMultiReplicaStatefulSet" | default false)) -}}
+{{- fail "replicaCount > 1 with workload.kind=statefulset requires workload.allowMultiReplicaStatefulSet=true; use workload.kind=deployment for external database-backed multi-replica gateways." -}}
+{{- end -}}
 {{- end }}
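
The validation rules added to `openshell.validateValues` can be hard to follow in template syntax. The following is a minimal shell sketch of the same decision order for the workload checks, written only to illustrate which value combinations the template rejects. `validate_workload` and its positional arguments are hypothetical; the chart itself enforces these rules through Helm's `fail`, and the `postgres.enabled` check is omitted here.

```bash
#!/usr/bin/env bash
# Hypothetical model of the workload checks in openshell.validateValues.
# Arguments: kind, replicaCount, externalDbSecret, allowMultiReplicaStatefulSet.
validate_workload() {
  local kind replicas="${2:-1}" db_secret="${3:-}" allow_multi="${4:-false}"
  # An empty kind falls back to statefulset; the kind is compared lowercased.
  kind="$(printf '%s' "${1:-statefulset}" | tr '[:upper:]' '[:lower:]')"
  case "$kind" in
    statefulset|deployment) ;;
    *) echo "workload.kind must be one of: statefulset, deployment." >&2; return 1 ;;
  esac
  if [ "$kind" = "deployment" ] && [ -z "$db_secret" ]; then
    echo "workload.kind=deployment requires server.externalDbSecret" >&2; return 1
  fi
  if [ "$replicas" -gt 1 ] && [ -z "$db_secret" ]; then
    echo "replicaCount > 1 requires server.externalDbSecret" >&2; return 1
  fi
  if [ "$kind" = "statefulset" ] && [ "$replicas" -gt 1 ] && [ "$allow_multi" != "true" ]; then
    echo "replicaCount > 1 with workload.kind=statefulset requires workload.allowMultiReplicaStatefulSet=true" >&2
    return 1
  fi
  echo "ok"
}
```

For example, `validate_workload deployment 2 openshell-ha-pg false` corresponds to the HA overlay in this commit and passes, while `validate_workload statefulset 2 openshell-ha-pg false` is rejected unless the fourth argument is `true`.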
