Skip to content

Commit 5f58cb0

Browse files
authored
fix(helm): create sandbox JWT secret when cert-manager is enabled (#1700)
* fix(helm): create sandbox JWT secret under cert-manager The cert-manager install path (certManager.enabled=true, pkiInitJob.enabled=false) left the gateway StatefulSet unable to start because nothing created the openshell-jwt-keys Secret: cert-manager owns TLS Secrets but does not mint the sandbox JWT signing key, and the certgen hook only rendered when pkiInitJob.enabled was true. Separate JWT signing-key provisioning from TLS PKI provisioning: - certgen: add a --jwt-only mode that creates only the Opaque JWT signing Secret, for use when another controller owns TLS Secrets. - certgen.yaml: render the hook when pkiInitJob.enabled OR certManager.enabled is true. cert-manager takes precedence and runs the hook with --jwt-only even if pkiInitJob.enabled remains true. Remove the mutual-exclusion failure between the two values. - _helpers.tpl: add openshell.sandboxJwtSecretName, shared by the hook and the StatefulSet mount. - Update values, README, docs, architecture, and the debug-openshell-cluster skill to reflect the new precedence; the documented cert-manager install no longer needs pkiInitJob.enabled=false. Closes #1691 * fix(helm): honor cert-manager precedence for client CA volume The client CA volume logic treated pkiInitJob.enabled as proof that built-in PKI owns the client CA. With cert-manager precedence now allowing certManager.enabled=true alongside the default pkiInitJob.enabled=true, that assumption mounts the server TLS cert secret as the client CA and ignores certManager.clientCaFromServerTlsSecret=false, which can break mTLS or trust the wrong CA. Gate the pkiInitJob.enabled term with (not certManager.enabled) in all three client CA conditions (volume mount, volume definition, and secret selection) so cert-manager owns TLS when enabled. Add a Helm test suite covering built-in PKI, cert-manager shared CA, the regression config (cert-manager + clientCaFromServerTlsSecret=false + default pkiInitJob), and the no-client-CA case.
1 parent e4bcfdf commit 5f58cb0

16 files changed

Lines changed: 422 additions & 106 deletions

File tree

.agents/skills/debug-openshell-cluster/SKILL.md

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -157,6 +157,15 @@ kubectl -n openshell get secret \
157157
openshell-jwt-keys
158158
```
159159

160+
In cert-manager installs, `certManager.enabled=true` makes cert-manager own TLS
161+
generation. The Helm chart should still render the `openshell-certgen`
162+
pre-install/pre-upgrade hook in JWT-only mode to create `openshell-jwt-keys`,
163+
even if `pkiInitJob.enabled` remains true.
164+
If the gateway pod is pending with `MountVolume.SetUp failed for volume
165+
"sandbox-jwt"` and `openshell-jwt-keys` is absent, inspect the rendered
166+
`templates/certgen.yaml` output and the hook Job logs; cert-manager creates TLS
167+
Secrets but does not create the sandbox JWT signing Secret.
168+
160169
If the gateway exits with `failed to read sandbox JWT signing key from
161170
/etc/openshell-jwt/signing.pem`, verify that `openshell-jwt-keys` contains
162171
`signing.pem`, `public.pem`, and `kid`, and that the StatefulSet mounts the

architecture/gateway.md

Lines changed: 27 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -366,32 +366,39 @@ requested for that relay.
366366

367367
## PKI Bootstrap
368368

369-
`openshell-gateway generate-certs` is the one place mTLS materials are
370-
created. Both deployment paths use it:
369+
`openshell-gateway generate-certs` is the one place local mTLS materials and
370+
sandbox JWT signing material are created. Deployment paths use it as follows:
371371

372372
| Output mode | Selector | Layout |
373373
|---|---|---|
374-
| Kubernetes Secrets | (default) `--namespace`, `--server-secret-name`, `--client-secret-name` | Two `kubernetes.io/tls` Secrets with `tls.crt` / `tls.key` / `ca.crt`. |
375-
| Filesystem | `--output-dir <DIR>` | `<dir>/{ca.crt, ca.key, server/tls.{crt,key}, client/tls.{crt,key}}`. Also copies client materials to `$XDG_CONFIG_HOME/openshell/gateways/openshell/mtls/` for CLI auto-discovery. |
374+
| Kubernetes Secrets | (default) `--namespace`, `--server-secret-name`, `--client-secret-name`, `--jwt-secret-name` | Two `kubernetes.io/tls` Secrets with `tls.crt` / `tls.key` / `ca.crt` plus one Opaque sandbox JWT Secret with `signing.pem` / `public.pem` / `kid`. |
375+
| Kubernetes JWT-only Secret | `--namespace`, `--jwt-only`, `--jwt-secret-name` | One Opaque sandbox JWT Secret with `signing.pem` / `public.pem` / `kid`. |
376+
| Filesystem | `--output-dir <DIR>` | `<dir>/{ca.crt, ca.key, server/tls.{crt,key}, client/tls.{crt,key}, jwt/{signing.pem,public.pem,kid}}`. Also copies client materials to `$XDG_CONFIG_HOME/openshell/gateways/openshell/mtls/` for CLI auto-discovery. |
376377

377378
On Kubernetes, the Helm chart runs the command via a pre-install/pre-upgrade
378379
hook Job using the gateway image itself -- no separate cert-generation image,
379-
no extra mirror burden in air-gapped environments. On package-managed local
380-
gateways, the same command runs from the systemd unit's `ExecStartPre` to
381-
bootstrap PKI into the configured local TLS directory on first start. The
382-
Linux package unit defaults that directory to `~/.local/state/openshell/tls`
383-
through `OPENSHELL_LOCAL_TLS_DIR` so certificate generation and runtime
384-
auto-detection use the same path across systemd versions.
385-
386-
Both modes share the same idempotency contract: all targets present -> skip;
387-
partial state -> fail with a recovery hint; nothing present -> generate and
388-
write. This guards mTLS continuity across restarts and upgrades while still
389-
recovering cleanly if an operator deletes everything and starts over.
390-
391-
Operators who manage PKI externally (cert-manager, an enterprise CA, or
392-
pre-created Secrets) disable the Helm hook via `pkiInitJob.enabled=false`.
393-
The chart also ships a `certManager.*` path that produces equivalent Secrets
394-
through cert-manager `Issuer`/`Certificate` resources.
380+
no extra mirror burden in air-gapped environments. In the default built-in PKI
381+
path the hook creates TLS and sandbox JWT Secrets. When cert-manager is enabled,
382+
cert-manager owns TLS Secrets and the hook runs with `--jwt-only` so the
383+
required sandbox JWT Secret still exists before the gateway StatefulSet mounts
384+
it, even if `pkiInitJob.enabled` remains true. On package-managed local
385+
gateways, the same command runs from the systemd
386+
unit's `ExecStartPre` to bootstrap PKI into the configured local TLS directory
387+
on first start. The Linux package unit defaults that directory to
388+
`~/.local/state/openshell/tls` through `OPENSHELL_LOCAL_TLS_DIR` so certificate
389+
generation and runtime auto-detection use the same path across systemd
390+
versions.
391+
392+
The bootstrap paths share the same idempotency contract: all requested targets
393+
present -> skip; partial requested state -> fail with a recovery hint; nothing
394+
requested present -> generate and write. This guards continuity across restarts
395+
and upgrades while still recovering cleanly if an operator deletes everything
396+
and starts over.
397+
398+
Operators who manage TLS PKI with cert-manager enable `certManager.enabled`;
399+
cert-manager takes precedence over built-in TLS generation and the chart still
400+
renders the JWT-only hook. Operators who pre-create all TLS and JWT Secrets can
401+
disable both `pkiInitJob.enabled` and `certManager.enabled`.
395402

396403
## Configuration
397404

crates/openshell-server/src/certgen.rs

Lines changed: 91 additions & 33 deletions
Original file line numberDiff line numberDiff line change
@@ -6,8 +6,12 @@
66
//! Two output modes, dispatched by the presence of `--output-dir`:
77
//!
88
//! - **Kubernetes mode** (default): create two `kubernetes.io/tls` Secrets
9-
//! in the supplied namespace. Used by the Helm pre-install hook. Requires
10-
//! `--namespace`, `--server-secret-name`, `--client-secret-name`.
9+
//! and one sandbox-JWT signing Secret in the supplied namespace. Used by
10+
//! the Helm pre-install hook. Requires `--namespace`,
11+
//! `--server-secret-name`, `--client-secret-name`, and `--jwt-secret-name`.
12+
//! - **Kubernetes JWT-only mode** (`--jwt-only`): create only the
13+
//! sandbox-JWT signing Secret. Used when another controller, such as
14+
//! cert-manager, owns the TLS Secrets.
1115
//! - **Local mode** (`--output-dir <DIR>`): write PEMs to the local package
1216
//! filesystem layout. Used by systemd units' `ExecStartPre`. Also copies
1317
//! client materials to
@@ -47,11 +51,11 @@ pub struct CertgenArgs {
4751
namespace: Option<String>,
4852

4953
/// Name of the server TLS Secret (`kubernetes.io/tls`) to create.
50-
#[arg(long, required_unless_present = "output_dir")]
54+
#[arg(long, required_unless_present_any = ["output_dir", "jwt_only"])]
5155
server_secret_name: Option<String>,
5256

5357
/// Name of the client TLS Secret (`kubernetes.io/tls`) to create.
54-
#[arg(long, required_unless_present = "output_dir")]
58+
#[arg(long, required_unless_present_any = ["output_dir", "jwt_only"])]
5559
client_secret_name: Option<String>,
5660

5761
/// Name of the sandbox-JWT signing-key Secret (`Opaque`) to create.
@@ -60,6 +64,11 @@ pub struct CertgenArgs {
6064
#[arg(long, required_unless_present = "output_dir")]
6165
jwt_secret_name: Option<String>,
6266

67+
/// Create only the sandbox-JWT signing-key Secret in Kubernetes mode.
68+
/// This is used when another controller owns TLS Secret provisioning.
69+
#[arg(long, conflicts_with = "output_dir")]
70+
jwt_only: bool,
71+
6372
/// Extra Subject Alternative Name for the server certificate. Repeatable.
6473
/// Auto-detected as an IP address or DNS name.
6574
#[arg(long = "server-san", value_name = "SAN")]
@@ -116,6 +125,51 @@ async fn run_kubernetes(args: &CertgenArgs, bundle: &PkiBundle) -> Result<()> {
116125
.namespace
117126
.as_deref()
118127
.ok_or_else(|| miette::miette!("--namespace is required (or set POD_NAMESPACE)"))?;
128+
129+
let client = Client::try_default()
130+
.await
131+
.into_diagnostic()
132+
.wrap_err("failed to construct in-cluster Kubernetes client")?;
133+
let api: Api<Secret> = Api::namespaced(client, namespace);
134+
135+
if args.jwt_only {
136+
let jwt_name = args
137+
.jwt_secret_name
138+
.as_deref()
139+
.ok_or_else(|| miette::miette!("--jwt-secret-name is required"))?;
140+
let jwt_exists = api
141+
.get_opt(jwt_name)
142+
.await
143+
.into_diagnostic()
144+
.wrap_err_with(|| format!("failed to read secret {jwt_name}"))?
145+
.is_some();
146+
if jwt_exists {
147+
info!(
148+
namespace = %namespace,
149+
jwt = %jwt_name,
150+
"JWT signing secret already exists, skipping."
151+
);
152+
return Ok(());
153+
}
154+
155+
let jwt_secret = jwt_signing_secret(
156+
jwt_name,
157+
&bundle.jwt_signing_key_pem,
158+
&bundle.jwt_public_key_pem,
159+
&bundle.jwt_key_id,
160+
);
161+
api.create(&PostParams::default(), &jwt_secret)
162+
.await
163+
.into_diagnostic()
164+
.wrap_err_with(|| format!("failed to create secret {jwt_name}"))?;
165+
info!(
166+
namespace = %namespace,
167+
jwt = %jwt_name,
168+
"JWT signing secret created."
169+
);
170+
return Ok(());
171+
}
172+
119173
let server_name = args
120174
.server_secret_name
121175
.as_deref()
@@ -124,17 +178,6 @@ async fn run_kubernetes(args: &CertgenArgs, bundle: &PkiBundle) -> Result<()> {
124178
.client_secret_name
125179
.as_deref()
126180
.ok_or_else(|| miette::miette!("--client-secret-name is required"))?;
127-
let jwt_name = args
128-
.jwt_secret_name
129-
.as_deref()
130-
.ok_or_else(|| miette::miette!("--jwt-secret-name is required"))?;
131-
132-
let client = Client::try_default()
133-
.await
134-
.into_diagnostic()
135-
.wrap_err("failed to construct in-cluster Kubernetes client")?;
136-
let api: Api<Secret> = Api::namespaced(client, namespace);
137-
138181
let server_exists = api
139182
.get_opt(server_name)
140183
.await
@@ -147,6 +190,11 @@ async fn run_kubernetes(args: &CertgenArgs, bundle: &PkiBundle) -> Result<()> {
147190
.into_diagnostic()
148191
.wrap_err_with(|| format!("failed to read secret {client_name}"))?
149192
.is_some();
193+
194+
let jwt_name = args
195+
.jwt_secret_name
196+
.as_deref()
197+
.ok_or_else(|| miette::miette!("--jwt-secret-name is required"))?;
150198
let jwt_exists = api
151199
.get_opt(jwt_name)
152200
.await
@@ -193,6 +241,34 @@ async fn run_kubernetes(args: &CertgenArgs, bundle: &PkiBundle) -> Result<()> {
193241
K8sAction::CreateAll => {}
194242
}
195243

244+
create_tls_secrets(&api, server_name, client_name, bundle).await?;
245+
let jwt_secret = jwt_signing_secret(
246+
jwt_name,
247+
&bundle.jwt_signing_key_pem,
248+
&bundle.jwt_public_key_pem,
249+
&bundle.jwt_key_id,
250+
);
251+
api.create(&PostParams::default(), &jwt_secret)
252+
.await
253+
.into_diagnostic()
254+
.wrap_err_with(|| format!("failed to create secret {jwt_name}"))?;
255+
256+
info!(
257+
namespace = %namespace,
258+
server = %server_name,
259+
client = %client_name,
260+
jwt = %jwt_name,
261+
"PKI secrets created."
262+
);
263+
Ok(())
264+
}
265+
266+
async fn create_tls_secrets(
267+
api: &Api<Secret>,
268+
server_name: &str,
269+
client_name: &str,
270+
bundle: &PkiBundle,
271+
) -> Result<()> {
196272
let server_secret = tls_secret(
197273
server_name,
198274
&bundle.server_cert_pem,
@@ -205,12 +281,6 @@ async fn run_kubernetes(args: &CertgenArgs, bundle: &PkiBundle) -> Result<()> {
205281
&bundle.client_key_pem,
206282
&bundle.ca_cert_pem,
207283
);
208-
let jwt_secret = jwt_signing_secret(
209-
jwt_name,
210-
&bundle.jwt_signing_key_pem,
211-
&bundle.jwt_public_key_pem,
212-
&bundle.jwt_key_id,
213-
);
214284

215285
api.create(&PostParams::default(), &server_secret)
216286
.await
@@ -220,18 +290,6 @@ async fn run_kubernetes(args: &CertgenArgs, bundle: &PkiBundle) -> Result<()> {
220290
.await
221291
.into_diagnostic()
222292
.wrap_err_with(|| format!("failed to create secret {client_name}"))?;
223-
api.create(&PostParams::default(), &jwt_secret)
224-
.await
225-
.into_diagnostic()
226-
.wrap_err_with(|| format!("failed to create secret {jwt_name}"))?;
227-
228-
info!(
229-
namespace = %namespace,
230-
server = %server_name,
231-
client = %client_name,
232-
jwt = %jwt_name,
233-
"PKI secrets created."
234-
);
235293
Ok(())
236294
}
237295

crates/openshell-server/src/cli.rs

Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -983,6 +983,31 @@ mod tests {
983983
));
984984
}
985985

986+
#[test]
987+
fn generate_certs_jwt_only_parses_without_tls_secret_names() {
988+
let _lock = ENV_LOCK
989+
.lock()
990+
.unwrap_or_else(std::sync::PoisonError::into_inner);
991+
let _g1 = EnvVarGuard::remove("OPENSHELL_DB_URL");
992+
let _g2 = EnvVarGuard::remove("POD_NAMESPACE");
993+
994+
let cli = Cli::try_parse_from([
995+
"openshell-gateway",
996+
"generate-certs",
997+
"--namespace",
998+
"openshell",
999+
"--jwt-only",
1000+
"--jwt-secret-name",
1001+
"openshell-jwt-keys",
1002+
])
1003+
.expect("--jwt-only should make TLS secret-name flags optional");
1004+
1005+
assert!(matches!(
1006+
cli.command,
1007+
Some(super::Commands::GenerateCerts(_))
1008+
));
1009+
}
1010+
9861011
#[test]
9871012
fn bare_invocation_with_no_db_url_parses_for_runtime_defaults() {
9881013
// db_url is Option<String> at the clap level so subcommand parsing

deploy/helm/openshell/README.md

Lines changed: 12 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -109,22 +109,19 @@ Append these flags to any of the PostgreSQL commands above for OpenShift:
109109
--set securityContext.runAsUser=null
110110
```
111111

112-
## PKI bootstrap
112+
## Secret bootstrap
113113

114114
By default, a pre-install/pre-upgrade hook Job runs `openshell-gateway generate-certs`
115-
to create the gateway's server and client mTLS Secrets. The Job uses the gateway image
116-
itself, so air-gapped environments only need to mirror that one image (no separate
117-
openssl/alpine sidecar).
115+
to create the gateway's server/client mTLS Secrets and sandbox JWT signing Secret.
116+
The Job uses the gateway image itself, so air-gapped environments only need to
117+
mirror that one image (no separate openssl/alpine sidecar).
118118

119-
The Job is idempotent:
120-
121-
- Both target Secrets exist: log and exit 0.
122-
- Exactly one exists: fail with `kubectl delete secret -n <ns> <server> <client>` recovery hint.
123-
- Neither exists: generate a CA, server cert, and client cert; POST both `kubernetes.io/tls` Secrets (`tls.crt`, `tls.key`, `ca.crt`).
124-
125-
Disable with `--set pkiInitJob.enabled=false` when bringing your own PKI (cert-manager,
126-
external CA, or pre-created Secrets). See `certManager.*` in `values.yaml` for the
127-
cert-manager alternative.
119+
When `certManager.enabled=true`, cert-manager owns the TLS Secrets and the chart
120+
runs the same hook in JWT-only mode because cert-manager does not create the
121+
sandbox JWT signing Secret. This precedence applies even if
122+
`pkiInitJob.enabled` remains true. Set `pkiInitJob.enabled=false` only when an
123+
external non-cert-manager TLS source manages TLS and you pre-create the sandbox
124+
JWT signing Secret.
128125

129126
## Values
130127

@@ -135,7 +132,7 @@ cert-manager alternative.
135132
| certManager.certificateDuration | string | `"8760h"` | Duration for cert-manager-issued certificates. |
136133
| certManager.certificateRenewBefore | string | `"720h"` | Renewal window for cert-manager-issued certificates. |
137134
| certManager.clientCaFromServerTlsSecret | bool | `true` | Mount gateway client CA from the server TLS secret's ca.crt (populated by cert-manager for certs issued by a CA Issuer). Avoids a separate openshell-server-client-ca Secret. |
138-
| certManager.enabled | bool | `false` | Create cert-manager Issuer and Certificate resources instead of using the PKI bootstrap Job. |
135+
| certManager.enabled | bool | `false` | Create cert-manager Issuer and Certificate resources. When enabled, cert-manager owns TLS and the chart runs a JWT-only certgen hook to create the sandbox JWT signing Secret that cert-manager does not manage. |
139136
| certManager.serverDnsNames | list | `["openshell","openshell.openshell.svc","openshell.openshell.svc.cluster.local","localhost","openshell.localhost","*.openshell.localhost","host.docker.internal"]` | DNS SANs on the cert-manager-issued server certificate. |
140137
| certManager.serverIpAddresses | list | `["127.0.0.1"]` | IP SANs on the cert-manager-issued server certificate. |
141138
| fullnameOverride | string | `""` | Override the full generated resource name. |
@@ -155,7 +152,7 @@ cert-manager alternative.
155152
| nameOverride | string | `"openshell"` | Override the chart name used in generated resource names. |
156153
| networkPolicy.enabled | bool | `true` | Create a NetworkPolicy restricting SSH ingress on sandbox pods to the gateway. |
157154
| nodeSelector | object | `{}` | Node selector for the gateway pod. |
158-
| pkiInitJob.enabled | bool | `true` | Run a pre-install/pre-upgrade Job that creates gateway and client mTLS Secrets. |
155+
| pkiInitJob.enabled | bool | `true` | Run a pre-install/pre-upgrade Job that creates gateway and client mTLS Secrets. When certManager.enabled=true, cert-manager owns TLS and this same hook runs in JWT-only mode even if pkiInitJob.enabled remains true. |
159156
| pkiInitJob.serverDnsNames | list | `[]` | Extra DNS SANs to append to the server certificate. |
160157
| pkiInitJob.serverIpAddresses | list | `[]` | Extra IP SANs to append to the server certificate. |
161158
| podAnnotations | object | `{}` | Extra annotations to add to the gateway pod. |

deploy/helm/openshell/README.md.gotmpl

Lines changed: 11 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -109,22 +109,19 @@ Append these flags to any of the PostgreSQL commands above for OpenShift:
109109
--set securityContext.runAsUser=null
110110
```
111111

112-
## PKI bootstrap
112+
## Secret bootstrap
113113

114114
By default, a pre-install/pre-upgrade hook Job runs `openshell-gateway generate-certs`
115-
to create the gateway's server and client mTLS Secrets. The Job uses the gateway image
116-
itself, so air-gapped environments only need to mirror that one image (no separate
117-
openssl/alpine sidecar).
118-
119-
The Job is idempotent:
120-
121-
- Both target Secrets exist: log and exit 0.
122-
- Exactly one exists: fail with `kubectl delete secret -n <ns> <server> <client>` recovery hint.
123-
- Neither exists: generate a CA, server cert, and client cert; POST both `kubernetes.io/tls` Secrets (`tls.crt`, `tls.key`, `ca.crt`).
124-
125-
Disable with `--set pkiInitJob.enabled=false` when bringing your own PKI (cert-manager,
126-
external CA, or pre-created Secrets). See `certManager.*` in `values.yaml` for the
127-
cert-manager alternative.
115+
to create the gateway's server/client mTLS Secrets and sandbox JWT signing Secret.
116+
The Job uses the gateway image itself, so air-gapped environments only need to
117+
mirror that one image (no separate openssl/alpine sidecar).
118+
119+
When `certManager.enabled=true`, cert-manager owns the TLS Secrets and the chart
120+
runs the same hook in JWT-only mode because cert-manager does not create the
121+
sandbox JWT signing Secret. This precedence applies even if
122+
`pkiInitJob.enabled` remains true. Set `pkiInitJob.enabled=false` only when an
123+
external non-cert-manager TLS source manages TLS and you pre-create the sandbox
124+
JWT signing Secret.
128125

129126
{{ template "chart.valuesSection" . }}
130127
{{ template "helm-docs.versionFooter" . }}

deploy/helm/openshell/ci/values-cert-manager.yaml

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -7,8 +7,5 @@
77
server:
88
disableTls: false
99

10-
pkiInitJob:
11-
enabled: false
12-
1310
certManager:
1411
enabled: true

0 commit comments

Comments
 (0)