Skip to content

fix: short-circuit cert-rotation reconciler under USE_OLM_TLS#30

Merged
chengjingtao merged 3 commits into
alauda/release-1.10from
fix/olm-cert-rotation
May 13, 2026
Merged

fix: short-circuit cert-rotation reconciler under USE_OLM_TLS#30
chengjingtao merged 3 commits into
alauda/release-1.10from
fix/olm-cert-rotation

Conversation

@chengjingtao
Copy link
Copy Markdown

@chengjingtao chengjingtao commented May 13, 2026

Summary

Follow-up to #25 (using olm cert) and #29 (restore #25 after vuln-0512). Same pattern (early-return on USE_OLM_TLS), applied to the third knative.dev/pkg/webhook/ component that wasn't covered, with a companion hack/patches/ file so the next go mod vendor doesn't wipe it.

What

  1. vendor/knative.dev/pkg/webhook/certificates/certificates.go — adds at the top of reconcileCertificate:

    if os.Getenv("USE_OLM_TLS") != "" { // olm manages cert rotation
        return nil
    }

    Mirrors PR fix: using olm cert #25's pattern in vendor/knative.dev/pkg/webhook/resourcesemantics/conversion/reconciler.go.

  2. hack/patches/008-certificates-reconciler.patch — companion patch file, sibling of the existing 006-conversion-webhook.patch / 007-conversion-reconciler.patch from PR fix: using olm cert #25. Reapplied by make update-deps (= git apply hack/patches/*.patch) after any go mod vendor run, so this fix can't get silently dropped the way PR fix: using olm cert #25's vendor edits were dropped by PR fix: vuln 0512 #28's vuln-0512 sweep.

    Verified by round-trip: revert vendor file to 34c2b0fe, git apply hack/patches/008-certificates-reconciler.patch → guard restored.

Why

The WebhookCertificates reconciler is the third component in knative.dev/pkg/webhook/ that touches the webhook cert secret. PR #25 patched the other two (TLS-read in webhook.go, CRD caBundle sync in conversion/reconciler.go) but left the cert-rotation reconciler alone.

In OLM mode the reconciler:

  1. Reads the OLM-provisioned secret (type kubernetes.io/tls, keys tls.crt / tls.key / olmCAKey).
  2. Logs Certificate secret "operator-webhook-service-cert" is missing key "server-key.pem" at INFO (line 78) — its expected key names don't match what OLM populates.
  3. Self-signs a new cert via certresources.MakeSecret, which only emits server-cert.pem / server-key.pem / ca-cert.pem.
  4. Replaces secret.Data wholesale — would clobber OLM's tls.crt / tls.key.
  5. Sends the Update — apiserver rejects on type validation: Reconcile error ... Secret ... is invalid: [data[tls.crt]: Required value, data[tls.key]: Required value].
  6. Exponential-backoff retries forever; secret never converges.

Observed on the test cluster running #29's webhook image v1.10.1-34c2b0fe against the OLM-installed bundle knative-operator-bundle:v3.20.9-hotfix.129.6.gddf86032-fix-vuln-0512:

operator-webhook.WebhookCertificates  certificates/certificates.go:78
  "Certificate secret \"operator-webhook-service-cert\" is missing key \"server-key.pem\""
operator-webhook.WebhookCertificates  controller/controller.go:566
  Reconcile error ... Secret "operator-webhook-service-cert" is invalid:
  [data[tls.crt]: Required value, data[tls.key]: Required value]

Functionally harmless — apiserver's type validation is what prevents the reconciler from corrupting OLM's cert — but the log spam runs continuously (every few seconds to every minute) and hides real issues. The webhook itself reads certs via #25's patched GetCertificate and serves correctly.

Verify after merge

After rebuilding + rolling the bundle, the two log signatures above should both disappear, and operator-webhook.WebhookCertificates should fall silent in OLM-installed clusters. Standalone (non-OLM) installs are unaffected — USE_OLM_TLS is unset, so the reconciler runs as before.

Note on commit history

This PR's branch ended up with three commits because of an intermediate test workflow that staged an unintended revert. Net branch diff is +28 lines across two files (the 5-line vendor edit + the 23-line patch file). Recommend squash on merge.

🤖 Generated with Claude Code

chengjingtao and others added 3 commits May 13, 2026 08:11
PR #25 (using olm cert) patched two of the three knative.dev/pkg webhook
components that touch the cert secret:

  1. vendor/knative.dev/pkg/webhook/webhook.go
     TLS server's GetCertificate — read tls.crt / tls.key when USE_OLM_TLS
  2. vendor/knative.dev/pkg/webhook/resourcesemantics/conversion/reconciler.go
     CRD caBundle sync — early-return when USE_OLM_TLS (OLM syncs caBundle)

It missed the third one: vendor/knative.dev/pkg/webhook/certificates/certificates.go,
the WebhookCertificates cert-rotation reconciler. In OLM mode that reconciler:

  1. Reads the OLM-provisioned secret (type kubernetes.io/tls, keys tls.crt /
     tls.key / olmCAKey).
  2. Sees its expected server-key.pem / server-cert.pem / ca-cert.pem missing,
     logs `Certificate secret "operator-webhook-service-cert" is missing key
     "server-key.pem"` at INFO.
  3. Self-signs a new cert via certresources.MakeSecret (which only emits the
     three server-*.pem keys).
  4. Replaces secret.Data wholesale — would clobber the OLM-injected tls.crt
     / tls.key.
  5. Sends the Update; apiserver rejects because the kubernetes.io/tls secret
     type requires tls.crt / tls.key — logs `Reconcile error ... Secret ... is
     invalid: [data[tls.crt]: Required value, data[tls.key]: Required value]`
     at ERROR, then exponential-backoff retries indefinitely.

Functionally harmless (the apiserver rejection is what stops the reconciler
from corrupting OLM's cert), but the log spam is loud and hides real issues.
The webhook server itself reads via the patched GetCertificate from #25 and
serves correctly.

Mirror PR #25's pattern — early-return at the top of reconcileCertificate
when USE_OLM_TLS is set so the reconciler doesn't even try to rotate certs
OLM is already managing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors hack/patches/006-conversion-webhook.patch and 007-conversion-reconciler.patch
from PR #25 so the cert-rotation reconciler's USE_OLM_TLS guard also survives
`go mod vendor` runs.

The reason PR #25's vendor edits disappeared after the vuln-0512 sweep was that
nobody ran `make update-deps` (= `git apply hack/patches/*.patch`) after the
vendor regeneration. Adding 008 alongside its sibling patches keeps the recovery
mechanical: `go mod vendor && make update-deps` reproduces the working tree.

Without this file, a future vendor regeneration would silently revert this PR's
fix the same way PR #28 reverted PR #25's.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ed285a3 inadvertently staged a revert of vendor/.../certificates.go (a stale
`git checkout 34c2b0f -- vendor/...` from a patch round-trip test had left
the index pointing at pre-fix content). HEAD's working tree still had the
correct guarded version, but the commit captured the reverted index state.

This commit puts the vendor edit back. End state on the branch:

  c62378b  fix: short-circuit cert-rotation reconciler under USE_OLM_TLS  (vendor edit)
  ed285a3  hack/patches: add 008 for cert-rotation reconciler             (patch file only, vendor edit reverted)
  HEAD     fix: re-include vendor edit dropped by 008 patch commit        (vendor edit restored)

Net diff PR #30 = +5 vendor lines + new hack/patches/008 file.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@chengjingtao chengjingtao merged commit 1b57ea7 into alauda/release-1.10 May 13, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant