fix(orchestrator): DKIM volume→host migration runs on every self-heal pass + busybox-safe copy#73
Merged
Merged
Conversation
… pass + busybox-safe copy
The v0.1.27 DKIM named-volume -> host-bind migration (MigrateLegacyDKIMVolume)
could be skipped entirely on upgrade, stranding volume-only DKIM keys and
breaking outbound signing for affected domains. Two bugs, both caught by the
v0.1.27 mx1 canary (vectiscloud.com lost signing):
1. Gating: the migration call sat AFTER self-heal's
'if !composeRewritten && !configsRewritten { return }' early-return. With
render-from-target upgrades (v0.1.25+), apply Phase 3.5 rewrites the compose
to the host-mount form and recreates rspamd/postfix BEFORE the new
orchestrator boots, so self-heal sees no drift, returns early, and never runs
the migration. Move it ahead of the early-return so it runs on every pass;
when it copies stranded keys in the no-drift path, restart rspamd + postfix
(already recreated onto the host mount by Phase 3.5) to pick them up.
2. Copy command: 'cp -an /from/. /to/' silently copies nothing under busybox
(the alpine migration image) once /to holds any entry. Replace with a
per-domain no-clobber loop that skips existing host keys and copies the rest,
reporting whether anything moved.
Fresh installs are unaffected (they render host-mount + write keys to host from
the start). Only upgrades from <=v0.1.26 with volume-resident keys were at risk.
There was a problem hiding this comment.
Pull request overview
Fixes a regression in the orchestrator self-heal path where the DKIM named-volume → host-bind migration could be skipped on upgrade, leaving DKIM private keys stranded in the legacy volume and causing outbound signing failures.
Changes:
- Runs the DKIM migration before the self-heal no-op early return so it executes on every self-heal pass.
- Updates
MigrateLegacyDKIMVolumeto return whether any keys were actually copied, enabling conditional restarts. - Replaces the busybox-incompatible
cp -an /from/. /to/with a custom copy loop.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| internal/orchestrator/docker.go | Updates DKIM migration container copy logic and returns a copied flag. |
| internal/orchestrator/compose_selfheal.go | Ensures DKIM migration runs even when no drift is detected; conditionally restarts signing services. |
Comment on lines
+278
to
+279
| "sh", "-c", | ||
| `set -e; for d in /from/*/; do [ -e "$d" ] || continue; n=$(basename "$d"); if [ ! -e "/to/$n" ]; then cp -a "/from/$n" "/to/$n" && echo "copied:$n"; fi; done`, |
Comment on lines
249
to
+254
| if !composeRewritten && !configsRewritten { | ||
| // Apply Phase 3.5 already recreated the signing services onto the host | ||
| // mount. If we just copied stranded keys into it, those running services | ||
| // don't see them yet — restart rspamd + postfix to close the unsigned | ||
| // window. When nothing was copied this whole branch is a pure no-op pass. | ||
| if dkimCopied { |
- Copy per key-file (not per domain dir): a domain whose host dir already exists but is missing a newer/rotated selector key (created via host CLI, then DKIM-rotated by the pre-migration API into the legacy volume) now gets that key copied. mkdir -p the dest domain dir; still never clobbers an existing host key. Verified on sa1001 busybox incl the rotated-selector case. - Restart rspamd+postfix whenever keys were copied, in the drift path too: ApplyComposeServices can no-op them (image+compose-config unchanged), so they would not pick up freshly-copied keys without an explicit restart. Merge into the existing config-changed restart set (deduped) so a service restarts once.
Contributor
Author
|
Both Copilot review points addressed in 851d5ed:
CI re-running on the new commit. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The v0.1.27 DKIM named-volume → host-bind migration (
MigrateLegacyDKIMVolume) could be skipped entirely on upgrade, stranding volume-only DKIM keys and breaking outbound signing for affected domains. Found by canarying v0.1.27 on the mx1 portfolio box, wherevectiscloud.com(a volume-only key) lost its signature after the upgrade.Two distinct bugs, both fixed here:
1. Migration gated behind drift it never sees on upgrade
The migration call sat after self-heal's
if !composeRewritten && !configsRewritten { return }early-return. With render-from-target upgrades (v0.1.25+), apply Phase 3.5 rewrites the compose to the host-mount form and recreates rspamd/postfix before the new orchestrator boots. So when the new orchestrator's self-heal runs, it sees no drift, returns early, and never reaches the migration — confirmed by the new orchestrator logging zero "dkim migration" lines on the mx1 upgrade.Fix: move the migration ahead of the early-return so it runs on every self-heal pass (idempotent, no-op once the legacy volume is gone). In the no-drift path, if it copied stranded keys, restart
rspamd+postfix(already recreated onto the host mount by Phase 3.5) so they pick the keys up.2.
cp -an /from/. /to/copies nothing under busyboxThe migration image is
alpine:3.24(busybox).cp -an /from/. /to/silently copies nothing once/toalready holds any entry — busybox bails at the existing top-level dir instead of merging per-file like GNU cp. Replaced with a per-domain, no-clobber loop that skips existing host keys, copies the rest, and echoes each copied domain so the Go side knows whether a signing-service restart is needed.MigrateLegacyDKIMVolumenow returns(copied bool, err error).Blast radius
vectis_dkim-keysvolume. On such a box the post-upgrade host mount would be empty/partial and signing would break for every volume-only domain.Testing
gofmt/go vet/go build/go test ./internal/orchestrator/...green on sa1001 (Go 1.26.4).cp -an /from/. /to/leftbeta.comuncopied (rc=0); new loop copied it, left an existingalpha.comhost key unclobbered, was idempotent on re-run, and emitted thecopied:marker.d=vectiscloud.com; s=202606signature verified on a live message). This PR makes the path correct for the next upgrade (prod).Rollout
The stable release channel was rolled back to v0.1.26 while v0.1.27 carried this bug. Ship this as v0.1.28; prod (held on v0.1.26, keys volume-resident) is the box this protects.