fix(orchestrator): DKIM volume→host migration runs on every self-heal pass + busybox-safe copy#73

Merged: oakshape merged 2 commits into main from fix/dkim-selfheal-migration-v0.1.28 on Jun 13, 2026
@oakshape

Summary

The v0.1.27 DKIM named-volume → host-bind migration (MigrateLegacyDKIMVolume) could be skipped entirely on upgrade, stranding volume-only DKIM keys and breaking outbound signing for affected domains. Found by canarying v0.1.27 on the mx1 portfolio box, where vectiscloud.com (a volume-only key) lost its signature after the upgrade.

Two distinct bugs, both fixed here:

1. Migration gated behind drift it never sees on upgrade

The migration call sat after self-heal's if !composeRewritten && !configsRewritten { return } early-return. With render-from-target upgrades (v0.1.25+), apply Phase 3.5 rewrites the compose to the host-mount form and recreates rspamd/postfix before the new orchestrator boots. So when the new orchestrator's self-heal runs, it sees no drift, returns early, and never reaches the migration — confirmed by the new orchestrator logging zero "dkim migration" lines on the mx1 upgrade.

Fix: move the migration ahead of the early-return so it runs on every self-heal pass (idempotent, no-op once the legacy volume is gone). In the no-drift path, if it copied stranded keys, restart rspamd + postfix (already recreated onto the host mount by Phase 3.5) so they pick the keys up.
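
The reordering described in this fix can be sketched in Go as follows. This is a minimal illustration, not the orchestrator's actual code: `selfHealDeps`, `selfHealPass`, `applyDrift`, and `restart` are hypothetical names, and aborting the pass when the migration fails is an assumption.

```go
package main

// selfHealDeps bundles hypothetical hooks for this sketch. None of these
// names are taken from the real orchestrator code.
type selfHealDeps struct {
	migrateDKIM      func() (copied bool, err error) // stands in for MigrateLegacyDKIMVolume
	composeRewritten bool                            // drift detected in the compose file
	configsRewritten bool                            // drift detected in rendered configs
	applyDrift       func() error                    // whatever the drift path does (assumed)
	restart          func(services ...string) error  // restarts the named services
}

// selfHealPass runs the migration before the no-drift early return, so the
// migration is attempted on every pass rather than only when drift is seen.
func selfHealPass(d selfHealDeps) error {
	copied, err := d.migrateDKIM()
	if err != nil {
		return err // assumption: a failed migration aborts the pass
	}

	if !d.composeRewritten && !d.configsRewritten {
		// No drift: the signing services already run on the host mount, so
		// they only need a restart if keys were just copied into it.
		if copied {
			return d.restart("rspamd", "postfix")
		}
		return nil
	}

	return d.applyDrift()
}
```

With the migration placed first, a pass that sees no drift still restarts rspamd and postfix when keys were copied, and stays a no-op otherwise.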

2. cp -an /from/. /to/ copies nothing under busybox

The migration image is alpine:3.24 (busybox). cp -an /from/. /to/ silently copies nothing once /to already holds any entry — busybox bails at the existing top-level dir instead of merging per-file like GNU cp. Replaced with a per-domain, no-clobber loop that skips existing host keys, copies the rest, and echoes each copied domain so the Go side knows whether a signing-service restart is needed.

MigrateLegacyDKIMVolume now returns (copied bool, err error).
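
One way the Go side could derive the `copied` flag from the container's output is to scan for the `copied:` marker lines the loop echoes. The sketch below assumes the output is available as a single string; `parseCopied` and `copiedMarker` are hypothetical names, not identifiers from the PR.

```go
package main

import "strings"

// copiedMarker is the prefix the migration loop echoes for each copied entry.
const copiedMarker = "copied:"

// parseCopied scans container output line by line and collects the names
// that follow the marker. copied is true when at least one marker was seen.
func parseCopied(output string) (copied bool, names []string) {
	for _, line := range strings.Split(output, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, copiedMarker) {
			names = append(names, strings.TrimPrefix(line, copiedMarker))
		}
	}
	return len(names) > 0, names
}
```

A caller would restart the signing services only when `copied` is true, which keeps a pass with nothing to migrate free of side effects.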

Blast radius

  • Fresh installs: unaffected — they render the host-mount compose and write DKIM keys to the host path from the start; the legacy volume never exists.
  • At risk: upgrades from ≤v0.1.26 where keys live in the vectis_dkim-keys volume. On such a box the post-upgrade host mount would be empty/partial and signing would break for every volume-only domain.

Testing

  • gofmt/go vet/go build/go test ./internal/orchestrator/... green on sa1001 (Go 1.26.4).
  • Live busybox reproduction on sa1001: old cp -an /from/. /to/ left beta.com uncopied (rc=0); new loop copied it, left an existing alpha.com host key unclobbered, was idempotent on re-run, and emitted the copied: marker.
  • mx1 (the canary) is already on v0.1.27 and was remediated by hand (keys restored, d=vectiscloud.com; s=202606 signature verified on a live message). This PR makes the path correct for the next upgrade (prod).

Rollout

The stable release channel was rolled back to v0.1.26 while v0.1.27 carried this bug. Ship this as v0.1.28; prod (held on v0.1.26, keys volume-resident) is the box this protects.

… pass + busybox-safe copy

The v0.1.27 DKIM named-volume -> host-bind migration (MigrateLegacyDKIMVolume)
could be skipped entirely on upgrade, stranding volume-only DKIM keys and
breaking outbound signing for affected domains. Two bugs, both caught by the
v0.1.27 mx1 canary (vectiscloud.com lost signing):

1. Gating: the migration call sat AFTER self-heal's
   'if !composeRewritten && !configsRewritten { return }' early-return. With
   render-from-target upgrades (v0.1.25+), apply Phase 3.5 rewrites the compose
   to the host-mount form and recreates rspamd/postfix BEFORE the new
   orchestrator boots, so self-heal sees no drift, returns early, and never runs
   the migration. Move it ahead of the early-return so it runs on every pass;
   when it copies stranded keys in the no-drift path, restart rspamd + postfix
   (already recreated onto the host mount by Phase 3.5) to pick them up.

2. Copy command: 'cp -an /from/. /to/' silently copies nothing under busybox
   (the alpine migration image) once /to holds any entry. Replace with a
   per-domain no-clobber loop that skips existing host keys and copies the rest,
   reporting whether anything moved.

Fresh installs are unaffected (they render host-mount + write keys to host from
the start). Only upgrades from <=v0.1.26 with volume-resident keys were at risk.
Copilot AI review requested due to automatic review settings June 13, 2026 10:16

Copilot AI left a comment

Pull request overview

Fixes a regression in the orchestrator self-heal path where the DKIM named-volume → host-bind migration could be skipped on upgrade, leaving DKIM private keys stranded in the legacy volume and causing outbound signing failures.

Changes:

  • Runs the DKIM migration before the self-heal no-op early return so it executes on every self-heal pass.
  • Updates MigrateLegacyDKIMVolume to return whether any keys were actually copied, enabling conditional restarts.
  • Replaces the busybox-incompatible cp -an /from/. /to/ with a custom copy loop.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
internal/orchestrator/docker.go | Updates DKIM migration container copy logic and returns a copied flag.
internal/orchestrator/compose_selfheal.go | Ensures DKIM migration runs even when no drift is detected; conditionally restarts signing services.

Comment thread internal/orchestrator/docker.go Outdated
Comment on lines +278 to +279
"sh", "-c",
`set -e; for d in /from/*/; do [ -e "$d" ] || continue; n=$(basename "$d"); if [ ! -e "/to/$n" ]; then cp -a "/from/$n" "/to/$n" && echo "copied:$n"; fi; done`,
Comment on lines 249 to +254
if !composeRewritten && !configsRewritten {
	// Apply Phase 3.5 already recreated the signing services onto the host
	// mount. If we just copied stranded keys into it, those running services
	// don't see them yet — restart rspamd + postfix to close the unsigned
	// window. When nothing was copied this whole branch is a pure no-op pass.
	if dkimCopied {
- Copy per key-file (not per domain dir): a domain whose host dir already
  exists but is missing a newer/rotated selector key (created via host CLI,
  then DKIM-rotated by the pre-migration API into the legacy volume) now gets
  that key copied. mkdir -p the dest domain dir; still never clobbers an
  existing host key. Verified on sa1001 busybox, including the
  rotated-selector case.
- Restart rspamd+postfix whenever keys were copied, in the drift path too:
  ApplyComposeServices can no-op them (image+compose-config unchanged), so they
  would not pick up freshly-copied keys without an explicit restart. Merge into
  the existing config-changed restart set (deduped) so a service restarts once.
@oakshape

Both Copilot review points addressed in 851d5ed:

  1. Per key-file copy (docker.go) — the migration now copies /from/*/* with mkdir -p on the dest domain dir, so a domain whose host dir already exists but is missing a newer/rotated selector key still gets it. Verified on sa1001 busybox: an existing alpha.com/ received its volume-only 202607.key while 202606.key was left untouched; new domains and idempotency also confirmed.
  2. Restart in drift path (compose_selfheal.go) — rspamd+postfix now restart whenever dkimCopied is true, merged (deduped) into the post-ApplyComposeServices restart set, since that call can no-op them when image+compose-config are unchanged.
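
The deduped merge in point 2 could look roughly like the following. This is a sketch under assumptions: `mergeRestartSet` is a hypothetical helper, and the restart set is assumed to be a plain slice of service names.

```go
package main

// mergeRestartSet appends extra service names to an existing restart set,
// dropping duplicates while preserving first-seen order, so each service
// ends up being restarted at most once.
func mergeRestartSet(existing []string, extra ...string) []string {
	seen := make(map[string]bool, len(existing)+len(extra))
	merged := make([]string, 0, len(existing)+len(extra))
	for _, list := range [][]string{existing, extra} {
		for _, svc := range list {
			if !seen[svc] {
				seen[svc] = true
				merged = append(merged, svc)
			}
		}
	}
	return merged
}
```

For example, if a config change already queued postfix for restart and `dkimCopied` is true, merging in rspamd and postfix yields a set containing postfix and rspamd once each.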

CI re-running on the new commit.

@oakshape oakshape merged commit 2348942 into main Jun 13, 2026
7 checks passed
@oakshape oakshape deleted the fix/dkim-selfheal-migration-v0.1.28 branch June 13, 2026 10:43