fix(orchestrator): DKIM volume→host migration runs on every self-heal pass + busybox-safe copy by oakshape · Pull Request #73 · Veltara-Works/vectis

oakshape · 2026-06-13T10:16:33Z

Summary

The v0.1.27 DKIM named-volume → host-bind migration (MigrateLegacyDKIMVolume) could be skipped entirely on upgrade, stranding volume-only DKIM keys and breaking outbound signing for affected domains. Found by canarying v0.1.27 on the mx1 portfolio box, where vectiscloud.com (a volume-only key) lost its signature after the upgrade.

Two distinct bugs, both fixed here:

1. Migration gated behind drift it never sees on upgrade

The migration call sat after self-heal's if !composeRewritten && !configsRewritten { return } early-return. With render-from-target upgrades (v0.1.25+), apply Phase 3.5 rewrites the compose to the host-mount form and recreates rspamd/postfix before the new orchestrator boots. So when the new orchestrator's self-heal runs, it sees no drift, returns early, and never reaches the migration — confirmed by the new orchestrator logging zero "dkim migration" lines on the mx1 upgrade.

Fix: move the migration ahead of the early-return so it runs on every self-heal pass (idempotent, no-op once the legacy volume is gone). In the no-drift path, if it copied stranded keys, restart rspamd + postfix (already recreated onto the host mount by Phase 3.5) so they pick the keys up.

2. `cp -an /from/. /to/` copies nothing under busybox

The migration image is alpine:3.24 (busybox). cp -an /from/. /to/ silently copies nothing once /to already holds any entry — busybox bails at the existing top-level dir instead of merging per-file like GNU cp. Replaced with a per-domain, no-clobber loop that skips existing host keys, copies the rest, and echoes each copied domain so the Go side knows whether a signing-service restart is needed.

MigrateLegacyDKIMVolume now returns (copied bool, err error).

Blast radius

Fresh installs: unaffected — they render the host-mount compose and write DKIM keys to the host path from the start; the legacy volume never exists.
At risk: upgrades from ≤v0.1.26 where keys live in the vectis_dkim-keys volume. On such a box the post-upgrade host mount would be empty/partial and signing would break for every volume-only domain.

Testing

gofmt/go vet/go build/go test ./internal/orchestrator/... green on sa1001 (Go 1.26.4).
Live busybox reproduction on sa1001: old cp -an /from/. /to/ left beta.com uncopied (rc=0); new loop copied it, left an existing alpha.com host key unclobbered, was idempotent on re-run, and emitted the copied: marker.
mx1 (the canary) is already on v0.1.27 and was remediated by hand (keys restored, d=vectiscloud.com; s=202606 signature verified on a live message). This PR makes the path correct for the next upgrade (prod).

Rollout

The stable release channel was rolled back to v0.1.26 while v0.1.27 carried this bug. Ship this as v0.1.28; prod (held on v0.1.26, keys volume-resident) is the box this protects.

… pass + busybox-safe copy The v0.1.27 DKIM named-volume -> host-bind migration (MigrateLegacyDKIMVolume) could be skipped entirely on upgrade, stranding volume-only DKIM keys and breaking outbound signing for affected domains. Two bugs, both caught by the v0.1.27 mx1 canary (vectiscloud.com lost signing): 1. Gating: the migration call sat AFTER self-heal's 'if !composeRewritten && !configsRewritten { return }' early-return. With render-from-target upgrades (v0.1.25+), apply Phase 3.5 rewrites the compose to the host-mount form and recreates rspamd/postfix BEFORE the new orchestrator boots, so self-heal sees no drift, returns early, and never runs the migration. Move it ahead of the early-return so it runs on every pass; when it copies stranded keys in the no-drift path, restart rspamd + postfix (already recreated onto the host mount by Phase 3.5) to pick them up. 2. Copy command: 'cp -an /from/. /to/' silently copies nothing under busybox (the alpine migration image) once /to holds any entry. Replace with a per-domain no-clobber loop that skips existing host keys and copies the rest, reporting whether anything moved. Fresh installs are unaffected (they render host-mount + write keys to host from the start). Only upgrades from <=v0.1.26 with volume-resident keys were at risk.

Copilot

Pull request overview

Fixes a regression in the orchestrator self-heal path where the DKIM named-volume → host-bind migration could be skipped on upgrade, leaving DKIM private keys stranded in the legacy volume and causing outbound signing failures.

Changes:

Runs the DKIM migration before the self-heal no-op early return so it executes on every self-heal pass.
Updates MigrateLegacyDKIMVolume to return whether any keys were actually copied, enabling conditional restarts.
Replaces the busybox-incompatible cp -an /from/. /to/ with a custom copy loop.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
internal/orchestrator/docker.go	Updates DKIM migration container copy logic and returns a `copied` flag.
internal/orchestrator/compose_selfheal.go	Ensures DKIM migration runs even when no drift is detected; conditionally restarts signing services.

+		"sh", "-c",
+		`set -e; for d in /from/*/; do [ -e "$d" ] || continue; n=$(basename "$d"); if [ ! -e "/to/$n" ]; then cp -a "/from/$n" "/to/$n" && echo "copied:$n"; fi; done`,


 	if !composeRewritten && !configsRewritten {
+		// Apply Phase 3.5 already recreated the signing services onto the host
+		// mount. If we just copied stranded keys into it, those running services
+		// don't see them yet — restart rspamd + postfix to close the unsigned
+		// window. When nothing was copied this whole branch is a pure no-op pass.
+		if dkimCopied {


- Copy per key-file (not per domain dir): a domain whose host dir already exists but is missing a newer/rotated selector key (created via host CLI, then DKIM-rotated by the pre-migration API into the legacy volume) now gets that key copied. mkdir -p the dest domain dir; still never clobbers an existing host key. Verified on sa1001 busybox incl the rotated-selector case. - Restart rspamd+postfix whenever keys were copied, in the drift path too: ApplyComposeServices can no-op them (image+compose-config unchanged), so they would not pick up freshly-copied keys without an explicit restart. Merge into the existing config-changed restart set (deduped) so a service restarts once.

oakshape · 2026-06-13T10:25:41Z

Both Copilot review points addressed in 851d5ed:

Per key-file copy (docker.go) — the migration now copies /from/*/* with mkdir -p on the dest domain dir, so a domain whose host dir already exists but is missing a newer/rotated selector key still gets it. Verified on sa1001 busybox: an existing alpha.com/ received its volume-only 202607.key while 202606.key was left untouched; new domains and idempotency also confirmed.
Restart in drift path (compose_selfheal.go) — rspamd+postfix now restart whenever dkimCopied is true, merged (deduped) into the post-ApplyComposeServices restart set, since that call can no-op them when image+compose-config are unchanged.

CI re-running on the new commit.

Copilot AI review requested due to automatic review settings June 13, 2026 10:16

Copilot started reviewing on behalf of oakshape June 13, 2026 10:16 View session

Copilot AI reviewed Jun 13, 2026

View reviewed changes

oakshape merged commit 2348942 into main Jun 13, 2026
7 checks passed

oakshape deleted the fix/dkim-selfheal-migration-v0.1.28 branch June 13, 2026 10:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(orchestrator): DKIM volume→host migration runs on every self-heal pass + busybox-safe copy#73

fix(orchestrator): DKIM volume→host migration runs on every self-heal pass + busybox-safe copy#73
oakshape merged 2 commits into
mainfrom
fix/dkim-selfheal-migration-v0.1.28

oakshape commented Jun 13, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

oakshape commented Jun 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		"sh", "-c",
		`set -e; for d in /from/*/; do [ -e "$d" ] \|\| continue; n=$(basename "$d"); if [ ! -e "/to/$n" ]; then cp -a "/from/$n" "/to/$n" && echo "copied:$n"; fi; done`,

Conversation

oakshape commented Jun 13, 2026

Summary

1. Migration gated behind drift it never sees on upgrade

2. cp -an /from/. /to/ copies nothing under busybox

Blast radius

Testing

Rollout

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

oakshape commented Jun 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

2. `cp -an /from/. /to/` copies nothing under busybox