Fix old-primary recovery lifecycle #74
Merged
colinmollenhour merged 1 commit into main on May 7, 2026
Pull request overview
Adds a CR-visible old-primary recovery lifecycle (RecoveryInProgress) so that no-divergence recovery persists across operator restarts and only clears once replication is confirmed healthy, ensuring status.sites[].replicating / gtidExecuted enrichment resumes reliably after recovery.
Changes:
- Introduces and persists `RecoveryInProgress` alongside `RecoveryBlocked`, with a stabilization/retry window and restart rehydration.
- Defers clearing recovery state until MySQL reports healthy replication, preventing post-recovery status enrichment from stalling.
- Updates component/unit tests, CRD/API schema, and docs to reflect the new recovery lifecycle and GTID/current-state-driven fail-back behavior.
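
The lifecycle in the list above can be sketched as a small state transition. This is a minimal illustration, not the code in `internal/controller/topology.go`: the helper name `NextRecoveryState`, its parameters, the empty "none" state, and the choice to leave `RecoveryBlocked` untouched are all assumptions made for the example.

```go
package main

import "time"

// RecoveryState holds the recovery states named in this PR. The empty
// "none" value is an assumption of this sketch.
type RecoveryState string

const (
	RecoveryNone       RecoveryState = ""
	RecoveryInProgress RecoveryState = "RecoveryInProgress"
	RecoveryBlocked    RecoveryState = "RecoveryBlocked"
)

// NextRecoveryState is a hypothetical helper, not the operator's actual code.
// It returns the state to persist and whether recovery should be retried now.
func NextRecoveryState(state RecoveryState, replicationHealthy bool, elapsed, window time.Duration) (RecoveryState, bool) {
	if state != RecoveryInProgress {
		// None and Blocked are left untouched here (assumption).
		return state, false
	}
	if replicationHealthy {
		// Clear only once MySQL reports healthy replication.
		return RecoveryNone, false
	}
	if elapsed < window {
		// Inside the stabilization window: keep the state, suppress retries.
		return RecoveryInProgress, false
	}
	// Window elapsed without healthy replication: keep the state and retry.
	return RecoveryInProgress, true
}
```

Because the state is only cleared by the healthy-replication branch, a restart that rehydrates `RecoveryInProgress` from the CR would re-enter the same function and continue where it left off, which is the durability property the PR describes.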
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `WISHLIST.md` | Marks related wishlist items done and summarizes the new recovery lifecycle contract. |
| `test/component/recovery_test.go` | Extends recovery test to model replication becoming healthy before clearing recovery state. |
| `playground/chaos-scenarios.md` | Documents GTID/current-state-driven fail-back and updates chaos hypotheses for recovery durability. |
| `internal/controller/topology.go` | Implements `RecoveryInProgress`, stabilization delay, and delayed clearing until replication is healthy. |
| `internal/controller/topology_test.go` | Adds unit coverage for in-progress persistence, suppression window, and restart retry behavior. |
| `internal/controller/runner.go` | Rehydrates recovery state on restart and reports `RecoveryPending` condition for in-progress vs blocked. |
| `docs/docs/operations.mdx` | Updates operational guidance for both in-progress recovery and divergent recovery. |
| `docs/docs/monitoring.mdx` | Splits `RecoveryPending=True` guidance by reason (`RecoveryInProgress` vs `DivergentTransactions`). |
| `docs/docs/failover.mdx` | Documents the concrete recovery sequence and the in-progress lifecycle visibility. |
| `docs/docs/crd-reference.mdx` | Updates CRD reference for `recoveryState` and `RecoveryPending` semantics. |
| `config/crd/bases/shipstream.io_mysqlfailovergroups.yaml` | Extends CRD enum/docs to include `RecoveryInProgress`. |
| `charts/bloodraven/crds/shipstream.io_mysqlfailovergroups.yaml` | Mirrors CRD enum/docs update for Helm distribution. |
| `api/v1alpha1/types.go` | Updates kubebuilder validation enum and field docs for `RecoveryInProgress`. |
| `AGENTS.md` | Updates documented operator behaviors during testing (rehydration + GTID/current-state-driven fail-back). |
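
The table mentions that `RecoveryPending` is reported with a different reason for in-progress and blocked recovery. A rough sketch of that mapping follows. The `Condition` struct is a local stand-in rather than `metav1.Condition`, the function name and the `NoRecoveryPending` reason are hypothetical, and pairing `RecoveryBlocked` with the `DivergentTransactions` reason is inferred from the file descriptions rather than confirmed by the diff.

```go
package main

// Condition is a simplified stand-in for a Kubernetes status condition.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// RecoveryPendingCondition is a hypothetical helper that derives the
// RecoveryPending condition from the persisted recoveryState value.
func RecoveryPendingCondition(recoveryState string) Condition {
	// Default: nothing pending. The reason string here is an assumption.
	c := Condition{Type: "RecoveryPending", Status: "False", Reason: "NoRecoveryPending"}
	switch recoveryState {
	case "RecoveryInProgress":
		// No-divergence recovery is running or awaiting healthy replication.
		c.Status, c.Reason = "True", "RecoveryInProgress"
	case "RecoveryBlocked":
		// Assumed mapping: blocked recovery is caused by divergent transactions.
		c.Status, c.Reason = "True", "DivergentTransactions"
	}
	return c
}
```

Keeping the reason distinct lets alerting treat `RecoveryInProgress` as transient and `DivergentTransactions` as requiring operator action, which matches the split described for `docs/docs/monitoring.mdx`.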
Summary
- `RecoveryInProgress` lifecycle for no-divergence old-primary recovery, including restart rehydration and retry/stabilization handling.
- `replicating`/`gtidExecuted` enrichment becomes visible after recovery.

Test plan
- `make generate && make manifests && cp config/crd/bases/shipstream.io_mysqlfailovergroups.yaml charts/bloodraven/crds/shipstream.io_mysqlfailovergroups.yaml`: passed
- `make vet`: passed
- `PATH="/home/colin/go/bin:$PATH" make lint`: passed
- `make test`: passed
- `go install sigs.k8s.io/controller-runtime/tools/setup-envtest@latest && KUBEBUILDER_ASSETS=$(/home/colin/go/bin/setup-envtest use --bin-dir /tmp/envtest-bin -p path) make test-envtest`: passed
- `cd docs && npm ci && npm run build`: passed
- `git diff --check`: passed

Megamind artifacts
- `.tmp/uber-code-wishlist-36-38/plans/final.md`
- `.tmp/uber-code-wishlist-36-38/reviews/validated-findings.md`
- `.tmp/uber-code-wishlist-36-38/reviews/fixed-review.md`
- `.tmp/uber-code-wishlist-36-38/final/local-gates.md`