Fix old-primary recovery lifecycle #74

Merged
colinmollenhour merged 1 commit into main from uber-code/wishlist-36-38 on May 7, 2026

Conversation

@colinmollenhour
Collaborator

AI Megamind - By: OpenCode / openai/gpt-5.5

Summary

  • Adds a CR-visible RecoveryInProgress lifecycle for no-divergence old-primary recovery, including restart rehydration and retry/stabilization handling.
  • Clears recovery state only after healthy replication is observed so CR replicating/gtidExecuted enrichment becomes visible after recovery.
  • Documents GTID-freshest/current-state-driven fail-back behavior and updates CRD/docs/WISHLIST state.
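
The lifecycle described in the summary can be sketched as a small state tracker. This is a minimal illustration, not the code in internal/controller/topology.go: the `Tracker` type, `Rehydrate`, and `Observe` are hypothetical names, and the only state names taken from this PR are RecoveryInProgress and RecoveryBlocked. It assumes the persisted CR state is the source of truth after a restart and that only a healthy-replication observation clears an in-progress recovery.

```go
package main

import "fmt"

// RecoveryState mirrors the CR-visible recovery state. The string values for
// RecoveryInProgress and RecoveryBlocked come from the PR text; RecoveryNone
// (empty string) is an assumption made for this sketch.
type RecoveryState string

const (
	RecoveryNone       RecoveryState = ""
	RecoveryInProgress RecoveryState = "RecoveryInProgress"
	RecoveryBlocked    RecoveryState = "RecoveryBlocked"
)

// Tracker is a hypothetical in-memory holder for the recovery state.
type Tracker struct {
	state RecoveryState
}

// Rehydrate rebuilds the tracker from the state persisted on the CR, so a
// restarted operator resumes where the previous process left off.
func Rehydrate(persisted RecoveryState) *Tracker {
	return &Tracker{state: persisted}
}

// Observe records one replication health check and returns the resulting
// state. An in-progress recovery is cleared only once replication is healthy.
// A blocked recovery is left untouched here (assumed to need manual action).
func (tr *Tracker) Observe(replicationHealthy bool) RecoveryState {
	if tr.state == RecoveryInProgress && replicationHealthy {
		tr.state = RecoveryNone
	}
	return tr.state
}

func main() {
	// Simulate an operator restart mid-recovery followed by three checks.
	tr := Rehydrate(RecoveryInProgress)
	for _, healthy := range []bool{false, false, true} {
		fmt.Printf("healthy=%t state=%q\n", healthy, tr.Observe(healthy))
	}
}
```

Keeping the clear step tied to an observed healthy check, rather than to the moment the recovery commands are issued, is what lets replicating/gtidExecuted enrichment resume only after the replica is actually caught up and running.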

Test plan

  • make generate && make manifests && cp config/crd/bases/shipstream.io_mysqlfailovergroups.yaml charts/bloodraven/crds/shipstream.io_mysqlfailovergroups.yaml — passed
  • make vet — passed
  • PATH="/home/colin/go/bin:$PATH" make lint — passed
  • make test — passed
  • go install sigs.k8s.io/controller-runtime/tools/setup-envtest@latest && KUBEBUILDER_ASSETS=$(/home/colin/go/bin/setup-envtest use --bin-dir /tmp/envtest-bin -p path) make test-envtest — passed
  • cd docs && npm ci && npm run build — passed
  • git diff --check — passed

Megamind artifacts

  • Plan: .tmp/uber-code-wishlist-36-38/plans/final.md
  • Validated findings: .tmp/uber-code-wishlist-36-38/reviews/validated-findings.md
  • Fixed review: .tmp/uber-code-wishlist-36-38/reviews/fixed-review.md
  • Local gates: .tmp/uber-code-wishlist-36-38/final/local-gates.md

Copilot AI review requested due to automatic review settings May 7, 2026 08:15
Contributor

Copilot AI left a comment

Pull request overview

Adds a CR-visible old-primary recovery lifecycle (RecoveryInProgress) so that no-divergence recovery persists across operator restarts and only clears once replication is confirmed healthy, ensuring status.sites[].replicating / gtidExecuted enrichment resumes reliably after recovery.

Changes:

  • Introduces and persists RecoveryInProgress alongside RecoveryBlocked, with a stabilization/retry window and restart rehydration.
  • Defers clearing recovery state until MySQL reports healthy replication, preventing post-recovery status enrichment from stalling.
  • Updates component/unit tests, CRD/API schema, and docs to reflect the new recovery lifecycle and GTID/current-state-driven fail-back behavior.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| WISHLIST.md | Marks related wishlist items done and summarizes the new recovery lifecycle contract. |
| test/component/recovery_test.go | Extends recovery test to model replication becoming healthy before clearing recovery state. |
| playground/chaos-scenarios.md | Documents GTID/current-state-driven fail-back and updates chaos hypotheses for recovery durability. |
| internal/controller/topology.go | Implements RecoveryInProgress, stabilization delay, and delayed clearing until replication is healthy. |
| internal/controller/topology_test.go | Adds unit coverage for in-progress persistence, suppression window, and restart retry behavior. |
| internal/controller/runner.go | Rehydrates recovery state on restart and reports RecoveryPending condition for in-progress vs blocked. |
| docs/docs/operations.mdx | Updates operational guidance for both in-progress recovery and divergent recovery. |
| docs/docs/monitoring.mdx | Splits RecoveryPending=True guidance by reason (RecoveryInProgress vs DivergentTransactions). |
| docs/docs/failover.mdx | Documents the concrete recovery sequence and the in-progress lifecycle visibility. |
| docs/docs/crd-reference.mdx | Updates CRD reference for recoveryState and RecoveryPending semantics. |
| config/crd/bases/shipstream.io_mysqlfailovergroups.yaml | Extends CRD enum/docs to include RecoveryInProgress. |
| charts/bloodraven/crds/shipstream.io_mysqlfailovergroups.yaml | Mirrors CRD enum/docs update for Helm distribution. |
| api/v1alpha1/types.go | Updates kubebuilder validation enum and field docs for RecoveryInProgress. |
| AGENTS.md | Updates documented operator behaviors during testing (rehydration + GTID/current-state-driven fail-back). |
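
The runner.go and monitoring.mdx entries above describe a RecoveryPending condition whose reason distinguishes in-progress from blocked recovery. A rough sketch of that mapping follows. The `Condition` struct and `RecoveryPendingCondition` function are hypothetical, the pairing of RecoveryBlocked with the DivergentTransactions reason is inferred from the file summaries rather than confirmed by the diff, and the `NoRecoveryPending` reason is invented purely for illustration.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a Kubernetes status
// condition; the real operator may use a different type.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// RecoveryPendingCondition maps a recovery state string to a RecoveryPending
// condition. The reasons RecoveryInProgress and DivergentTransactions appear
// in the PR text; everything else here is an assumption.
func RecoveryPendingCondition(state string) Condition {
	switch state {
	case "RecoveryInProgress":
		return Condition{Type: "RecoveryPending", Status: "True", Reason: "RecoveryInProgress"}
	case "RecoveryBlocked":
		// Assumed: blocked recovery is reported as divergent transactions.
		return Condition{Type: "RecoveryPending", Status: "True", Reason: "DivergentTransactions"}
	default:
		// Assumed reason name for the idle case.
		return Condition{Type: "RecoveryPending", Status: "False", Reason: "NoRecoveryPending"}
	}
}

func main() {
	for _, s := range []string{"RecoveryInProgress", "RecoveryBlocked", ""} {
		fmt.Printf("%+v\n", RecoveryPendingCondition(s))
	}
}
```

Alerting that keys on the reason instead of only on RecoveryPending=True could then treat RecoveryInProgress as transient and DivergentTransactions as requiring attention, which matches the split guidance noted for monitoring.mdx.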

@colinmollenhour colinmollenhour merged commit 5ba7978 into main May 7, 2026
17 checks passed