feat: Deployment MinReadySeconds progressive delivery strategy#343
feat: Deployment MinReadySeconds progressive delivery strategy#343Xio-Shark wants to merge 22 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #343 +/- ##
==========================================
+ Coverage 51.38% 52.98% +1.60%
==========================================
Files 66 79 +13
Lines 8559 9693 +1134
==========================================
+ Hits 4398 5136 +738
- Misses 3575 3857 +282
- Partials 586 700 +114
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
| release, release.GetObjectKind().GroupVersionKind())) | ||
| inflateDeploymentStrategy(modified) | ||
| patch := client.MergeFromWithOptions(original, client.MergeFromWithOptimisticLock{}) | ||
| return mc.client.Patch(context.TODO(), modified, patch) |
There was a problem hiding this comment.
L78, L106, L133, L269 — Every mc.client.Patch(context.TODO(), ...) in Initialize, UpgradeBatch, Finalize, and ensureInflatedDeploymentStrategy uses context.TODO() instead of the reconcile context, consider refactor the code to accept the context parameter
| rc.recordMinReadyDegraded("MinReadyInitializeFailed", err) | ||
| return err | ||
| } | ||
| rc.recordMinReadyNormal(v1beta1.RolloutConditionMinReadyInitialized, "MinReadyInitialized", "MinReadySeconds strategy initialized") |
There was a problem hiding this comment.
recordMinReadyNormal/recordMinReadyDegraded fire for ALL partition-style releases, not just MinReady. While isMinReadyRelease() guards inside these methods, the check adds unnecessary overhead and recognition burden, consider moving the recording inside Initialize/UpgradeBatch/Finalize method, and remove other calling of recordMinReadyNormal/recordMinReadyDegraded by simply logging some informations
| // maxUnavailable above the batch target is a legal state after a | ||
| // scale-down (HPA or manual) and also self-heals external tampering; | ||
| // converge it back to the target instead of reporting degraded drift. | ||
| klog.Warningf("MinReadyControl.UpgradeBatch[%d]: deployment %v maxUnavailable=%d exceeds target=%d, reducing it to the target", |
There was a problem hiding this comment.
Use klog.WarningS with structured key-value pairs: klog.WarningS(nil, "MinReady maxUnavailable exceeds target, reducing", "batch", ctx.CurrentBatch, "deployment", klog.KObj(mc.object), "maxUnavailable", current, "target", target). Note: the existing control_plane.go uses klog.Infof/klog.Warningf too, so this is a pre-existing pattern, but new code should aim higher.
| // "Canary and BlueGreen cannot both be set"). With BlueGreen==nil guaranteed, | ||
| // Canary!=nil && !EnableExtraWorkloadForCanary is equivalent to the executor's | ||
| // GetRollingStyle()==Partition routing, so both sides agree on MinReady. | ||
| func shouldSkipRecreateMutationForMinReady(rollout *appsv1beta1.Rollout) bool { |
There was a problem hiding this comment.
When the gate is OFF and a MinReady rollout is in progress, the executor correctly routes to MinReady (via annotation fallback), but the webhook's shouldSkipRecreateMutationForMinReady only checks the feature gate
consider also checking with DeploymentStrategyAnnotation symmetric with the executor:
| if err := writeOriginalAnnotations(snapshot, deployment); err != nil { | ||
| return err | ||
| } | ||
| if hasAnyOriginalAnnotation(snapshot.Annotations) { |
There was a problem hiding this comment.
writeOriginalAnnotations already check for hasAnyOriginalAnnotation, consider extract the hasAnyOriginalAnnotation logic out of writeOriginalAnnotations
| return | ||
| } | ||
| maxSurge := intstr.FromInt(1) | ||
| deployment.Spec.Strategy.RollingUpdate.MaxSurge = &maxSurge |
There was a problem hiding this comment.
per discussion in the proposal, it is not necessary to hack the maxSurge
| if ready == nil || ready.Status != corev1.ConditionTrue { | ||
| return false | ||
| } | ||
| return ready.LastTransitionTime.Add(time.Duration(minReadySeconds)*time.Second).Before(now) || |
There was a problem hiding this comment.
Before(now) || Equal(now) computes the same time.Time addition twice. Consider simplify to !After(now)
| klog.Warningf("MinReadyControl.UpgradeBatch[%d]: deployment %v maxUnavailable=%d exceeds target=%d, reducing it to the target", | ||
| ctx.CurrentBatch, klog.KObj(mc.object), current, target) | ||
| } | ||
| original := mc.object.DeepCopy() |
There was a problem hiding this comment.
it is not necessary to deepcopy the original object.
| } | ||
| return nil | ||
| } | ||
| original := mc.object.DeepCopy() |
There was a problem hiding this comment.
it is not necessary to deepcopy the original object.
| if validateInflatedDeploymentStrategy(mc.object) == nil { | ||
| return nil | ||
| } | ||
| original := mc.object.DeepCopy() |
There was a problem hiding this comment.
it is not necessary to deepcopy the original object.
| return nil | ||
| } | ||
|
|
||
| func ensureAllOriginalAnnotations(annotations map[string]string) error { |
There was a problem hiding this comment.
is this func really necessary? parseOriginalInt32 will return error anyway if required key does not exist
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
CalculateBatchContext now counts only updated-revision Ready pods, with a matching-ReplicaSet readyReplicas fallback, instead of Deployment status.readyReplicas which also counted old-revision Ready pods and could mark a batch ready before the new pods were actually ready. List owned pods when RolloutID is set, and require non-empty pods once a batch label is expected. The executor falls back to the Recreate controller when the MinReadySecondsStrategy feature gate is disabled, matching the webhook skip. Add negative unit and integration coverage for these cases. Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Initialize now writes BatchReleaseControlAnnotation to mark the Deployment as controlled by a specific BatchRelease, consistent with the CloneSet batch release pattern. Finalize now cleans up both BatchReleaseControlAnnotation and DeploymentStableRevisionLabel to ensure the Deployment is fully released after rollout completes. Also removes premature user-facing docs (quickstart, migration guide, runbook) that will be added in a follow-up after the feature stabilizes. Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
MinReady (P0/P1/P2 from code review): - P0-2: webhook now enforces RollingUpdate/paused=false/non-nil rollingUpdate invariants for active MinReady rollouts; controller treats paused drift as drift so ensureInflatedDeploymentStrategy self-heals the freeze. - P1-1: introduce sentinel errors and classify degraded reasons via errors.Is instead of matching human-readable message text. - P1-2: UpgradeBatch converges maxUnavailable back to the batch target on scale-down instead of falsely reporting GitOps drift. - P1-4: lift MinReady annotation constants to api/v1beta1; executor keeps routing to MinReady controller (and status keeps reporting) when the feature gate is disabled mid-rollout but the Deployment still carries annotations. - P1-6: EnrollMinReadyDeployment inflates strategy synchronously at admission, closing the window where the native controller could observe the original budget before Initialize lands. - P0-1: add batchLabelSatisfied regression matrix (shared hot path). - P2-3: document the BlueGreen+Canary mutual-exclusion invariant source. Repo audit: - Restrict webhook cert dir to 0700 and private keys to 0600 (certs 0644). - parse_utils: use NestedString to avoid panic on malformed status; surface json marshal/unmarshal errors instead of silently swallowing them. - Align Dockerfile_multiarch Go builder to golang:1.20.14-alpine3.19. - Update proposal: webhook/executor behavior and add operator runbook (feature-gate lifecycle, disable preconditions, controller-death recovery). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
1bfd8cc to
b850c17
Compare
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
b850c17 to
db836f0
Compare
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
| @@ -0,0 +1,140 @@ | |||
| /* | |||
There was a problem hiding this comment.
consider organizing related tests files in a sub-packages
There was a problem hiding this comment.
Kept these tests in the existing e2e package for this PR. test/e2e currently uses a shared suite/client/helper setup and all existing workflows execute the flat test/e2e package; moving MinReady into a sub-package would require a broader e2e harness refactor. I think that is better handled separately from this feature fix.
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
135226e to
590470c
Compare
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Use a 4-step rollout (60% keeps target=3) so TC7 resume no longer jumps past the convergence assertion. Move MinReady E2E into test/e2e/minready, update CI workflows, and emit structured warning logs for maxUnavailable convergence. Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com>
Move MinReady status/metrics recording into MinReadyControl lifecycle methods, simplify control_plane to generic logging for non-MinReady paths, add structured warningS logging, and consolidate original annotation prepare/enroll helpers per review feedback. Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com>
Converge external maxUnavailable tampering during EnsureBatchPodsReadyAndLabeled so TC7 does not depend on UpgradeBatch running only after rollout resume. Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com>
2ab1c14 to
8a522e8
Compare
Wait for step 2 pause before asserting post-resume maxUnavailable in TC7 so UpgradeBatch has applied the 60% batch target. Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com>
8a522e8 to
9485d09
Compare
| TerminatingReasonInTerminating = "InTerminating" | ||
| TerminatingReasonCompleted = "Completed" | ||
|
|
||
| // MinReadyInitialized indicates MinReadySeconds strategy initialization has completed. |
There was a problem hiding this comment.
is it possible to reuse existing rollout condition such as RolloutConditionProgressing, RolloutConditionSucceeded and RolloutConditionTerminating
Address openkruise#343 review comments: P0-3: implement sliding window in reconcileMaxUnavailable so a large batch target (e.g. 99 after a 1-pod canary) no longer writes maxUnavailable in a single patch. Advance by the user's original maxUnavailable step, waiting for the current window pods to become available (UpdatedReadyReplicas >= current) before widening the budget. maxUnavailable=0 (surge-only) falls back to driving the batch directly. P1-7: converge condition/event/metrics reporting to the outer control_plane trunk via MinReadyLifecycle; lower-layer methods now only return errors instead of recording conditions themselves. P1-8: replace DeletionTimestamp check with util.IsPodActive in countUpdatedAvailablePods, mirroring upstream kubernetes controller_utils.go. Add 3 sliding-window unit tests; all 8 affected packages pass. Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
…o 5) TestDeploymentMinReadyConcurrentScaleUsesLatestReplicas expected the full batch target (10) in one patch. After P0-3, UpgradeBatch advances maxUnavailable one sliding-window step at a time (25% of 20 replicas = 5). Update the assertion and add a comment explaining the sliding-window first-step semantics. Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
| func (mc *MinReadyControl) CalculateBatchContext(release *v1beta1.BatchRelease) (*batchcontext.BatchContext, error) { | ||
| rolloutID := release.Spec.ReleasePlan.RolloutID | ||
| if rolloutID != "" { | ||
| if _, err := mc.ListOwnedPods(); err != nil { |
There was a problem hiding this comment.
minReadyUpdatedReadyReplicas also listownedpods, consider list only once.
|
|
||
| return batchContext.IsBatchReady() | ||
| if reconciler, ok := controller.(MinReadyDriftReconciler); ok { | ||
| if err := reconciler.ReconcileMaxUnavailableDrift(rc.ctx, batchContext); err != nil { |
There was a problem hiding this comment.
ReconcileMaxUnavailableDrift duplicates with UpgradeBatch for MinReadyControl, is it necessary to call ReconcileMaxUnavailableDrift in VerifyingBatchState phase ?
| return mc.patchMaxUnavailable(ctx, int(target)) | ||
| } | ||
| if batchContext.UpdatedReadyReplicas < int32(current) { | ||
| // current window not yet filled; keep the budget and wait for readiness |
There was a problem hiding this comment.
we don't have to wait for the whole window to be updated and ready
| } | ||
| return false | ||
| } | ||
|
|
There was a problem hiding this comment.
there is a pre existing bug in SetDefaultDeploymentStrategy, consider revise it
if strategy.RollingUpdate.MaxSurge == nil {
maxSurge := intstr.FromString("25%")
strategy.RollingUpdate.MaxUnavailable = &maxSurge // BUG: should be MaxSurge
}
|
|
||
| func serializeOriginalInt32(value *int32) string { | ||
| if value == nil { | ||
| return strconv.FormatInt(int64(DefaultProgressDeadlineSeconds), 10) |
There was a problem hiding this comment.
this func is used for both progressdeadlineseconds and MinReadySeconds, so assign the DefaultProgressDeadlineSeconds to MinReadySeconds is wrong (kubernetes default is 0 for min readyseconds), consider pass the default value to the func instead
|
|
||
| appsv1alpha1 "github.com/openkruise/rollouts/api/v1alpha1" | ||
| appsv1beta1 "github.com/openkruise/rollouts/api/v1beta1" | ||
| partitiondeployment "github.com/openkruise/rollouts/pkg/controller/batchrelease/control/partitionstyle/deployment" |
There was a problem hiding this comment.
The webhook now imports partitionstyle/deployment for EnrollMinReadyDeployment/Inflated* constants. There is no import cycle (the controller package does not import webhook), but it inverts the usual layering (webhook depending on controller internals).
Consider move EnrollMinReadyDeployment func and related UT directly to the webhook pacakge
|
|
||
| // warningS logs at warning severity with the same call shape as klog.InfoS/ErrorS. | ||
| // klog v2.120.1 does not expose WarningS, so MinReady uses this helper locally. | ||
| func warningS(err error, msg string, keysAndValues ...interface{}) { |
There was a problem hiding this comment.
The warningS helper formats structured key-value pairs into a flat string and calls klog.Warning(). This loses the structured logging benefits of klog.ErrorS/klog.InfoS (JSON output, log level filtering). klog.V(0).InfoS with err=... achieves similar Warning output with proper structure.
Consider using klog.ErrorS(err, msg, keysAndValues...) for true structured output (errors at error verbosity, warnings via klog.V(n).InfoS).
| err = controller.Initialize(rc.release) | ||
| err = controller.Initialize(rc.ctx, rc.release) | ||
| if err != nil { | ||
| rc.reportOperationFailed(controller, "MinReadyInitializeFailed", err) |
There was a problem hiding this comment.
is it possible to do the logging and metric calculation in the callee using defer, so that the report logic does not influence other control
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
dcbfebd to
f7168a0
Compare
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Summary
Implement a MinReadySeconds-based progressive delivery strategy for native Kubernetes Deployments, as described in proposal #341.
Instead of pausing the Deployment controller (Recreate style), this approach inflates
minReadySecondsandprogressDeadlineSecondsto take control of the rollout pace, then progressively adjustsmaxUnavailableto drive batch-by-batch updates — while the native Deployment controller remains active.Key Design Decisions
MinReadySecondsStrategyfeature gate (per review feedback)minReadySecondsand podReadyconditionLastTransitionTimeto count genuinely available replicas, not just Ready podsInitialize(),ensureInflatedDeploymentStrategy()before each batch, and webhookenforceMinReadyInflation()on every Deployment mutationChanges
Core Controller (
pkg/controller/batchrelease/control/partitionstyle/deployment/)minready_control.go— Initialize / UpgradeBatch / Finalize / CalculateBatchContextminready_batch_context.go— pod availability calculation with original minReadySecondsminready_constants.go— annotations, inflated values, constantsStatus & Metrics
minready_status.go— Rollout condition updates (MinReadyInitialized/Batching/Degraded/Finalized)metrics/minready_metrics.go— Prometheus metrics for batch progress and timingRouting & Webhook
batchrelease_executor.go— feature gate routing to MinReady controllerworkload_update_handler.go— skip Paused mutation, enforce inflation on external editsFeature Gate
MinReadySecondsStrategy(alpha, default off)Tests
Commits
feat: add deployment MinReady strategy— core implementationfix: correct MinReady batch-ready semantics and feature-gate fallback— fix pod counting and executor routingfix: align MinReady implementation with proposal review— address review feedbackfix: write BatchReleaseControlAnnotation and clean up finalize labels— proper controller ownership markingTest Plan
go test ./pkg/...— all unit tests passkubectl rollout statusRelated