fix: register sandbox rollback defer before workload create to plug placeholder leak#299
fix: register sandbox rollback defer before workload create to plug placeholder leak#299Sanchit2662 wants to merge 2 commits intovolcano-sh:mainfrom
Conversation
The placeholder written by StoreSandbox is only cleaned up via sandboxRollback, but the defer was registered AFTER the Sandbox/ SandboxClaim Create. Any Create error (apiserver throttling, ResourceQuota rejection, admission webhook timeout) returned with the placeholder still in Redis, polluting the GC candidate budget until the per-sandbox IdleTimeout window finally reaped it. Move the defer above the workload Create. sandboxRollback is idempotent (NotFound is swallowed; store delete is safe on missing keys), so registering before the workload exists is harmless on success and correct on every failure path. Follow-up to volcano-sh#258, which closed the same invariant gap on the 2-minute ready-wait timeout. Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Welcome @Sanchit2662! It looks like this is your first PR to volcano-sh/agentcube 🎉 |
There was a problem hiding this comment.
Pull request overview
Fixes a sandbox-creation rollback gap in the Workload Manager so the Redis/ValKey placeholder written at the start of createSandbox is reliably cleaned up even when the Kubernetes Sandbox/SandboxClaim Create call fails.
Changes:
- Move
sandboxRollbackdefer to immediately after successfully storing the sandbox placeholder (before any subsequent error-returning steps). - Update
TestServerCreateSandboxcases to expect rollback (including store placeholder deletion) on Sandbox/SandboxClaim create failures. - Extend the test fake store to count
DeleteSandboxBySessionIDcalls and patchdeleteSandboxClaimin tests to avoid nil-client panics on rollback paths.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/workloadmanager/handlers.go | Registers rollback earlier to prevent store placeholder leaks when create fails early. |
| pkg/workloadmanager/handlers_test.go | Updates unit tests/fakes to assert rollback + store cleanup and cover claim rollback path. |
Comments suppressed due to low confidence (1)
pkg/workloadmanager/handlers.go:171
- In the SandboxClaim create error path, the error is re-wrapped using
%v(string interpolation), which discards the original error forerrors.Is/Asand breaks%wwrapping conventions used elsewhere in this function. Prefer wrapping with%wso callers/logging can preserve the underlying error chain.
if err := createSandboxClaim(ctx, dynamicClient, sandboxClaim); err != nil {
err = api.NewInternalError(fmt.Errorf("create sandbox claim %s/%s failed: %v", sandboxClaim.Namespace, sandboxClaim.Name, err))
return nil, err
There was a problem hiding this comment.
Code Review
This pull request moves the registration of the sandbox rollback defer function to an earlier point in the createSandbox handler to ensure that store placeholders and Kubernetes resources are cleaned up immediately if subsequent steps fail. Additionally, the unit tests have been updated to verify that rollback mechanisms, including store and claim deletions, are correctly triggered during various failure scenarios. I have no feedback to provide.
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #299 +/- ##
==========================================
+ Coverage 43.37% 47.64% +4.26%
==========================================
Files 30 30
Lines 2610 2819 +209
==========================================
+ Hits 1132 1343 +211
+ Misses 1355 1336 -19
- Partials 123 140 +17
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| return nil, err | ||
| } | ||
|
|
||
| // Register rollback IMMEDIATELY after the placeholder is committed, before |
There was a problem hiding this comment.
Nit: can we trim this comment down a bit? The key point is that rollback must be registered right after the store placeholder exists, and that it is safe before the workload exists because deletes ignore NotFound. The longer GC/create-path explanation is already covered by the PR description.
Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
What type of PR is this?
/kind bug
What this PR does / why we need it
So I was reading through
createSandboxin handlers.go and noticed something that looked off. The function writes a placeholder to Redis at the very top (so the GC has something to clean up if creation goes wrong later), and the only thing that ever deletes that placeholder issandboxRollbackviaDeleteSandboxBySessionID. The problem is the rollbackdeferwas sitting below the actual Sandbox/SandboxClaim Create calls, which means if the Create itself failed, the function returned with the placeholder still in Redis and the defer never got registered.Here's roughly what it looked like before:
That leaked row hangs around in the
session:expiryandsession:last_activityZSets until the per-sandboxIdleTimeoutwindow passes (15 minutes by default). And the whole time it's there, it eats into the GC's 100-row candidate budget per cycle. So if you're hitting Create errors a lot (apiserver throttling, ResourceQuota rejecting things, slow admission webhook, that kind of stuff), the GC ends up busy chasing phantom rows while real idle sandboxes have to wait.The fix is just to move the defer up so it runs before the workload Create. I checked,
sandboxRollbackis already idempotent.deleteSandboxanddeleteSandboxClaimboth swallow NotFound, andDeleteSandboxBySessionIDis fine on a key that doesn't exist. So registering the defer earlier doesn't hurt anything if Create succeeds, and it actually does its job if Create fails.This is basically a follow-up to #258. That PR fixed the same kind of issue for the 2 minute ready-wait timeout, but the two error returns above the defer in the Create block were still uncovered. This closes those.
Notes for the reviewer
A couple of the existing test cases in
TestServerCreateSandboxwere actually asserting the old buggy behavior. Specificallysandbox creation failsandsandbox claim creation failswere checking that no rollback ran. I updated them to expect the rollback now (sodeleteSandbox/deleteSandboxClaimplusDeleteSandboxBySessionIDget called), and added the same store-delete check to the other rollback cases too.I also had to add a
deleteSandboxClaimpatch in the test setup, otherwise the claim-failure case panics because the rollback path calls into it with a nil dynamic client. The fakeStore now countsDeleteSandboxBySessionIDcalls so we can actually verify the placeholder gets cleaned up.The
store placeholder failscase still asserts zero rollback calls, which is correct. The defer is registered afterStoreSandboxon purpose, since if writing the placeholder itself failed there's nothing to roll back.Does this PR introduce a user-facing change?
No, no API or behavior changes for users. Folks running with Restricted NetworkPolicy or under apiserver pressure should just see
session:*cardinality stay flat and the GC finish its work faster when Creates are failing.