fix: register sandbox rollback defer before workload create to plug placeholder leak #299

Open
Sanchit2662 wants to merge 2 commits into volcano-sh:main from
Sanchit2662:fix/createsandbox-placeholder-leak

Conversation

@Sanchit2662
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it

So I was reading through createSandbox in handlers.go and noticed something that looked off. The function writes a placeholder to Redis at the very top (so the GC has something to clean up if creation goes wrong later), and the only thing that ever deletes that placeholder is sandboxRollback via DeleteSandboxBySessionID. The problem is that the rollback defer was sitting below the actual Sandbox/SandboxClaim Create calls. So if the Create itself failed, the function returned before the defer was ever registered, and the placeholder stayed in Redis.

Here's roughly what it looked like before:

sandboxStorePlaceHolder := buildSandboxPlaceHolder(sandbox, sandboxEntry)
if err := s.storeClient.StoreSandbox(ctx, sandboxStorePlaceHolder); err != nil {
    return nil, ...
}

if sandboxClaim != nil {
    if err := createSandboxClaim(ctx, dynamicClient, sandboxClaim); err != nil {
        return nil, ...   // placeholder leaked, defer wasn't set up yet
    }
} else {
    if _, err := createSandbox(ctx, dynamicClient, sandbox); err != nil {
        return nil, ...   // same thing here
    }
}

needRollbackSandbox := true
defer func() { ... s.sandboxRollback(...) }()

That leaked row hangs around in the session:expiry and session:last_activity ZSets until the per-sandbox IdleTimeout window passes (15 minutes by default). And the whole time it's there, it eats into the GC's 100-row candidate budget per cycle. So if you're hitting Create errors a lot (apiserver throttling, ResourceQuota rejecting things, slow admission webhook, that kind of stuff), the GC ends up busy chasing phantom rows while real idle sandboxes have to wait.

The fix is just to move the defer up so it is registered before the workload Create. I checked, and sandboxRollback is already idempotent: deleteSandbox and deleteSandboxClaim both swallow NotFound, and DeleteSandboxBySessionID is fine on a key that doesn't exist. So registering the defer earlier doesn't hurt anything if Create succeeds, and it actually does its job if Create fails.

This is basically a follow-up to #258. That PR fixed the same kind of issue for the 2 minute ready-wait timeout, but the two error returns above the defer in the Create block were still uncovered. This closes those.

Notes for the reviewer

A couple of the existing test cases in TestServerCreateSandbox were actually asserting the old buggy behavior. Specifically, the "sandbox creation fails" and "sandbox claim creation fails" cases were checking that no rollback ran. I updated them to expect the rollback now (so deleteSandbox / deleteSandboxClaim plus DeleteSandboxBySessionID get called), and added the same store-delete check to the other rollback cases too.

I also had to add a deleteSandboxClaim patch in the test setup, otherwise the claim-failure case panics because the rollback path calls into it with a nil dynamic client. The fakeStore now counts DeleteSandboxBySessionID calls so we can actually verify the placeholder gets cleaned up.

The "store placeholder fails" case still asserts zero rollback calls, which is correct. The defer is registered after StoreSandbox on purpose, since if writing the placeholder itself failed there's nothing to roll back.

Does this PR introduce a user-facing change?

No, no API or behavior changes for users. Folks running with Restricted NetworkPolicy or under apiserver pressure should just see session:* cardinality stay flat and the GC finish its work faster when Creates are failing.

NONE

The placeholder written by StoreSandbox is only cleaned up via
sandboxRollback, but the defer was registered AFTER the Sandbox/
SandboxClaim Create. Any Create error (apiserver throttling,
ResourceQuota rejection, admission webhook timeout) returned with
the placeholder still in Redis, polluting the GC candidate budget
until the per-sandbox IdleTimeout window finally reaped it.

Move the defer above the workload Create. sandboxRollback is
idempotent (NotFound is swallowed; store delete is safe on missing
keys), so registering before the workload exists is harmless on
success and correct on every failure path.

Follow-up to volcano-sh#258, which closed the same invariant gap on the
2-minute ready-wait timeout.

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
Copilot AI review requested due to automatic review settings April 27, 2026 16:41
@volcano-sh-bot volcano-sh-bot added the kind/bug Something isn't working label Apr 27, 2026
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot
Contributor

Welcome @Sanchit2662! It looks like this is your first PR to volcano-sh/agentcube 🎉

Contributor

Copilot AI left a comment

Pull request overview

Fixes a sandbox-creation rollback gap in the Workload Manager so the Redis/ValKey placeholder written at the start of createSandbox is reliably cleaned up even when the Kubernetes Sandbox/SandboxClaim Create call fails.

Changes:

  • Move sandboxRollback defer to immediately after successfully storing the sandbox placeholder (before any subsequent error-returning steps).
  • Update TestServerCreateSandbox cases to expect rollback (including store placeholder deletion) on Sandbox/SandboxClaim create failures.
  • Extend the test fake store to count DeleteSandboxBySessionID calls and patch deleteSandboxClaim in tests to avoid nil-client panics on rollback paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File | Description
--- | ---
pkg/workloadmanager/handlers.go | Registers rollback earlier to prevent store placeholder leaks when create fails early.
pkg/workloadmanager/handlers_test.go | Updates unit tests/fakes to assert rollback + store cleanup and cover claim rollback path.
Comments suppressed due to low confidence (1)

pkg/workloadmanager/handlers.go:171

  • In the SandboxClaim create error path, the error is re-wrapped using %v (string interpolation), which discards the original error for errors.Is/As and breaks %w wrapping conventions used elsewhere in this function. Prefer wrapping with %w so callers/logging can preserve the underlying error chain.
		if err := createSandboxClaim(ctx, dynamicClient, sandboxClaim); err != nil {
			err = api.NewInternalError(fmt.Errorf("create sandbox claim %s/%s failed: %v", sandboxClaim.Namespace, sandboxClaim.Name, err))
			return nil, err

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request moves the registration of the sandbox rollback defer function to an earlier point in the createSandbox handler to ensure that store placeholders and Kubernetes resources are cleaned up immediately if subsequent steps fail. Additionally, the unit tests have been updated to verify that rollback mechanisms, including store and claim deletions, are correctly triggered during various failure scenarios. I have no feedback to provide.

@codecov-commenter

codecov-commenter commented Apr 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 47.64%. Comparing base (57f6d84) to head (3422d40).
⚠️ Report is 70 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #299      +/-   ##
==========================================
+ Coverage   43.37%   47.64%   +4.26%     
==========================================
  Files          30       30              
  Lines        2610     2819     +209     
==========================================
+ Hits         1132     1343     +211     
+ Misses       1355     1336      -19     
- Partials      123      140      +17     
Flag | Coverage Δ
--- | ---
unittests | 47.64% <100.00%> (+4.26%) ⬆️

Comment thread pkg/workloadmanager/handlers.go Outdated
return nil, err
}

// Register rollback IMMEDIATELY after the placeholder is committed, before
Member

Nit: can we trim this comment down a bit? The key point is that rollback must be registered right after the store placeholder exists, and that it is safe before the workload exists because deletes ignore NotFound. The longer GC/create-path explanation is already covered by the PR description.

Contributor Author

Done!

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
Labels

kind/bug Something isn't working size/L

5 participants