feat(gateway): implement wake-on-traffic for paused sandboxes #586
furykerry wants to merge 10 commits
Codecov Report
@@ Coverage Diff @@
## master #586 +/- ##
==========================================
- Coverage 79.82% 79.66% -0.16%
==========================================
Files 202 203 +1
Lines 14791 14971 +180
==========================================
+ Hits 11807 11927 +120
- Misses 2553 2608 +55
- Partials 431 436 +5
- Add Envoy Go HTTP filter for header-based sandbox routing with wake-on-traffic support for paused sandboxes
- Gateway controller watches Sandbox CRDs and maintains a route registry for fast pod IP lookups
- Waker component resumes paused sandboxes on incoming traffic using the existing sandbox-manager Resume implementation
- Peer route sync propagates route changes across gateway instances
- Filter config supports enable-wake-on-traffic and wake-timeout
- E2E test validates the full wake-on-traffic flow
- Fix filter to detect stale route state: when the controller auto-pauses a sandbox without syncing routes, the gateway registry may still show 'running'. The filter now verifies the actual sandbox state from the informer cache when WakeOnTraffic is enabled.
- Add retry logic in Wake for SandboxIsPausing: the controller auto-pause sets Spec.Paused=true immediately, but checkpointing takes time. The waker now retries Resume until the pause completes or the context expires.
Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
Instead of relying on a short auto-pause timeout (30s) and waiting for
the controller to trigger auto-pause, manually pause the sandbox via
the E2B API (POST /sandboxes/{id}/pause) using sbx.beta_pause().
This makes the test more deterministic and avoids the race condition
where the controller's auto-pause sets Spec.Paused=true while the
checkpoint is still in progress (SandboxIsPausing state).
Changes:
- Increase sandbox timeout from 30s to 300s
- Call sbx.beta_pause() immediately after creation instead of waiting
for auto-pause
- Keep on_timeout='pause' in lifecycle (required by server validation:
autoResume requires autoPause)
- Remove the wait-for-auto-pause polling loop
Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
Add Waker.ResolveRoute as a single, general-purpose staleness resolver that replaces IsSandboxPaused and HasWakeAnnotation fallback logic.
ResolveRoute fetches the sandbox from the informer cache and builds a fresh Route via proxyutils.GetRouteFromSandbox. The filter uses this as its single point of truth for route state, eliminating the complex conditional logic that previously decided when to consult the informer cache.
Changes:
- Add Waker.ResolveRoute(ctx, sandboxID, route) -> (Route, bool)
- Remove Waker.IsSandboxPaused (replaced by ResolveRoute)
- Simplify filter DecodeHeaders: call ResolveRoute once after registry lookup, use the resolved route for all state checks
- Remove HasWakeAnnotation fallback in wake path (ResolveRoute provides fresh WakeOnTraffic from informer when available)
- Update tests: add TestResolveRoute, update cache fallback tests
Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
Previously, handleRefresh deleted routes from the registry for any non-running state, including paused. As a result, when traffic arrived for a paused sandbox, the filter couldn't find its route and returned 502.
Changes:
- handleRefresh: Keep paused routes in the registry (same as running). Only delete for truly non-routable states (dead, creating, available).
- filter: Remove ResolveRoute fallback on registry miss and the post-hit ResolveRoute refresh. The registry now always has the correct route state via peer sync, so the filter relies on it directly.
- Added TestHandleRefresh_PausedState to verify paused routes are preserved in the registry with correct state and WakeOnTraffic flag.
Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
These functions were introduced as informer fallbacks for registry misses and non-routable states, but are no longer needed now that handleRefresh retains paused routes in the registry. The filter now relies solely on the registry for route state, making both helpers dead code.
- Remove ResolveRoute(ctx, sandboxID, route) from wake.go
- Remove HasWakeAnnotation(ctx, ns, name) from wake.go
- Remove TestResolveRoute and TestHasWakeAnnotation from wake_test.go
- Remove unused proxyutils and proxy imports from wake.go and wake_test.go
Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
Add tests for the wake execution paths in DecodeHeaders:
- TestDecodeHeadersWakeOnTrafficWakeFails: verifies 503 when Wake() errors
- TestDecodeHeadersWakeOnTrafficWakeSucceeds: verifies Continue when Wake() succeeds
These tests use cachetest.NewTestCache with mock manager to simulate the full wake flow through the filter, improving patch coverage.
Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
The sandbox-gateway cache was caching 7+ resource types (Sandbox, SandboxSet, Checkpoint, SandboxTemplate, PersistentVolume, Secret, ConfigMap) but only needs Sandbox resources.
Changes:
- Add NewCacheOptions with SandboxOnly flag to NewCache()
- Add BuildGatewayCacheConfig() for ByObject cache filtering
- Add SetupSandboxCacheControllersWithManager() for minimal controller setup
- Add addIndexesToCache() to filter indexes by resource type
- Update gateway controller to use sandbox-only cache
- Add nil guards on getter methods for SandboxOnly mode
- Add unit test for BuildGatewayCacheConfig
Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
- Export sentinel errors (ErrSandboxIsPausing, ErrSandboxIsTerminating, ErrShutdownTimeReached, ErrSandboxPhaseNotAllowed) from pkg/utils
- Change IsSandboxResumable signature from (bool, string) to (bool, error)
- Add Unwrap() method and NewErrorWrap constructor to managererrors.Error to enable errors.Is() chain through wrapped errors
- Update Resume() in sandboxcr to wrap sentinel errors via NewErrorWrap
- Update wake.go to use errors.Is(resumeErr, utils.ErrSandboxIsPausing) instead of strings.Contains(resumeErr.Error(), "SandboxIsPausing")
- Update cache/tasks.go caller for new signature
- Update TestIsSandboxResumable to use assert.ErrorIs
Also fix handleRefresh to only delete routes from registry when state is dead or empty, keeping creating/available routes for full filter visibility.
Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
- Merge wake annotations into ResumeOptions.Annotations, written atomically in Resume's retryUpdate to reduce API server writes
- Remove separate updateConnectTimeout annotations parameter; it now only handles timeout with ExtendOnly policy
- Optimize buildWakeAnnotations to defer map allocation until needed
- Demote per-request diagnostic logs to Debug level:
  - Wake eligibility check in filter.DecodeHeaders
  - Skip resetting timeout for never-timeout sandbox
- Add TestConfigParserParseUnmarshalError for json.Unmarshal error path
- Coverage: filter package 94.5% → 95.5%, Parse 81.2% → 87.5%
Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
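Deferring map allocation, as mentioned for buildWakeAnnotations, usually looks something like the sketch below. The function name, parameters, and annotation keys here are invented for illustration and are not taken from the PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// buildAnnotations returns nil when there is nothing to write, so the
// common path (no wake settings) allocates no map at all.
// The annotation keys are hypothetical placeholders.
func buildAnnotations(wakeOnTraffic bool, wakeTimeoutSeconds int) map[string]string {
	var out map[string]string // stays nil until the first write

	set := func(key, value string) {
		if out == nil {
			out = make(map[string]string, 2)
		}
		out[key] = value
	}

	if wakeOnTraffic {
		set("example.io/wake-on-traffic", "true")
	}
	if wakeTimeoutSeconds > 0 {
		set("example.io/wake-timeout-seconds", strconv.Itoa(wakeTimeoutSeconds))
	}
	return out
}

func main() {
	fmt.Println(buildAnnotations(false, 0) == nil)
	fmt.Println(len(buildAnnotations(true, 30)))
}
```

Callers can range over or read from the returned map without a nil check, since reading a nil map is safe in Go; only writes require an allocated map.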
Ⅰ. Describe what this PR does
This PR implements wake-on-traffic for paused sandboxes in the sandbox-gateway. When a paused sandbox receives incoming traffic, the gateway automatically resumes it and forwards the request once the sandbox is running again.
Key changes:
- pkg/sandbox-gateway/filter/filter.go: When a paused sandbox with WakeOnTraffic=true receives traffic, the filter calls Waker.Wake() to resume it. Returns 503 if wake fails, 502 if not wakeable. Per-request diagnostic logs (Wake eligibility check) are at Debug level to reduce noise.
- wake package (pkg/sandbox-gateway/wake/wake.go): Implements Waker with retry logic for transient states (e.g., SandboxIsPausing). Calls sandbox.Resume() to patch the spec and waits for the sandbox to reach Running state via the cache wait reconciler. Uses errors.Is() with sentinel errors for error classification (not string matching).
- pkg/servers/e2b/pause_resume.go: Adds Pause() that sets Spec.DesiredState=Paused (instead of deleting the sandbox), enabling the wake-on-traffic flow. buildWakeAnnotations defers map allocation until annotations are actually needed.
- pkg/sandbox-manager/infra/sandboxcr/sandbox.go: Wake annotations are passed via ResumeOptions.Annotations and written atomically in Resume's retryUpdate, reducing API server writes. updateConnectTimeout only handles timeout with ExtendOnly policy.
- pkg/sandbox-gateway/server/server.go: handleRefresh retains routes for all states except dead and empty string. This ensures the filter has full visibility (including paused, creating, available) for routing and wake decisions.
- pkg/utils/utils.go, pkg/sandbox-manager/errors/error.go: Exports sentinel errors (ErrSandboxIsPausing, ErrSandboxIsTerminating, ErrShutdownTimeReached, ErrSandboxPhaseNotAllowed) from IsSandboxResumable. Adds Unwrap() + NewErrorWrap() to managererrors.Error to enable errors.Is() through wrapped errors. Replaces strings.Contains() in wake.go with proper errors.Is().
- pkg/cache/cache.go: Restricts the gateway's informer cache to only Sandbox resources via ByObject filtering, reducing memory usage and API server load. Adds NewCacheOptions.SandboxOnly, BuildGatewayCacheConfig(), and SetupSandboxCacheControllersWithManager().
- Adds the ENABLE_WAKE_ON_TRAFFIC env var and wakeTimeoutSeconds config to the gateway filter.
Ⅱ. Does this pull request fix one issue?
NONE
Ⅲ. Describe how to verify it
- Run the unit tests: go test ./pkg/sandbox-gateway/... ./pkg/cache/... ./pkg/servers/e2b/... ./pkg/utils/...
- Pause a sandbox through the E2B API: POST /sandboxes/{id}/pause
Ⅳ. Special notes for reviews
- Wake annotations are passed via ResumeOptions.Annotations and written atomically in Resume's retryUpdate. updateConnectTimeout no longer accepts annotations; it only handles timeout with ExtendOnly policy. This reduces the number of API server writes from 3 to 1-2 in the ConnectSandbox (paused) path.
- IsSandboxResumable now returns (bool, error) instead of (bool, string). Callers use errors.Is() for classification. The managererrors.Error type gains Unwrap() to support errors.Is() through NewErrorWrap.
- Only dead and empty-state routes are deleted from the registry. All other states (running, paused, creating, available) are retained for full filter visibility.
- The Waker.Wake() retry loop only retries on ErrSandboxIsPausing (transient state). Other errors like ErrSandboxIsTerminating are non-retryable and return immediately.
- SandboxOnly mode adds nil guards on GetSandboxController() and GetSandboxSetController() since controller handlers are not set up in that mode.
- Pause() now uses spec patch (Spec.DesiredState=Paused) instead of deleting the sandbox, which is required for the wake-on-traffic feature.
- Per-request diagnostic logs are demoted to Debug level (.V(utils.DebugLogLevel)) to reduce noise in production.
Appendix: Incidental Changes
- hack/run-e2b-e2e-test.sh: Added wake-on-traffic test to the E2B test suite
- pkg/proxy/routes.go: Minor route handling adjustments
- pkg/utils/proxyutils/route.go: Added HasWakeAnnotation helper
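For reviewers, the route-retention rule from the special notes (only dead and empty-state routes leave the registry) can be sketched as below. The `registry` type, its map layout, and the `shouldDelete` helper are assumptions for illustration; the real handleRefresh also deals with peer sync and richer route data.

```go
package main

import "fmt"

// registry is a hypothetical, simplified route registry keyed by sandbox ID.
type registry struct {
	routes map[string]string // sandbox ID -> state
}

// shouldDelete reports whether a refreshed route must leave the registry.
// Every other state stays so the filter can see it.
func shouldDelete(state string) bool {
	return state == "" || state == "dead"
}

// handleRefresh applies one route update to the registry.
func (r *registry) handleRefresh(id, state string) {
	if shouldDelete(state) {
		delete(r.routes, id)
		return
	}
	r.routes[id] = state
}

func main() {
	r := &registry{routes: map[string]string{}}
	r.handleRefresh("sbx-1", "running")
	r.handleRefresh("sbx-1", "paused") // retained: the filter can still wake it
	r.handleRefresh("sbx-2", "creating")
	r.handleRefresh("sbx-2", "dead") // removed
	fmt.Println(r.routes)
}
```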