feat(gateway): implement wake-on-traffic for paused sandboxes#586

Open
furykerry wants to merge 10 commits into openkruise:master from furykerry:feat/wake-on-traffic

@furykerry (Member) commented Jun 28, 2026

Ⅰ. Describe what this PR does

This PR implements wake-on-traffic for paused sandboxes in the sandbox-gateway. When a paused sandbox receives incoming traffic, the gateway automatically resumes it and forwards the request once the sandbox is running again.

Key changes:

  1. Wake-on-traffic in filter (pkg/sandbox-gateway/filter/filter.go): When a paused sandbox with WakeOnTraffic=true receives traffic, the filter calls Waker.Wake() to resume it. The filter returns 503 if the wake fails and 502 if the sandbox is not wakeable. Per-request diagnostic logs (Wake eligibility check) are at Debug level to reduce noise.
  2. New wake package (pkg/sandbox-gateway/wake/wake.go): Implements Waker with retry logic for transient states (e.g., SandboxIsPausing). Calls sandbox.Resume() to patch the spec and waits for the sandbox to reach Running state via the cache wait reconciler. Uses errors.Is() with sentinel errors for error classification (not string matching).
  3. E2B pause with spec patch (pkg/servers/e2b/pause_resume.go): Adds Pause() that sets Spec.DesiredState=Paused (instead of deleting the sandbox), enabling the wake-on-traffic flow. buildWakeAnnotations defers map allocation until annotations are actually needed.
  4. Atomic annotation writes (pkg/sandbox-manager/infra/sandboxcr/sandbox.go): Wake annotations are passed via ResumeOptions.Annotations and written atomically in Resume's retryUpdate, reducing API server writes. updateConnectTimeout only handles timeout with ExtendOnly policy.
  5. Registry route lifecycle (pkg/sandbox-gateway/server/server.go): handleRefresh retains routes for all states except dead and empty string. This ensures the filter has full visibility (including paused, creating, available) for routing and wake decisions.
  6. Sentinel error classification (pkg/utils/utils.go, pkg/sandbox-manager/errors/error.go): Exports sentinel errors (ErrSandboxIsPausing, ErrSandboxIsTerminating, ErrShutdownTimeReached, ErrSandboxPhaseNotAllowed) from IsSandboxResumable. Adds Unwrap() + NewErrorWrap() to managererrors.Error to enable errors.Is() through wrapped errors. Replaces strings.Contains() in wake.go with proper errors.Is().
  7. Gateway cache optimization (pkg/cache/cache.go): Restricts the gateway's informer cache to only Sandbox resources via ByObject filtering, reducing memory usage and API server load. Adds NewCacheOptions.SandboxOnly, BuildGatewayCacheConfig(), and SetupSandboxCacheControllersWithManager().
  8. Configuration: Adds ENABLE_WAKE_ON_TRAFFIC env var and wakeTimeoutSeconds config to the gateway filter.

Ⅱ. Does this pull request fix one issue?

NONE

Ⅲ. Describe how to verify it

  1. Run unit tests:
    go test ./pkg/sandbox-gateway/... ./pkg/cache/... ./pkg/servers/e2b/... ./pkg/utils/...
  2. Run E2E test:
    pytest test/e2b/test_wake_on_traffic.py
  3. Manual verification:
    • Create a sandbox, pause it via POST /sandboxes/{id}/pause
    • Send traffic to the paused sandbox
    • Verify the sandbox is automatically resumed and the request is forwarded

Ⅳ. Special notes for reviews

  • Atomic annotation writes: Wake annotations are now passed via ResumeOptions.Annotations and written atomically in Resume's retryUpdate. updateConnectTimeout no longer accepts annotations — it only handles timeout with ExtendOnly policy. This reduces the number of API server writes from 3 to 1-2 in the ConnectSandbox (paused) path.
  • Sentinel errors: IsSandboxResumable now returns (bool, error) instead of (bool, string). Callers use errors.Is() for classification. The managererrors.Error type gains Unwrap() to support errors.Is() through NewErrorWrap.
  • handleRefresh: Only dead and empty-state routes are deleted from the registry. All other states (running, paused, creating, available) are retained for full filter visibility.
  • The Waker.Wake() retry loop only retries on ErrSandboxIsPausing (transient state). Other errors like ErrSandboxIsTerminating are non-retryable and return immediately.
  • The cache optimization (SandboxOnly mode) adds nil guards on GetSandboxController() and GetSandboxSetController() since controller handlers are not set up in that mode.
  • The E2B Pause() now uses spec patch (Spec.DesiredState=Paused) instead of deleting the sandbox, which is required for the wake-on-traffic feature.
  • Log levels: Per-request diagnostic logs ("Wake eligibility check", "skip resetting timeout for never-timeout sandbox") are at Debug level (.V(utils.DebugLogLevel)) to reduce noise in production.

Appendix: Incidental Changes

  • hack/run-e2b-e2e-test.sh: Added wake-on-traffic test to the E2B test suite
  • pkg/proxy/routes.go: Minor route handling adjustments
  • pkg/utils/proxyutils/route.go: Added HasWakeAnnotation helper

@kruise-bot kruise-bot requested review from AiRanthem and zmberg June 28, 2026 06:06
@kruise-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from furykerry by writing /assign @furykerry in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov

codecov Bot commented Jun 28, 2026

Codecov Report

❌ Patch coverage is 67.85714% with 72 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.66%. Comparing base (bfa04b5) to head (eacd10c).
⚠️ Report is 4 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/cache/cache.go | 45.45% | 6 Missing and 6 partials ⚠️ |
| pkg/sandbox-manager/infra/sandboxcr/sandbox.go | 50.00% | 10 Missing and 2 partials ⚠️ |
| pkg/sandbox-gateway/filter/filter.go | 78.00% | 10 Missing and 1 partial ⚠️ |
| ...g/sandbox-gateway/controller/gateway_controller.go | 0.00% | 10 Missing ⚠️ |
| pkg/sandbox-gateway/wake/wake.go | 76.74% | 9 Missing and 1 partial ⚠️ |
| pkg/sandbox-manager/errors/error.go | 0.00% | 6 Missing ⚠️ |
| pkg/cache/index.go | 20.00% | 3 Missing and 1 partial ⚠️ |
| pkg/sandbox-gateway/server/server.go | 66.66% | 3 Missing ⚠️ |
| pkg/cache/controllers/cache_controllers.go | 0.00% | 2 Missing ⚠️ |
| pkg/cache/tasks.go | 0.00% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #586      +/-   ##
==========================================
- Coverage   79.82%   79.66%   -0.16%     
==========================================
  Files         202      203       +1     
  Lines       14791    14971     +180     
==========================================
+ Hits        11807    11927     +120     
- Misses       2553     2608      +55     
- Partials      431      436       +5     

| Flag | Coverage Δ |
| --- | --- |
| unittests | 79.66% <67.85%> (-0.16%) ⬇️ |

@furykerry furykerry force-pushed the feat/wake-on-traffic branch 6 times, most recently from 9d34b7d to 9801b3b Compare June 28, 2026 10:55
- Add Envoy Go HTTP filter for header-based sandbox routing with
  wake-on-traffic support for paused sandboxes
- Gateway controller watches Sandbox CRDs and maintains a route
  registry for fast pod IP lookups
- Waker component resumes paused sandboxes on incoming traffic
  using the existing sandbox-manager Resume implementation
- Peer route sync propagates route changes across gateway instances
- Filter config supports enable-wake-on-traffic and wake-timeout
- E2E test validates the full wake-on-traffic flow
- Fix filter to detect stale route state: when the controller
  auto-pauses a sandbox without syncing routes, the gateway
  registry may still show 'running'. The filter now verifies
  the actual sandbox state from the informer cache when
  WakeOnTraffic is enabled.
- Add retry logic in Wake for SandboxIsPausing: the controller
  auto-pause sets Spec.Paused=true immediately, but checkpointing
  takes time. The waker now retries Resume until the pause
  completes or the context expires.

Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
@furykerry furykerry force-pushed the feat/wake-on-traffic branch from 9801b3b to fc80c68 Compare June 28, 2026 11:20
Instead of relying on a short auto-pause timeout (30s) and waiting for
the controller to trigger auto-pause, manually pause the sandbox via
the E2B API (POST /sandboxes/{id}/pause) using sbx.beta_pause().

This makes the test more deterministic and avoids the race condition
where the controller's auto-pause sets Spec.Paused=true while the
checkpoint is still in progress (SandboxIsPausing state).

Changes:
- Increase sandbox timeout from 30s to 300s
- Call sbx.beta_pause() immediately after creation instead of waiting
  for auto-pause
- Keep on_timeout='pause' in lifecycle (required by server validation:
  autoResume requires autoPause)
- Remove the wait-for-auto-pause polling loop

Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
Add Waker.ResolveRoute as a single, general-purpose staleness resolver
that replaces IsSandboxPaused and HasWakeAnnotation fallback logic.

ResolveRoute fetches the sandbox from the informer cache and builds
a fresh Route via proxyutils.GetRouteFromSandbox. The filter uses
this as its single point of truth for route state, eliminating the
complex conditional logic that previously decided when to consult
the informer cache.

Changes:
- Add Waker.ResolveRoute(ctx, sandboxID, route) -> (Route, bool)
- Remove Waker.IsSandboxPaused (replaced by ResolveRoute)
- Simplify filter DecodeHeaders: call ResolveRoute once after
  registry lookup, use the resolved route for all state checks
- Remove HasWakeAnnotation fallback in wake path (ResolveRoute
  provides fresh WakeOnTraffic from informer when available)
- Update tests: add TestResolveRoute, update cache fallback tests

Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
@furykerry furykerry force-pushed the feat/wake-on-traffic branch from 05f4bba to 46fd168 Compare June 28, 2026 13:31
Previously, handleRefresh deleted routes from the registry for any
non-running state, including paused. This meant when wake-on-traffic
arrived, the filter couldn't find the sandbox and returned 502.

Changes:
- handleRefresh: Keep paused routes in the registry (same as running).
  Only delete for truly non-routable states (dead, creating, available).
- filter: Remove ResolveRoute fallback on registry miss and the
  post-hit ResolveRoute refresh. The registry now always has the
  correct route state via peer sync, so the filter relies on it
  directly.
- Added TestHandleRefresh_PausedState to verify paused routes are
  preserved in the registry with correct state and WakeOnTraffic flag.

Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
@furykerry furykerry force-pushed the feat/wake-on-traffic branch from 46fd168 to 02d8fdc Compare June 28, 2026 14:26
These functions were introduced as informer fallbacks for registry misses
and non-routable states, but are no longer needed now that handleRefresh
retains paused routes in the registry. The filter now relies solely on
the registry for route state, making both helpers dead code.

- Remove ResolveRoute(ctx, sandboxID, route) from wake.go
- Remove HasWakeAnnotation(ctx, ns, name) from wake.go
- Remove TestResolveRoute and TestHasWakeAnnotation from wake_test.go
- Remove unused proxyutils and proxy imports from wake.go and wake_test.go

Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
Add tests for the wake execution paths in DecodeHeaders:
- TestDecodeHeadersWakeOnTrafficWakeFails: verifies 503 when Wake() errors
- TestDecodeHeadersWakeOnTrafficWakeSucceeds: verifies Continue when Wake() succeeds

These tests use cachetest.NewTestCache with mock manager to simulate
the full wake flow through the filter, improving patch coverage.

Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
The sandbox-gateway cache was caching 7+ resource types (Sandbox,
SandboxSet, Checkpoint, SandboxTemplate, PersistentVolume, Secret,
ConfigMap) but only needs Sandbox resources.

Changes:
- Add NewCacheOptions with SandboxOnly flag to NewCache()
- Add BuildGatewayCacheConfig() for ByObject cache filtering
- Add SetupSandboxCacheControllersWithManager() for minimal controller setup
- Add addIndexesToCache() to filter indexes by resource type
- Update gateway controller to use sandbox-only cache
- Add nil guards on getter methods for SandboxOnly mode
- Add unit test for BuildGatewayCacheConfig

Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
- Export sentinel errors (ErrSandboxIsPausing, ErrSandboxIsTerminating,
  ErrShutdownTimeReached, ErrSandboxPhaseNotAllowed) from pkg/utils
- Change IsSandboxResumable signature from (bool, string) to (bool, error)
- Add Unwrap() method and NewErrorWrap constructor to managererrors.Error
  to enable errors.Is() chain through wrapped errors
- Update Resume() in sandboxcr to wrap sentinel errors via NewErrorWrap
- Update wake.go to use errors.Is(resumeErr, utils.ErrSandboxIsPausing)
  instead of strings.Contains(resumeErr.Error(), "SandboxIsPausing")
- Update cache/tasks.go caller for new signature
- Update TestIsSandboxResumable to use assert.ErrorIs

Also fix handleRefresh to only delete routes from registry when
state is dead or empty, keeping creating/available routes for
full filter visibility.

Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
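
As an illustration of the handleRefresh rule in this commit (delete only when the state is dead or empty), here is a small sketch. The `Route` and `Registry` types are hypothetical simplifications, and the sketch assumes single-goroutine use; the real registry also carries the pod IP, the WakeOnTraffic flag, and peer sync.

```go
package main

import "fmt"

// Route is a hypothetical, simplified registry entry.
type Route struct {
	ID    string
	State string // "running", "paused", "creating", "available", "dead", or ""
}

// Registry is a hypothetical in-memory route table (not goroutine-safe).
type Registry struct {
	routes map[string]Route
}

func NewRegistry() *Registry {
	return &Registry{routes: make(map[string]Route)}
}

// handleRefresh applies one route update: dead and empty-state routes
// are removed, every other state is stored so the filter can see it.
func (r *Registry) handleRefresh(rt Route) {
	switch rt.State {
	case "dead", "":
		delete(r.routes, rt.ID)
	default:
		r.routes[rt.ID] = rt
	}
}

func main() {
	reg := NewRegistry()
	reg.handleRefresh(Route{ID: "sbx-1", State: "running"})
	reg.handleRefresh(Route{ID: "sbx-1", State: "paused"})
	rt, ok := reg.routes["sbx-1"]
	fmt.Println(rt.State, ok) // paused true

	reg.handleRefresh(Route{ID: "sbx-1", State: "dead"})
	_, ok = reg.routes["sbx-1"]
	fmt.Println(ok) // false
}
```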
- Merge wake annotations into ResumeOptions.Annotations, written
  atomically in Resume's retryUpdate to reduce API server writes
- Remove separate updateConnectTimeout annotations parameter;
  it now only handles timeout with ExtendOnly policy
- Optimize buildWakeAnnotations to defer map allocation until needed
- Demote per-request diagnostic logs to Debug level:
  - Wake eligibility check in filter.DecodeHeaders
  - Skip resetting timeout for never-timeout sandbox
- Add TestConfigParserParseUnmarshalError for json.Unmarshal error path
- Coverage: filter package 94.5% → 95.5%, Parse 81.2% → 87.5%

Signed-off-by: 守辰 <shouchen.zz@alibaba-inc.com>
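
The deferred map allocation mentioned for buildWakeAnnotations can be illustrated as below. The function signature and the annotation keys are made up for this example and do not reflect the real code; the point is only that the map stays nil until the first annotation is actually set.

```go
package main

import (
	"fmt"
	"strconv"
)

// buildWakeAnnotations is a hypothetical version of the helper: it
// returns nil when there is nothing to write, and allocates the map
// only when the first annotation is added.
func buildWakeAnnotations(wakeOnTraffic bool, timeoutSeconds int) map[string]string {
	var annotations map[string]string // nil until needed

	set := func(key, value string) {
		if annotations == nil {
			annotations = make(map[string]string, 2)
		}
		annotations[key] = value
	}

	if wakeOnTraffic {
		set("example.io/wake-on-traffic", "true") // placeholder key
	}
	if timeoutSeconds > 0 {
		set("example.io/wake-timeout-seconds", strconv.Itoa(timeoutSeconds)) // placeholder key
	}
	return annotations
}

func main() {
	fmt.Println(buildWakeAnnotations(false, 0) == nil) // true
	fmt.Println(len(buildWakeAnnotations(true, 30)))   // 2
}
```

A nil result also gives the caller a cheap way to tell that no annotation update is needed.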