fix(CubeProxy): disable buffering for envd streaming endpoints #647
zyl1121 wants to merge 1 commit
Review: PR #647 - CubeProxy nginx streaming buffering fix

The change is functionally correct and well-scoped. proxy_buffering off is the right fix for gRPC server-streaming endpoints where nginx buffering batches frames until buffer-full or timeout.

1. Sandbox path bypasses unbuffered endpoints (medium)
Nginx ^~ prefix locations take priority over regex. Requests to /sandbox///process.Process/Start match ^~ /sandbox/ (lines 138/272) and inherit global proxy_buffering on. If streaming endpoints are only host-routed this is fine; otherwise the ^~ /sandbox/ blocks need proxy_buffering off too.

2. Slow-client exhaustion risk (medium)
proxy_buffering off + 7206s timeouts means slow-reading clients hold upstream connections for up to 2 hours. Consider tighter timeouts or application-level heartbeats.

3. proxy_intercept_errors not effective (low)
With buffering off, upstream error details pass through verbatim. Add explicit proxy_intercept_errors off to make intent visible.
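The location-priority point in item 1 can be illustrated with a small model of nginx's location selection. This is a simplified sketch, not nginx itself: the `Location` class, the `select_location` helper, the location names, and the sample sandbox ID and port are all hypothetical, and the location set is not the actual CubeProxy config. It only models the documented ordering (exact match, then longest prefix with `^~` short-circuiting, then regexes in declaration order).

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Location:
    modifier: str  # "", "=", "^~", "~" or "~*"
    pattern: str
    name: str      # hypothetical label, for illustration only


def select_location(locations: List[Location], uri: str) -> Optional[Location]:
    """Simplified model of how nginx picks a location for a URI."""
    # 1. An exact-match location wins outright.
    for loc in locations:
        if loc.modifier == "=" and loc.pattern == uri:
            return loc
    # 2. Find the longest matching prefix location.
    prefixes = [
        loc for loc in locations
        if loc.modifier in ("", "^~") and uri.startswith(loc.pattern)
    ]
    best = max(prefixes, key=lambda loc: len(loc.pattern), default=None)
    # If that prefix carries ^~, regex locations are never consulted.
    if best is not None and best.modifier == "^~":
        return best
    # 3. Otherwise the first matching regex (in declaration order) wins.
    for loc in locations:
        if loc.modifier in ("~", "~*"):
            flags = re.IGNORECASE if loc.modifier == "~*" else 0
            if re.search(loc.pattern, uri, flags):
                return loc
    # 4. Fall back to the longest prefix match, if any.
    return best


STREAM_REGEX = r"^/sandbox/[^/]+/\d+/process\.Process/Start$"
uri = "/sandbox/sbx-1/49983/process.Process/Start"  # hypothetical ID and port

# Hypothetical config A: the /sandbox/ prefix uses ^~ and shadows the regex.
shadowed = [
    Location("", "/", "default"),
    Location("^~", "/sandbox/", "sandbox-prefix"),
    Location("~", STREAM_REGEX, "streaming-regex"),
]
# Hypothetical config B: a plain prefix lets the regex location win.
reachable = [
    Location("", "/", "default"),
    Location("", "/sandbox/", "sandbox-prefix"),
    Location("~", STREAM_REGEX, "streaming-regex"),
]

print(select_location(shadowed, uri).name)   # sandbox-prefix
print(select_location(reachable, uri).name)  # streaming-regex
```

Under this model, a regex location for the streaming paths only takes effect if no `^~` prefix location is the longest prefix match for the same URI.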
956aed3 to 9168339
9168339 to 1f3633a
chenhengqi left a comment
I have a similar attempt in #577, but we consider it suboptimal anyway.
It would be better if we could address it on the envd side.
Could you please craft an example along with this PR? Thanks.
Thanks, that makes sense. I added a [...] This PR keeps the proxy-side mitigation narrow for the currently verified streaming paths, and the same repro should also be useful for follow-up on the envd side.
# Path-based server-streaming envd endpoints. Keep buffering enabled
# elsewhere under /sandbox/ and disable it only for these verified
# response streams.
location ~ ^/sandbox/[^/]+/\d+/(?:process\.Process/(?:Start|Connect)|filesystem\.Filesystem/WatchDir)$ {
Do we need SendInput here?
I tested this on a real cluster as a follow-up, and I don't think we need a separate unbuffered SendInput route. SendInput itself is still a unary, input-side call; the real-time output is delivered through the already-unbuffered Start / Connect streaming paths.
In the PTY repro, with the current patch, pty.send_stdin() returned in about 10ms and the PTY marker was streamed back in about 15ms. After removing the unbuffering change for Start / Connect, the serial PTY flow was blocked earlier: pty.create() was buffered until near the connection timeout, and the later connect() / send_stdin() calls failed with NotFoundException. This points to the streaming Start / Connect path as the failure point, not SendInput.
I also attached the repro script I used. After exporting the corresponding SDK environment variables for the target cluster, it can be run directly, for example:
export E2B_API_URL="<your-cube-api-url>"
export E2B_API_KEY="e2b_000000"
export CUBE_TEMPLATE_ID="<your-template-id>"
export SSL_CERT_FILE="/root/.local/share/mkcert/rootCA.pem"
python demo-pty.py
The script I used: demo-pty.py
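As a quick check of which request paths the path-based location regex from the diff covers, the same pattern can be run through Python's `re` module. This is only a sketch: the pattern uses syntax common to PCRE and Python, the `is_unbuffered` helper is hypothetical, and the sandbox ID and port in the sample paths are made up for illustration.

```python
import re

# Pattern copied from the nginx location in the diff.
STREAMING_LOCATION = re.compile(
    r"^/sandbox/[^/]+/\d+/"
    r"(?:process\.Process/(?:Start|Connect)|filesystem\.Filesystem/WatchDir)$"
)


def is_unbuffered(path: str) -> bool:
    """Return True if the path would match the unbuffered regex location."""
    return STREAMING_LOCATION.search(path) is not None


# Hypothetical sandbox ID ("sbx-1") and container port (49983).
samples = [
    "/sandbox/sbx-1/49983/process.Process/Start",
    "/sandbox/sbx-1/49983/process.Process/Connect",
    "/sandbox/sbx-1/49983/filesystem.Filesystem/WatchDir",
    "/sandbox/sbx-1/49983/process.Process/SendInput",
    "/sandbox/sbx-1/process.Process/Start",  # no port segment
]

for path in samples:
    print(f"{path}: {'unbuffered' if is_unbuffered(path) else 'buffered'}")
```

The first three paths match; SendInput and the path without a numeric port segment do not, so they keep the global buffering behavior.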
1f3633a to 47dd074
Disable nginx response buffering for the verified envd server-streaming endpoints in CubeProxy. This lets early stream frames be delivered promptly for background commands and watch streams, while keeping global buffering behavior unchanged for other routes.

Signed-off-by: zhengyilei <zheng_yilei@qq.com>
47dd074 to a779149
@kinwin-ustc I updated the PR accordingly. I removed [...] I also re-ran the real-cluster checks through the normal one-click startup path. With the updated template/runtime config and image together, the verified repros still behave as expected.
LGTM. Thanks for your PR. I'd also like to ask whether you've used it in a real production environment. What are the application scenarios, and what is the approximate cluster size?
@kinwin-ustc Thanks. We are not using CubeSandbox as a fully rolled-out production platform yet, but we have already been testing it internally and are planning a small-scale rollout to users. Currently, our test cluster has 3 nodes, and we have tested up to around 500 sandboxes.

Our main scenario is short-lived code execution for agents: the agent itself runs outside the sandbox, and only creates a sandbox when it needs to execute code or perform file operations.

We initially started exploring CubeSandbox for SWE-bench / RL-style training scenarios as well. However, with the current template mechanism, each image needs to be templated and distributed to the nodes. When the number of images becomes large, potentially thousands in our evaluation, this creates significant storage and management pressure.

We also have some internal long-running agent sandbox scenarios, such as Openclaw, but we have not yet migrated those workloads to CubeSandbox. The main reason is that we still need stronger guarantees around sandbox recovery, node failure handling, persistent storage, and observability before relying on it more broadly. On the observability side, I'm also exploring VM-level ideas including possible eBPF-based approaches.

Personally, I think CubeSandbox is very promising. That is why I have been actively testing it and submitting PRs, and I'm happy to keep contributing and exploring it further.
You can use high-throughput shared storage to store these templates. In addition, CubeSandbox has a fast cold start. You can skip the step of creating a template from an image and cold-start the MVM directly from the image. This greatly reduces storage overhead and takes advantage of CubeSandbox's high-concurrency capabilities.
These capabilities are all part of our open source plan.
@kinwin-ustc Thanks for the clarification.
For the image cold-start path, maybe I missed something. Is there any public documentation or example for directly cold-starting an MVM from an OCI image without creating a template first? From the docs and code paths I checked, the current user-facing flow still seems to be
For the shared storage approach, I also want to confirm the recommended layout. Is the expectation that only template/rootfs/snapshot artifacts are stored on shared storage, while per-sandbox writable/COW layers stay on local disks? I’m not sure whether this separation is already supported in the current implementation, or it is more of a future direction. If the whole cubelet data path is expected to be on shared storage, my concern is that sandbox writable data may add significant pressure to the storage client and backend, depending on the backend type, for example NAS/CephFS or object-storage FUSE mounts.
Good to know these capabilities are part of the open-source plan. Is there already a public roadmap or design for per-sandbox / VM-level observability? I'm still at the planning stage, and I'm considering possible eBPF-based approaches inside each MicroVM, with events exported back through a vsock-based path, similar to how sandbox logs are collected in #535. The main areas I'm thinking about are process execution, file activity, and syscall-level auditing. Network activity may already be partially covered by cube-egress / network-agent, so I'm less sure about that part. I'd like to first check whether this overlaps with any existing plan. If I get further with this, I'll open a separate discussion for the design.
The open-source version does not directly expose this capability in its workflow.
The highest-priority candidate for shared storage is the image, followed by memory templates. If you need to clone or roll back at runtime and have high performance requirements, it is best to keep that data local or choose shared storage with CoW capabilities. Temporary storage can use local drives directly; persistent storage can use our upcoming Volume SDK, and the backend is a custom distributed storage system. The I/O load and access pattern determine the choice of backend storage. For images and memory templates, the distributed storage needs relatively high I/O throughput. None of this is supported in the current version; it requires some custom selection and development work.
This direction is correct. We have already deployed similar solutions internally and implemented higher-performance zero-copy communication channels in the guest kernel, with functionality similar to VSOCK but with much higher performance and lower resource consumption.
Motivation
CubeProxy currently enables proxy_buffering on globally. This is fine for ordinary unary envd endpoints, but it can delay response frames from envd server-streaming endpoints until the stream completes or the request times out.
Two regressions were verified on a real cluster:
- sandbox.commands.run(..., background=True) did not return immediately. run() blocked until the command exited, and handle.wait() returned immediately afterwards.
- filesystem.Filesystem/WatchDir did not deliver its initial start event or first filesystem event in real time. The first streamed events arrived only near the request timeout, even though the watched file changed much earlier.

What Changed
This PR disables nginx response buffering only for the currently verified envd server-streaming endpoints:
- process.Process/Start
- process.Process/Connect
- filesystem.Filesystem/WatchDir
The change is applied in both CubeProxy data-plane server blocks. The global default buffering behavior remains unchanged for all other routes, keeping the scope narrow.
It covers both CubeProxy entrypoints for those verified streams:
- /process.Process/Start
- /sandbox/<sandbox-id>/<container-port>/process.Process/Start

Testing
Real-cluster background=True repro:
- Before the fix: sandbox.commands.run(..., background=True) still waited for the command to finish before returning, and handle.wait() returned immediately afterwards
- After the fix: run() returned promptly after the process started, and the waiting moved back to handle.wait()

Real-cluster WatchDir repro:
- CREATE/WRITE around the scheduled 1s file write before and after the fix
- Before the fix: Filesystem.WatchDir stream delivered start and the first filesystem event only around the 6s request timeout boundary
- After the fix: start arrived in about 15ms and the first streamed filesystem event arrived around the scheduled 1s file write

Follow-up validation for maintainer question: interactive Process.Connect + SendInput:
- Process.Connect and delayed input
- echo:ping-from-sendinput within the measurement resolution, and the control-file side effect appeared about 220ms later
- Process.Connect unbuffering already covers the interactive input scenario, so a separate unbuffered SendInput route is not needed based on this validation

Follow-up validation for maintainer question: PTY path:
- sandbox.pty.create() + sandbox.pty.connect() + sandbox.pty.send_stdin()
- sandbox.pty.connect() attached promptly
- With the fix: sandbox.pty.send_stdin() returned within about 10ms, the PTY stream delivered the marker within about 15ms, and the control-file side effect appeared within about 50ms
- Without the fix: sandbox.pty.create() only returned near the timeout boundary, and the subsequent sandbox.pty.connect() / sandbox.pty.send_stdin() calls failed with NotFoundException
- Start/Connect time under buffered proxy behavior, which is stronger evidence for targeting the streaming routes instead of adding a separate unbuffered SendInput route

Example Repro
The clearest minimal repro for the buffering issue is the low-level envd Filesystem.WatchDir server stream.
The same sandbox is exercised through two paths:
- sandbox.files.watch_dir() keeps working and sees the scheduled file write around the expected 1s point
- Filesystem.WatchDir should emit start immediately and then emit the first filesystem event when the file is written
Minimal repro sketch:
Observed behavior on a real cluster:
- Before the fix: the WatchDir stream did not deliver start or the first filesystem event until around the 6s request timeout boundary
- After the fix: start arrived in about 15ms and the first filesystem event arrived around the scheduled 1s write
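The timing pattern in the observed behavior can be illustrated offline with a toy model of a buffering proxy. This is not the repro script referenced in the PR, and it is not how nginx implements buffering: the `pass_through` and `buffered` helpers, the 4096-byte buffer size, and the payloads are assumptions for illustration. Only the timings (about 15ms for start, a 1s file write, a 6s request timeout) come from the description above.

```python
from typing import Iterable, Iterator, List, Tuple

# (time the upstream produced the frame, in seconds; payload)
Frame = Tuple[float, bytes]


def pass_through(frames: Iterable[Frame]) -> Iterator[Frame]:
    """Unbuffered model: each frame is delivered when it is produced."""
    for produced_at, payload in frames:
        yield produced_at, payload


def buffered(
    frames: Iterable[Frame], buffer_size: int, stream_end: float
) -> Iterator[Frame]:
    """Toy buffered model: hold frames until buffer_size bytes accumulate
    or the stream ends (for example, at the request timeout)."""
    pending: List[bytes] = []
    pending_bytes = 0
    for produced_at, payload in frames:
        pending.append(payload)
        pending_bytes += len(payload)
        if pending_bytes >= buffer_size:
            # Buffer is full: flush everything at the current time.
            for held in pending:
                yield produced_at, held
            pending, pending_bytes = [], 0
    # Whatever is left only goes out when the stream ends.
    for held in pending:
        yield stream_end, held


# Two small frames: the initial start event and one filesystem event.
frames: List[Frame] = [(0.015, b"start"), (1.0, b"event:WRITE")]

unbuffered_times = [t for t, _ in pass_through(frames)]
buffered_times = [t for t, _ in buffered(frames, buffer_size=4096, stream_end=6.0)]

print(unbuffered_times)  # [0.015, 1.0]
print(buffered_times)    # [6.0, 6.0]
```

In this model the small frames never fill the buffer, so the buffered path delivers them only at the 6s stream end, which matches the shape of the before-the-fix observation.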