fix(CubeProxy): disable buffering for envd streaming endpoints by zyl1121 · Pull Request #647 · TencentCloud/CubeSandbox

zyl1121 · 2026-06-25T11:22:10Z

Motivation

CubeProxy currently sets proxy_buffering on globally. This is fine for ordinary unary envd endpoints, but it can delay response frames from envd server-streaming endpoints until the stream completes or the request times out.

Two regressions were verified on a real cluster:

sandbox.commands.run(..., background=True) did not return immediately. run() blocked until the command exited, and handle.wait() returned immediately afterwards.
filesystem.Filesystem/WatchDir did not deliver its initial start event or first filesystem event in real time. The first streamed events arrived only near the request timeout, even though the watched file changed much earlier.
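
The delay can be pictured with a toy model. This is a simplified sketch, not nginx's actual buffering algorithm: it assumes the proxy holds frames until a buffer fills or the upstream closes the stream. The names Frame, deliver_unbuffered, and deliver_buffered, and the buffer size, are illustrative only; the timings borrow the WatchDir figures reported in this PR (about 15ms, 1s, and the 6s timeout).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Frame:
    sent_at: float  # seconds after request start, as seen on the upstream side
    payload: bytes


def deliver_unbuffered(frames: List[Frame]) -> List[Tuple[float, bytes]]:
    """Each frame reaches the client at the moment the upstream sends it."""
    return [(f.sent_at, f.payload) for f in frames]


def deliver_buffered(
    frames: List[Frame], stream_end: float, buffer_size: int
) -> List[Tuple[float, bytes]]:
    """Toy model: hold frames until the buffer fills or the stream ends."""
    delivered: List[Tuple[float, bytes]] = []
    pending: List[Frame] = []
    pending_bytes = 0
    for frame in frames:
        pending.append(frame)
        pending_bytes += len(frame.payload)
        if pending_bytes >= buffer_size:
            # Buffer is full: everything held so far is flushed now.
            flush_time = frame.sent_at
            delivered.extend((flush_time, p.payload) for p in pending)
            pending, pending_bytes = [], 0
    # Whatever is still held only goes out when the stream ends or times out.
    delivered.extend((stream_end, p.payload) for p in pending)
    return delivered


if __name__ == "__main__":
    frames = [Frame(0.015, b"start"), Frame(1.0, b"create")]
    print("unbuffered:", deliver_unbuffered(frames))
    print("buffered:  ", deliver_buffered(frames, stream_end=6.0, buffer_size=4096))
```

In this model, two small frames never fill the assumed 4096-byte buffer, so both are released only at the 6s stream end, which mirrors the pre-fix WatchDir observation.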

What Changed

This PR disables nginx response buffering only for the currently verified envd server-streaming endpoints:

process.Process/Start
process.Process/Connect
filesystem.Filesystem/WatchDir

The change is applied in both CubeProxy data-plane server blocks. The global default buffering behavior remains unchanged for all other routes, keeping the scope narrow.

It covers both CubeProxy entrypoints for those verified streams:

host-routed requests such as /process.Process/Start
path-routed requests such as /sandbox/<sandbox-id>/<container-port>/process.Process/Start

Testing

Real-cluster background=True repro:

before the fix, sandbox.commands.run(..., background=True) still waited for the command to finish before returning, and handle.wait() returned immediately afterwards
after the fix, run() returned promptly after the process started, and the waiting moved back to handle.wait()

Real-cluster WatchDir repro:

the existing watcher path in the SDK continued to deliver CREATE / WRITE around the scheduled 1s file write before and after the fix
before the fix, a direct Filesystem.WatchDir stream delivered start and the first filesystem event only around the 6s request timeout boundary
after the fix, start arrived in about 15ms and the first streamed filesystem event arrived around the scheduled 1s file write

Follow-up validation for maintainer question: interactive Process.Connect + SendInput:

ran a real-cluster repro for an interactive process stream with Process.Connect and delayed input
after the deliberate 1s input delay, the stream delivered echo:ping-from-sendinput within the measurement resolution, and the control-file side effect appeared about 220ms later
the process exited later, confirming the echoed output was streamed before process exit rather than flushed at the end
this indicates the current Process.Connect unbuffering already covers the interactive input scenario, so a separate unbuffered SendInput route is not needed based on this validation

Follow-up validation for maintainer question: PTY path:

ran a separate real-cluster repro through sandbox.pty.create() + sandbox.pty.connect() + sandbox.pty.send_stdin()
with the current patch, sandbox.pty.connect() attached promptly
after the deliberate 1s input delay, sandbox.pty.send_stdin() returned within about 10ms, the PTY stream delivered the marker within about 15ms, and the control-file side effect appeared within about 50ms
the shell exited later, confirming the marker was streamed before process exit rather than flushed at the end
after intentionally removing the unbuffering change, the same PTY repro no longer sustained a usable session: sandbox.pty.create() only returned near the timeout boundary, and the subsequent sandbox.pty.connect() / sandbox.pty.send_stdin() calls failed with NotFoundException
no PTY stream events were delivered and the control-file side effect never appeared in that pre-fix run
this indicates the PTY failure already happens at Start / Connect time under buffered proxy behavior, which is stronger evidence for targeting the streaming routes instead of adding a separate unbuffered SendInput route

Example Repro

The clearest minimal repro for the buffering issue is the low-level envd Filesystem.WatchDir server stream.

The same sandbox is exercised through two paths:

control path: sandbox.files.watch_dir() keeps working and sees the scheduled file write around the expected 1s point
streaming path: direct Filesystem.WatchDir should emit start immediately and then emit the first filesystem event when the file is written

Minimal repro sketch:

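# Assumes sandbox, headers, and filesystem_pb2 are already set up by the surrounding script.

# Control path: the high-level SDK watcher.
watch = sandbox.files.watch_dir("/tmp/watchdir")

# Streaming path: open the low-level envd WatchDir server stream directly.
stream = sandbox.files._rpc.watch_dir(
    filesystem_pb2.WatchDirRequest(path="/tmp/watchdir", recursive=False),
    headers=headers,
    timeout=6.0,
    request_timeout=sandbox.connection_config.get_request_timeout(6.0),
)

# Schedule a detached write about 1s from now so both paths have an event to observe.
sandbox.commands.run(
    "sh -lc 'nohup sh -lc \"sleep 1; printf stream > /tmp/watchdir/stream.txt\" >/dev/null 2>&1 &'"
)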

Observed behavior on a real cluster:

before the fix, the control path still saw the file write around the scheduled 1s point, but the direct WatchDir stream did not deliver start or the first filesystem event until around the 6s request timeout boundary
after the fix, start arrived in about 15ms and the first filesystem event arrived around the scheduled 1s write
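
Arrival times like the ones above can be collected with a small wrapper around any iterable stream. The following is a minimal sketch, not the attached repro script: with_arrival_times and fake_stream are hypothetical names, and fake_stream merely stands in for a real WatchDir stream. Note that the clock starts when iteration begins, not when the request was opened.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def with_arrival_times(stream: Iterable[T]) -> Iterator[Tuple[float, T]]:
    """Yield (seconds_since_iteration_started, item) for each item in stream."""
    started = time.monotonic()
    for item in stream:
        yield time.monotonic() - started, item


def fake_stream() -> Iterator[str]:
    """Stand-in for a server stream: 'start' at once, one event about 50 ms later."""
    yield "start"
    time.sleep(0.05)
    yield "event"


if __name__ == "__main__":
    for elapsed, item in with_arrival_times(fake_stream()):
        print(f"{elapsed * 1000:7.1f} ms  {item}")
```

With a buffered proxy in the path, both timestamps would be expected to cluster near the timeout; with buffering disabled, the first should be close to zero.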

cubesandboxbot · 2026-06-25T11:29:17Z

Review: PR #647 - CubeProxy nginx streaming buffering fix

The change is functionally correct and well-scoped. proxy_buffering off is the right fix for gRPC server-streaming endpoints where nginx buffering batches frames until buffer-full or timeout.

1. Sandbox path bypasses unbuffered endpoints (medium)

Nginx ^~ prefix locations take priority over regex. Requests to /sandbox/<sandbox-id>/<container-port>/process.Process/Start match ^~ /sandbox/ (lines 138/272) and inherit global proxy_buffering on. If streaming endpoints are only host-routed this is fine; otherwise the ^~ /sandbox/ blocks need proxy_buffering off too.

2. Slow-client exhaustion risk (medium)

proxy_buffering off + 7206s timeouts means slow-reading clients hold upstream connections for up to 2 hours. Consider tighter timeouts or application-level heartbeats.

3. proxy_intercept_errors not effective (low)

With buffering off, upstream error details pass through verbatim. Add explicit proxy_intercept_errors off to make intent visible.

Issue	Severity
Sandbox path bypasses unbuffered endpoints	Medium
Slow-client exhaustion	Medium
proxy_intercept_errors silently ineffective	Low

chenhengqi

I made a similar attempt in #577, but we consider it suboptimal anyway.
It would be better if we could address it on the envd side.

Could you please craft an example along with this PR? Thanks.

zyl1121 · 2026-06-26T02:12:01Z

> I made a similar attempt in #577, but we consider it suboptimal anyway.
> It would be better if we could address it on the envd side.
> Could you please craft an example along with this PR? Thanks.

Thanks, that makes sense.

I added a Filesystem.WatchDir repro example to the PR. In the same sandbox, the regular watch helper still sees the scheduled file write in about 1s, while before this fix the low-level WatchDir stream did not deliver events until near the timeout boundary.

This PR keeps the proxy-side mitigation narrow for the currently verified streaming paths, and the same repro should also be useful for follow-up on the envd side.

chenhengqi · 2026-06-26T07:37:27Z

+        # Path-based server-streaming envd endpoints. Keep buffering enabled
+        # elsewhere under /sandbox/ and disable it only for these verified
+        # response streams.
+        location ~ ^/sandbox/[^/]+/\d+/(?:process\.Process/(?:Start|Connect)|filesystem\.Filesystem/WatchDir)$ {


Do we need SendInput here?

> Do we need SendInput here?

I tested this on a real cluster as a follow-up, and I don't think we need a separate unbuffered SendInput route. SendInput itself is still a unary, input-side call; the real-time output is delivered through the already-unbuffered Start / Connect streaming paths.

In the PTY repro, with the current patch, pty.send_stdin() returned in about 10ms and the PTY marker was streamed back in about 15ms. After removing the unbuffering change for Start / Connect, the serial PTY flow was blocked earlier: pty.create() was buffered until near the connection timeout, and the later connect() / send_stdin() calls failed with NotFoundException. This points to the streaming Start / Connect path as the failure point, not SendInput.
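
To make the scope of the path-routed location concrete, the regex from the diff above can be exercised against a few request paths. This is a rough sketch: nginx evaluates location regexes with PCRE, and Python's re module is used here only as an approximation. The sandbox id abc123 and port 8080 are hypothetical values, and is_unbuffered_path_route is an illustrative helper name.

```python
import re

# Pattern copied from the location regex quoted in the review diff.
PATH_ROUTED_STREAM = re.compile(
    r"^/sandbox/[^/]+/\d+/"
    r"(?:process\.Process/(?:Start|Connect)|filesystem\.Filesystem/WatchDir)$"
)


def is_unbuffered_path_route(path: str) -> bool:
    """Return True if the path would match the path-routed streaming location."""
    return PATH_ROUTED_STREAM.match(path) is not None


if __name__ == "__main__":
    samples = [
        "/sandbox/abc123/8080/process.Process/Start",
        "/sandbox/abc123/8080/process.Process/Connect",
        "/sandbox/abc123/8080/filesystem.Filesystem/WatchDir",
        "/sandbox/abc123/8080/process.Process/SendInput",
        "/process.Process/Start",  # host-routed form, handled by a separate location
    ]
    for path in samples:
        print(f"{is_unbuffered_path_route(path)!s:5}  {path}")
```

Under these assumptions, only Start, Connect, and WatchDir match in their path-routed form; SendInput falls through to the default buffered handling, consistent with the reasoning above.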

I also attached the repro script I used. After exporting the corresponding SDK environment variables for the target cluster, it can be run directly, for example:

export E2B_API_URL="<your-cube-api-url>"
export E2B_API_KEY="e2b_000000"
export CUBE_TEMPLATE_ID="<your-template-id>"
export SSL_CERT_FILE="/root/.local/share/mkcert/rootCA.pem"
python demo-pty.py

The script I used: demo-pty.py

kinwin-ustc · 2026-06-27T04:17:02Z

cube_retcode was removed in PR #653, and this PR should follow the same behavior. Furthermore, the configurations for the different locations are almost identical; could they be factored into a shared configuration block and pulled in with include, to keep them maintainable in the future?

Disable nginx response buffering for the verified envd server-streaming endpoints in CubeProxy. This lets early stream frames be delivered promptly for background commands and watch streams, while keeping global buffering behavior unchanged for other routes.

Signed-off-by: zhengyilei <zheng_yilei@qq.com>

zyl1121 · 2026-06-27T08:30:27Z

@kinwin-ustc I updated the PR accordingly.

I removed cube_retcode to stay aligned with #653, and replaced the duplicated inline blocks with two small include snippets so the route-specific parts are easier to maintain.

I also re-ran the real-cluster checks through the normal one-click startup path. With the updated template/runtime config and image together, the verified repros still behave as expected.

kinwin-ustc · 2026-06-27T09:38:46Z

LGTM. Thanks for your PR. I'd also like to ask if you've used it in a real production environment? What are the application scenarios, and what is the approximate cluster size?

zyl1121 · 2026-06-27T11:30:07Z

> LGTM. Thanks for your PR. I'd also like to ask if you've used it in a real production environment? What are the application scenarios, and what is the approximate cluster size?

@kinwin-ustc Thanks. We are not using CubeSandbox as a fully rolled-out production platform yet, but we have already been testing it internally and are planning a small-scale rollout to users.

Currently, our test cluster has 3 nodes, and we have tested up to around 500 sandboxes. Our main scenario is short-lived code execution for agents: the agent itself runs outside the sandbox, and only creates a sandbox when it needs to execute code or perform file operations.

We initially started exploring CubeSandbox for SWE-bench / RL-style training scenarios as well. However, with the current template mechanism, each image needs to be templated and distributed to the nodes. When the number of images becomes large, potentially thousands in our evaluation, this creates significant storage and management pressure.

We also have some internal long-running agent sandbox scenarios, such as Openclaw, but we have not yet migrated those workloads to CubeSandbox. The main reason is that we still need stronger guarantees around sandbox recovery, node failure handling, persistent storage, and observability before relying on it more broadly. On the observability side, I’m also exploring VM-level ideas including possible eBPF-based approaches.

Personally, I think CubeSandbox is very promising. That is why I have been actively testing it and submitting PRs, and I’m happy to keep contributing and exploring it further.

kinwin-ustc · 2026-06-27T14:08:34Z

> We initially started exploring CubeSandbox for SWE-bench / RL-style training scenarios as well. However, with the current template mechanism, each image needs to be templated and distributed to the nodes. When the number of images becomes large, potentially thousands in our evaluation, this creates significant storage and management pressure.

You can use a high-throughput shared storage to store these templates. In addition, CubeSandbox has a fast cold start speed. You can skip the step of creating a template based on an image and directly use the image to cold start the MVM. This will greatly reduce storage overhead and take advantage of CubeSandbox's high concurrency capabilities.

> The main reason is that we still need stronger guarantees around sandbox recovery, node failure handling, persistent storage, and observability

These capabilities are all part of our open source plan.

zyl1121 · 2026-06-27T14:48:09Z

@kinwin-ustc Thanks for the clarification.

> CubeSandbox has a fast cold start speed. You can skip the step of creating a template based on an image and directly use the image to cold start the MVM.

For the image cold-start path, maybe I missed something. Is there any public documentation or example for directly cold-starting an MVM from an OCI image without creating a template first? From the docs and code paths I checked, the current user-facing flow still seems to be create-template-from-image first, then create sandboxes from the generated template_id. So I’m not sure whether you mean an existing hidden/internal path, a planned feature, or the create-from-image API that simplifies template creation.

> You can use a high-throughput shared storage to store these templates.

For the shared storage approach, I also want to confirm the recommended layout. Is the expectation that only template/rootfs/snapshot artifacts are stored on shared storage, while per-sandbox writable/COW layers stay on local disks? I’m not sure whether this separation is already supported in the current implementation, or it is more of a future direction. If the whole cubelet data path is expected to be on shared storage, my concern is that sandbox writable data may add significant pressure to the storage client and backend, depending on the backend type, for example NAS/CephFS or object-storage FUSE mounts.

> These capabilities are all part of our open source plan.

Good to know these capabilities are part of the open-source plan. Is there already a public roadmap or design for per-sandbox / VM-level observability? I’m still at the planning stage, and I’m considering possible eBPF-based approaches inside each MicroVM, with events exported back through a vsock-based path, similar to how sandbox logs are collected in #535. The main areas I’m thinking about are process execution, file activity, and syscall-level auditing. Network activity may already be partially covered by cube-egress / network-agent, so I’m less sure about that part. I’d like to first check whether this overlaps with any existing plan. If I get further with this, I’ll open a separate discussion for the design.

kinwin-ustc · 2026-06-27T16:41:00Z

> For the image cold-start path, maybe I missed something. Is there any public documentation or example for directly cold-starting an MVM from an OCI image without creating a template first? From the docs and code paths I checked, the current user-facing flow still seems to be create-template-from-image first, then create sandboxes from the generated template_id. So I’m not sure whether you mean an existing hidden/internal path, a planned feature, or the create-from-image API that simplifies template creation.

The open-source version does not directly expose this capability in its workflow.

> For the shared storage approach, I also want to confirm the recommended layout. Is the expectation that only template/rootfs/snapshot artifacts are stored on shared storage, while per-sandbox writable/COW layers stay on local disks? I’m not sure whether this separation is already supported in the current implementation, or it is more of a future direction. If the whole cubelet data path is expected to be on shared storage, my concern is that sandbox writable data may add significant pressure to the storage client and backend, depending on the backend type, for example NAS/CephFS or object-storage FUSE mounts.

The first priority for shared storage should be images, followed by memory templates. If you need to clone or roll back at runtime and have high performance requirements, it is best to keep that data local or choose shared storage with CoW capabilities. Temporary storage can go directly on local drives; persistent storage can use our upcoming Volume SDK, whose backend is a custom distributed storage system. The I/O pressure and access pattern determine the choice of backend storage. For images and memory templates, the distributed storage needs relatively high I/O throughput. None of this is supported in the current version; it requires some custom selection and development work.

> Good to know these capabilities are part of the open-source plan. Is there already a public roadmap or design for per-sandbox / VM-level observability? I’m still at the planning stage, and I’m considering possible eBPF-based approaches inside each MicroVM, with events exported back through a vsock-based path, similar to how sandbox logs are collected in #535. The main areas I’m thinking about are process execution, file activity, and syscall-level auditing. Network activity may already be partially covered by cube-egress / network-agent, so I’m less sure about that part. I’d like to first check whether this overlaps with any existing plan. If I get further with this, I’ll open a separate discussion for the design.

This direction is correct. We have already deployed similar solutions internally and implemented higher-performance zero-copy communication channels in the guest kernel, which work similarly to VSOCK but with much higher performance and lower resource consumption.

zyl1121 requested review from chenhengqi, fslongjin, ls-ggg, tinklone and up2wing as code owners June 25, 2026 11:22