
fix(CubeProxy): disable buffering for envd streaming endpoints #647

Open
zyl1121 wants to merge 1 commit into TencentCloud:master from zyl1121:fix/cubeproxy-envd-stream-buffering

@zyl1121

@zyl1121 zyl1121 commented Jun 25, 2026

Contributor

Motivation

CubeProxy currently enables proxy_buffering on globally. This is fine for ordinary unary envd endpoints, but it can delay response frames from envd server-streaming endpoints until the stream completes or the request times out.

Two regressions were verified on a real cluster:

  1. sandbox.commands.run(..., background=True) did not return immediately. run() blocked until the command exited, and handle.wait() returned immediately afterwards.
  2. filesystem.Filesystem/WatchDir did not deliver its initial start event or first filesystem event in real time. The first streamed events arrived only near the request timeout, even though the watched file changed much earlier.
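Both symptoms fit a simple picture of what a buffering proxy does to a stream. The sketch below is a toy model, not nginx itself: `buffered` holds frames until an assumed `capacity` in bytes is reached or the upstream closes, while `passthrough` forwards each frame immediately. The `upstream` generator and its frame contents are hypothetical stand-ins for an envd server stream.

```python
from typing import Iterable, Iterator, List


def passthrough(frames: Iterable[bytes]) -> Iterator[bytes]:
    """Forward each upstream frame as soon as it arrives (buffering off)."""
    for frame in frames:
        yield frame


def buffered(frames: Iterable[bytes], capacity: int) -> Iterator[bytes]:
    """Hold frames until `capacity` bytes accumulate or the upstream ends."""
    pending: List[bytes] = []
    size = 0
    for frame in frames:
        pending.append(frame)
        size += len(frame)
        if size >= capacity:
            yield from pending
            pending, size = [], 0
    # Upstream closed: flush whatever is left.
    yield from pending


def upstream(log: List[str]) -> Iterator[bytes]:
    """Hypothetical envd stream: a small start frame, then the process exits."""
    log.append("upstream: start frame sent")
    yield b"start"
    log.append("upstream: process exited")
    yield b"end"


def trace(proxy_output: Iterator[bytes], log: List[str]) -> List[str]:
    """Record the order in which the client sees frames."""
    for frame in proxy_output:
        log.append(f"client: received {frame.decode()}")
    return log


log_off: List[str] = []
trace(passthrough(upstream(log_off)), log_off)

log_on: List[str] = []
trace(buffered(upstream(log_on), capacity=4096), log_on)

print("\n".join(log_off))
print("---")
print("\n".join(log_on))
```

In the pass-through trace the client sees the start frame before the process exits. In the buffered trace the small start frame never fills the buffer, so the client only sees it after the upstream has finished, which matches run(..., background=True) blocking until the command exits.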

What Changed

This PR disables nginx response buffering only for the currently verified envd server-streaming endpoints:

  • process.Process/Start
  • process.Process/Connect
  • filesystem.Filesystem/WatchDir

The change is applied in both CubeProxy data-plane server blocks. The global default buffering behavior remains unchanged for all other routes, keeping the scope narrow.

It covers both CubeProxy entrypoints for those verified streams:

  • host-routed requests such as /process.Process/Start
  • path-routed requests such as /sandbox/<sandbox-id>/<container-port>/process.Process/Start
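A quick way to sanity-check which request paths fall under the unbuffered routes is to run the patterns against sample paths. In this sketch, `PATH_ROUTED` is the path-based location pattern quoted from CubeProxy/nginx.conf in the review thread; `HOST_ROUTED` is an assumption written for illustration and may not match the actual host-routed location in the patch. The sandbox ID and port in the samples are made up.

```python
import re

# Path-routed pattern, as quoted from CubeProxy/nginx.conf in the review thread.
PATH_ROUTED = re.compile(
    r"^/sandbox/[^/]+/\d+/(?:process\.Process/(?:Start|Connect)"
    r"|filesystem\.Filesystem/WatchDir)$"
)

# Host-routed pattern: an assumption for illustration, not taken from the PR.
HOST_ROUTED = re.compile(
    r"^/(?:process\.Process/(?:Start|Connect)|filesystem\.Filesystem/WatchDir)$"
)


def is_unbuffered(path: str) -> bool:
    """Return True if the path would hit one of the unbuffered stream routes."""
    return bool(PATH_ROUTED.match(path) or HOST_ROUTED.match(path))


samples = [
    "/process.Process/Start",
    "/sandbox/sb-123/49983/process.Process/Connect",
    "/sandbox/sb-123/49983/filesystem.Filesystem/WatchDir",
    "/process.Process/SendInput",
    "/sandbox/sb-123/49983/process.Process/SendInput",
]
for sample in samples:
    print(f"{is_unbuffered(sample)!s:5}  {sample}")
```

Only the three verified streaming methods match; unary calls such as SendInput keep the default buffered behavior on both entrypoints.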

Testing

Real-cluster background=True repro:

  • before the fix, sandbox.commands.run(..., background=True) still waited for the command to finish before returning, and handle.wait() returned immediately afterwards
  • after the fix, run() returned promptly after the process started, and the waiting moved back to handle.wait()

Real-cluster WatchDir repro:

  • the existing watcher path in the SDK continued to deliver CREATE / WRITE around the scheduled 1s file write before and after the fix
  • before the fix, a direct Filesystem.WatchDir stream delivered start and the first filesystem event only around the 6s request timeout boundary
  • after the fix, start arrived in about 15ms and the first streamed filesystem event arrived around the scheduled 1s file write

Follow-up validation for maintainer question: interactive Process.Connect + SendInput:

  • ran a real-cluster repro for an interactive process stream with Process.Connect and delayed input
  • after the deliberate 1s input delay, the stream delivered echo:ping-from-sendinput within the measurement resolution, and the control-file side effect appeared about 220ms later
  • the process exited later, confirming the echoed output was streamed before process exit rather than flushed at the end
  • this indicates the current Process.Connect unbuffering already covers the interactive input scenario, so a separate unbuffered SendInput route is not needed based on this validation

Follow-up validation for maintainer question: PTY path:

  • ran a separate real-cluster repro through sandbox.pty.create() + sandbox.pty.connect() + sandbox.pty.send_stdin()
  • with the current patch, sandbox.pty.connect() attached promptly
  • after the deliberate 1s input delay, sandbox.pty.send_stdin() returned within about 10ms, the PTY stream delivered the marker within about 15ms, and the control-file side effect appeared within about 50ms
  • the shell exited later, confirming the marker was streamed before process exit rather than flushed at the end
  • after intentionally removing the unbuffering change, the same PTY repro no longer sustained a usable session: sandbox.pty.create() only returned near the timeout boundary, and the subsequent sandbox.pty.connect() / sandbox.pty.send_stdin() calls failed with NotFoundException
  • no PTY stream events were delivered and the control-file side effect never appeared in that pre-fix run
  • this indicates the PTY failure already happens at Start / Connect time under buffered proxy behavior, which is stronger evidence for targeting the streaming routes instead of adding a separate unbuffered SendInput route

Example Repro

The clearest minimal repro for the buffering issue is the low-level envd Filesystem.WatchDir server stream.

The same sandbox is exercised through two paths:

  • control path: sandbox.files.watch_dir() keeps working and sees the scheduled file write around the expected 1s point
  • streaming path: direct Filesystem.WatchDir should emit start immediately and then emit the first filesystem event when the file is written

Minimal repro sketch:

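```python
# Control path: the SDK watch helper on the same directory.
watch = sandbox.files.watch_dir("/tmp/watchdir")

# Streaming path: open the low-level envd WatchDir server stream directly.
# `sandbox`, `headers`, and `filesystem_pb2` are assumed to be set up by the
# surrounding repro script.
stream = sandbox.files._rpc.watch_dir(
    filesystem_pb2.WatchDirRequest(path="/tmp/watchdir", recursive=False),
    headers=headers,
    timeout=6.0,
    request_timeout=sandbox.connection_config.get_request_timeout(6.0),
)

# Schedule a detached file write about 1s from now so both paths see one event.
sandbox.commands.run(
    "sh -lc 'nohup sh -lc \"sleep 1; printf stream > /tmp/watchdir/stream.txt\" >/dev/null 2>&1 &'"
)
```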

Observed behavior on a real cluster:

  • before the fix, the control path still saw the file write around the scheduled 1s point, but the direct WatchDir stream did not deliver start or the first filesystem event until around the 6s request timeout boundary
  • after the fix, start arrived in about 15ms and the first filesystem event arrived around the scheduled 1s write
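The arrival times quoted above can be collected with a small helper that stamps each event as it is pulled from the stream. This is a minimal sketch, not the script used for the measurements in this PR: `fake_watch_stream` is a hypothetical local stand-in for the real WatchDir stream, with a shortened delay so it runs quickly.

```python
import time
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def timestamp_events(stream: Iterable[T]) -> List[Tuple[float, T]]:
    """Return (seconds since start, event) for every event in the stream."""
    started = time.monotonic()
    return [(time.monotonic() - started, event) for event in stream]


def fake_watch_stream(write_delay: float) -> Iterator[str]:
    """Hypothetical stand-in: emit start at once, then one event after a delay."""
    yield "start"
    time.sleep(write_delay)
    yield "CREATE stream.txt"


events = timestamp_events(fake_watch_stream(write_delay=0.2))
for elapsed, event in events:
    print(f"{elapsed * 1000:7.1f} ms  {event}")
```

Pointing `timestamp_events` at the real stream object should make the difference visible: with buffering on, both timestamps cluster near the request timeout; with buffering off, the first is near zero and the second tracks the scheduled write.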

@cubesandboxbot

cubesandboxbot Bot commented Jun 25, 2026


Review: PR #647 - CubeProxy nginx streaming buffering fix

The change is functionally correct and well-scoped. proxy_buffering off is the right fix for gRPC server-streaming endpoints where nginx buffering batches frames until buffer-full or timeout.


1. Sandbox path bypasses unbuffered endpoints (medium)

Nginx ^~ prefix locations take priority over regex. Requests to /sandbox///process.Process/Start match ^~ /sandbox/ (lines 138/272) and inherit global proxy_buffering on. If streaming endpoints are only host-routed this is fine; otherwise the ^~ /sandbox/ blocks need proxy_buffering off too.
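The precedence rule behind this finding can be modeled in a few lines. The sketch below is a simplified model of nginx location selection (it ignores `~*`, named locations, and nesting), and `STREAM_REGEX` is a hypothetical pattern, not the one from the patch under review.

```python
import re
from typing import List, NamedTuple, Optional


class Location(NamedTuple):
    modifier: str  # "=", "^~", "~" (regex), or "" (plain prefix)
    pattern: str


def select_location(path: str, locations: List[Location]) -> Optional[Location]:
    """Pick a location the way nginx does, in simplified form."""
    # 1. An exact "=" match wins immediately.
    for loc in locations:
        if loc.modifier == "=" and loc.pattern == path:
            return loc
    # 2. Remember the longest matching prefix ("" or "^~").
    prefixes = [
        loc for loc in locations
        if loc.modifier in ("", "^~") and path.startswith(loc.pattern)
    ]
    longest = max(prefixes, key=lambda loc: len(loc.pattern), default=None)
    # 3. A longest prefix marked "^~" skips the regex pass entirely.
    if longest is not None and longest.modifier == "^~":
        return longest
    # 4. Otherwise the first matching regex, in configuration order, wins.
    for loc in locations:
        if loc.modifier == "~" and re.search(loc.pattern, path):
            return loc
    # 5. Fall back to the longest plain prefix, if any.
    return longest


# Hypothetical regex for the streaming methods; not the PR's actual pattern.
STREAM_REGEX = r"/(?:process\.Process/(?:Start|Connect)|filesystem\.Filesystem/WatchDir)$"
path = "/sandbox/sb-123/49983/process.Process/Start"

chosen_with_caret = select_location(
    path, [Location("^~", "/sandbox/"), Location("~", STREAM_REGEX)]
)
chosen_plain = select_location(
    path, [Location("", "/sandbox/"), Location("~", STREAM_REGEX)]
)
print(chosen_with_caret)
print(chosen_plain)
```

With `^~ /sandbox/` present, the regex location is never consulted for path-routed requests, so any buffering override has to live inside or ahead of that prefix block.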

2. Slow-client exhaustion risk (medium)

proxy_buffering off + 7206s timeouts means slow-reading clients hold upstream connections for up to 2 hours. Consider tighter timeouts or application-level heartbeats.
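For a rough sense of scale, the timeout converts as follows, and Little's law gives a steady-state estimate of how many upstream connections stalled clients could pin. The arrival rate here is a made-up number for illustration, not a measurement from any cluster.

```python
from datetime import timedelta

PROXY_TIMEOUT_SECONDS = 7206  # timeout value quoted in this review

# How long one stalled client can hold an upstream connection.
max_hold = timedelta(seconds=PROXY_TIMEOUT_SECONDS)


def held_connections(stalled_per_minute: float, hold_seconds: float) -> float:
    """Little's law: connections held = arrival rate * time each one is held."""
    return (stalled_per_minute / 60.0) * hold_seconds


# Hypothetical load: one stalled streaming client per minute.
print(max_hold)
print(held_connections(1.0, PROXY_TIMEOUT_SECONDS))
```

Under that assumed rate, roughly 120 upstream connections would stay occupied at any time if every stalled client were held until the timeout.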

3. proxy_intercept_errors not effective (low)

With buffering off, upstream error details pass through verbatim. Add explicit proxy_intercept_errors off to make intent visible.


| Issue | Severity |
| --- | --- |
| Sandbox path bypasses unbuffered endpoints | Medium |
| Slow-client exhaustion | Medium |
| proxy_intercept_errors silently ineffective | Low |

Comment thread CubeProxy/nginx.conf Outdated
zyl1121 force-pushed the fix/cubeproxy-envd-stream-buffering branch from 956aed3 to 9168339 on June 25, 2026 11:51
Comment thread CubeProxy/nginx.conf Outdated
zyl1121 force-pushed the fix/cubeproxy-envd-stream-buffering branch from 9168339 to 1f3633a on June 25, 2026 12:17

@chenhengqi chenhengqi left a comment

Collaborator

I have a similar attempt in #577, but we consider it suboptimal anyway.
It would be better if we could address it on the envd side.

Could you please craft an example along with this PR? Thanks.

@zyl1121

zyl1121 commented Jun 26, 2026

Contributor Author

> I have a similar attempt in #577, but we consider it suboptimal anyway.
> It would be better if we could address it on the envd side.
> Could you please craft an example along with this PR? Thanks.

Thanks, that makes sense.

I added a Filesystem.WatchDir repro example to the PR. In the same sandbox, the regular watch helper still sees the scheduled file write in about 1s, while before this fix the low-level WatchDir stream did not deliver events until near the timeout boundary.

This PR keeps the proxy-side mitigation narrow for the currently verified streaming paths, and the same repro should also be useful for follow-up on the envd side.

Comment thread CubeProxy/nginx.conf
# Path-based server-streaming envd endpoints. Keep buffering enabled
# elsewhere under /sandbox/ and disable it only for these verified
# response streams.
location ~ ^/sandbox/[^/]+/\d+/(?:process\.Process/(?:Start|Connect)|filesystem\.Filesystem/WatchDir)$ {

Collaborator

Do we need SendInput here?

@zyl1121 zyl1121 Jun 27, 2026

Contributor Author

> Do we need SendInput here?

I tested this on a real cluster as a follow-up, and I don't think we need a separate unbuffered SendInput route. SendInput itself is still a unary, input-side call; the real-time output is delivered through the already-unbuffered Start / Connect streaming paths.

In the PTY repro, with the current patch, pty.send_stdin() returned in about 10ms and the PTY marker was streamed back in about 15ms. After removing the unbuffering change for Start / Connect, the serial PTY flow was blocked earlier: pty.create() was buffered until near the connection timeout, and the later connect() / send_stdin() calls failed with NotFoundException. This points to the streaming Start / Connect path as the failure point, not SendInput.

I also attached the repro script I used. After exporting the corresponding SDK environment variables for the target cluster, it can be run directly, for example:

```shell
export E2B_API_URL="<your-cube-api-url>"
export E2B_API_KEY="e2b_000000"
export CUBE_TEMPLATE_ID="<your-template-id>"
export SSL_CERT_FILE="/root/.local/share/mkcert/rootCA.pem"
python demo-pty.py
```

The script I used: demo-pty.py

@kinwin-ustc

Collaborator

cube_retcode was removed in PR #653, and this PR should follow the same behavior. Furthermore, the configurations for different locations are almost identical; could it be made into a generic configuration block, using include to ensure future maintainability?

zyl1121 force-pushed the fix/cubeproxy-envd-stream-buffering branch from 1f3633a to 47dd074 on June 27, 2026 08:11
Comment thread CubeProxy/nginx.conf
Disable nginx response buffering for the verified envd server-streaming endpoints in CubeProxy.

This lets early stream frames be delivered promptly for background commands and watch streams, while keeping global buffering behavior unchanged for other routes.

Signed-off-by: zhengyilei <zheng_yilei@qq.com>
zyl1121 force-pushed the fix/cubeproxy-envd-stream-buffering branch from 47dd074 to a779149 on June 27, 2026 08:30
@zyl1121

zyl1121 commented Jun 27, 2026

Contributor Author

@kinwin-ustc I updated the PR accordingly.

I removed cube_retcode to stay aligned with #653, and replaced the duplicated inline blocks with two small include snippets so the route-specific parts are easier to maintain.

I also re-ran the real-cluster checks through the normal one-click startup path. With the updated template/runtime config and image together, the verified repros still behave as expected.

@kinwin-ustc

Collaborator

LGTM. Thanks for your PR. I'd also like to ask if you've used it in a real production environment? What are the application scenarios, and what is the approximate cluster size?

@zyl1121

zyl1121 commented Jun 27, 2026

Contributor Author

> LGTM. Thanks for your PR. I'd also like to ask if you've used it in a real production environment? What are the application scenarios, and what is the approximate cluster size?

@kinwin-ustc Thanks. We are not using CubeSandbox as a fully rolled-out production platform yet, but we have already been testing it internally and are planning a small-scale rollout to users.

Currently, our test cluster has 3 nodes, and we have tested up to around 500 sandboxes. Our main scenario is short-lived code execution for agents: the agent itself runs outside the sandbox, and only creates a sandbox when it needs to execute code or perform file operations.

We initially started exploring CubeSandbox for SWE-bench / RL-style training scenarios as well. However, with the current template mechanism, each image needs to be templated and distributed to the nodes. When the number of images becomes large, potentially thousands in our evaluation, this creates significant storage and management pressure.

We also have some internal long-running agent sandbox scenarios, such as Openclaw, but we have not yet migrated those workloads to CubeSandbox. The main reason is that we still need stronger guarantees around sandbox recovery, node failure handling, persistent storage, and observability before relying on it more broadly. On the observability side, I’m also exploring VM-level ideas including possible eBPF-based approaches.

Personally, I think CubeSandbox is very promising. That is why I have been actively testing it and submitting PRs, and I’m happy to keep contributing and exploring it further.

@kinwin-ustc

Collaborator

> We initially started exploring CubeSandbox for SWE-bench / RL-style training scenarios as well. However, with the current template mechanism, each image needs to be templated and distributed to the nodes. When the number of images becomes large, potentially thousands in our evaluation, this creates significant storage and management pressure.

You can use a high-throughput shared storage to store these templates. In addition, CubeSandbox has a fast cold start speed. You can skip the step of creating a template based on an image and directly use the image to cold start the MVM. This will greatly reduce storage overhead and take advantage of CubeSandbox's high concurrency capabilities.

> The main reason is that we still need stronger guarantees around sandbox recovery, node failure handling, persistent storage, and observability

These capabilities are all part of our open source plan.

@zyl1121

zyl1121 commented Jun 27, 2026

Contributor Author

@kinwin-ustc Thanks for the clarification.

> CubeSandbox has a fast cold start speed. You can skip the step of creating a template based on an image and directly use the image to cold start the MVM.

For the image cold-start path, maybe I missed something. Is there any public documentation or example for directly cold-starting an MVM from an OCI image without creating a template first? From the docs and code paths I checked, the current user-facing flow still seems to be create-template-from-image first, then create sandboxes from the generated template_id. So I’m not sure whether you mean an existing hidden/internal path, a planned feature, or the create-from-image API that simplifies template creation.

> You can use a high-throughput shared storage to store these templates.

For the shared storage approach, I also want to confirm the recommended layout. Is the expectation that only template/rootfs/snapshot artifacts are stored on shared storage, while per-sandbox writable/COW layers stay on local disks? I’m not sure whether this separation is already supported in the current implementation, or it is more of a future direction. If the whole cubelet data path is expected to be on shared storage, my concern is that sandbox writable data may add significant pressure to the storage client and backend, depending on the backend type, for example NAS/CephFS or object-storage FUSE mounts.

> These capabilities are all part of our open source plan.

Good to know these capabilities are part of the open-source plan. Is there already a public roadmap or design for per-sandbox / VM-level observability? I’m still at the planning stage, and I’m considering possible eBPF-based approaches inside each MicroVM, with events exported back through a vsock-based path, similar to how sandbox logs are collected in #535. The main areas I’m thinking about are process execution, file activity, and syscall-level auditing. Network activity may already be partially covered by cube-egress / network-agent, so I’m less sure about that part. I’d like to first check whether this overlaps with any existing plan. If I get further with this, I’ll open a separate discussion for the design.

@kinwin-ustc

Collaborator

> For the image cold-start path, maybe I missed something. Is there any public documentation or example for directly cold-starting an MVM from an OCI image without creating a template first? From the docs and code paths I checked, the current user-facing flow still seems to be create-template-from-image first, then create sandboxes from the generated template_id. So I’m not sure whether you mean an existing hidden/internal path, a planned feature, or the create-from-image API that simplifies template creation.

The open-source version does not directly expose this capability in its workflow.

> For the shared storage approach, I also want to confirm the recommended layout. Is the expectation that only template/rootfs/snapshot artifacts are stored on shared storage, while per-sandbox writable/COW layers stay on local disks? I’m not sure whether this separation is already supported in the current implementation, or it is more of a future direction. If the whole cubelet data path is expected to be on shared storage, my concern is that sandbox writable data may add significant pressure to the storage client and backend, depending on the backend type, for example NAS/CephFS or object-storage FUSE mounts.

Shared storage should be prioritized first for images, then for memory templates. If you need to clone or roll back at runtime and have high performance requirements, it is best to keep that data local or choose shared storage with CoW capabilities. Temporary storage can simply use local drives; persistent storage can use our upcoming Volume SDK, whose backend is a custom distributed storage system. The IO load and access pattern determine the choice of backend storage. For image and memory templates, the distributed storage needs relatively high IO throughput. None of this is supported in the current version, so it requires some custom selection and development work.

> Good to know these capabilities are part of the open-source plan. Is there already a public roadmap or design for per-sandbox / VM-level observability? I’m still at the planning stage, and I’m considering possible eBPF-based approaches inside each MicroVM, with events exported back through a vsock-based path, similar to how sandbox logs are collected in #535. The main areas I’m thinking about are process execution, file activity, and syscall-level auditing. Network activity may already be partially covered by cube-egress / network-agent, so I’m less sure about that part. I’d like to first check whether this overlaps with any existing plan. If I get further with this, I’ll open a separate discussion for the design.

This direction is correct. We have already deployed similar solutions internally and implemented higher-performance zero-copy communication channels in the guest kernel, functionally similar to VSOCK but with much higher performance and lower resource consumption.

3 participants