Intermittent sandbox/Agent Drive API failures after standby/resume #142

@vivek100

Raw logs and full repro artifacts are in this secret Gist: https://gist.github.com/vivek100/f3ebec62813042ec63bacb24e855e4f8

Blaxel Incident Report: Agent Drive File API Failures Through Sandbox

Summary

OpenCowork is seeing intermittent failures when accessing an Agent Drive through Blaxel sandboxes. The same flows sometimes work and sometimes fail. The failures appear in three paths:

  • OpenCowork HTTP API routes that write/list/read files through a mounted sandbox drive.
  • TypeScript SDK calls against @blaxel/core.
  • Python SDK calls against blaxel.

In one app-level route test, Blaxel reports the drive mounted, but files written under /workspace are not visible through the drive listing route. In direct SDK repros, sandbox sub-APIs such as drives.list, drives.mount, process.exec, and fs.read fail with connection/fetch errors.

Important nuance: the OpenCowork frontend often surfaces these as HTTP 400 responses because our Express drive routes currently catch sandbox/SDK errors and return res.status(400).json({ error: err.message }). A frontend 400 in this area should therefore be read as "the backend drive operation failed" unless the response body is one of our explicit validation errors such as Invalid drive path, No files provided, Too many files, Missing dataBase64, or exceeds 5MB upload limit.
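A minimal sketch of how the drive routes could separate these cases, assuming the validation messages listed above are matched by substring and that connection failures carry the "fetch failed" message seen in the SDK repros. The helper name mapDriveError, the kind field, and the 503/502 status choices are hypothetical and not existing OpenCowork code.

```typescript
// Hypothetical helper: classify a caught drive-route error instead of always returning 400.
const VALIDATION_MESSAGES = [
  'Invalid drive path',
  'No files provided',
  'Too many files',
  'Missing dataBase64',
  'exceeds 5MB upload limit',
];

// Message fragments treated as "sandbox unreachable". Only "fetch failed" comes
// from the repro logs; extend this list as other connection errors are observed.
const CONNECTION_MARKERS = ['fetch failed'];

type DriveErrorKind = 'validation' | 'upstream_unreachable' | 'upstream_error';

type DriveErrorResponse = {
  status: number;
  body: { error: string; kind: DriveErrorKind };
};

function mapDriveError(err: unknown): DriveErrorResponse {
  const message = err instanceof Error ? err.message : String(err);

  // Our own explicit validation errors stay HTTP 400.
  if (VALIDATION_MESSAGES.some((m) => message.includes(m))) {
    return { status: 400, body: { error: message, kind: 'validation' } };
  }
  // Connection-level SDK failures become 503 (assumed status choice).
  if (CONNECTION_MARKERS.some((m) => message.includes(m))) {
    return { status: 503, body: { error: message, kind: 'upstream_unreachable' } };
  }
  // Anything else is reported as a generic upstream failure (assumed status choice).
  return { status: 502, body: { error: message, kind: 'upstream_error' } };
}
```

In a route's catch block this would replace the current response with something like `const { status, body } = mapDriveError(err); res.status(status).json(body);`, so the frontend can tell a bad request from an unreachable sandbox.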

Please inspect Blaxel logs for the sandbox names below around the UTC timestamps listed.

Environment

  • Blaxel workspace: openclawguy
  • Sandbox image/template: template-guardian
  • Agent Drive name/id: open-cowork-agent-drive
  • Agent Drive display name: OpenCowork Agent Drive
  • Agent Drive region: us-was-1
  • Drive mount path: /workspace
  • Drive path: /
  • OpenCowork agent id: code-agent
  • Session-scoped harness id format: augment-agent-<sessionId>
  • Session-scoped sandbox name format: augment-<sessionId prefix>
  • Node version from failing app test: v22.13.1
  • TypeScript SDK package: @blaxel/core@0.2.79
  • Python SDK package path in stack trace: blaxel.core.sandbox.default.*

No API keys are included in this report.

Repro Run 1: OpenCowork HTTP API Through Mounted Sandbox

Run time: 2026-05-13T18:00Z approximate.

Command:

cd <open-cowork>/personalV0/augment/server
npm run test:e2e:drive-api

Observed run identifiers:

  • Session id prefix: 78e63172
  • Inferred sandbox name prefix: augment-78e63172
  • Drive: open-cowork-agent-drive
  • Mount returned by drives.list: {"driveName":"open-cowork-agent-drive","mountPath":"/workspace","drivePath":"/"}

Relevant output:

> @augment/server@0.1.0 test:e2e:drive-api
> tsx src/tests/e2e-drive-api.ts

[drives.list] raw response: {"mounts":[]}
ok: session created - 78e63172
ok: workspace registered
ok: drive registered - open-cowork-agent-drive
ok: drive endpoint returns id - open-cowork-agent-drive
ok: drive endpoint returns mount path - /workspace
[drives.list] raw response: {"mounts":[{"driveName":"open-cowork-agent-drive","mountPath":"/workspace","drivePath":"/"}]}

Error: root lists demo directory failed
    at assertCheck (...\src\tests\e2e-drive-api.ts:20:11)
    at <anonymous> (...\src\tests\e2e-drive-api.ts:95:3)

What the test does:

  1. Creates a session with workspaceProvider=sandbox, sandboxProvider=blaxel.
  2. Provisions a real Blaxel sandbox and mounts open-cowork-agent-drive at /workspace.
  3. Uses sandbox.process.exec to run:
       mkdir -p /workspace/demo/subdir &&
       printf "hello from agent drive\n" > /workspace/demo/hello.txt &&
       printf "nested file\n" > /workspace/demo/subdir/nested.txt
  4. Calls the OpenCowork HTTP route GET /api/sessions/:id/drive/files?path=/.

Expected:

  • Root listing includes /demo.

Actual:

  • Blaxel mount list reports the drive mounted.
  • The API does not see /demo, so the test fails at root lists demo directory.

Local captured log:

  • personalV0/augment/server/drive-api-failure-2026-05-13.log

Control Run: Same Drive Can Work Immediately After Mount

Run time: 2026-05-13T18:08Z approximate.

Sandbox intentionally left for Blaxel inspection:

  • Sandbox name: open-cowork-drive-incident-loop-20260513
  • Workspace: openclawguy
  • Region: us-was-1
  • Image/template: template-guardian
  • Drive: open-cowork-agent-drive
  • Mount path: /workspace

This run created a fresh sandbox, mounted the same drive, then performed eight repeated write/read/list attempts. All attempts passed, both immediately after write and after a 1.5 second delay.

Relevant output excerpt:

{"at":"2026-05-13T18:08:09.128Z","label":"drives.mount","ok":true,"value":{"success":true,"message":"Drive mounted successfully","driveName":"open-cowork-agent-drive","mountPath":"/workspace","drivePath":"/"}}
{"at":"2026-05-13T18:08:09.214Z","label":"drives.list.afterMount","ok":true,"value":[{"driveName":"open-cowork-agent-drive","mountPath":"/workspace","drivePath":"/"}]}
{"at":"2026-05-13T18:08:09.407Z","label":"attempt.1.process.write","ok":true}
{"at":"2026-05-13T18:08:09.643Z","label":"attempt.1.fs.read.immediate","ok":true,"value":"loop-1-1778695689214\n"}
{"at":"2026-05-13T18:08:09.910Z","label":"attempt.1.process.list.immediate","ok":true}
...
{"at":"2026-05-13T18:08:26.852Z","label":"attempt.8.fs.read.afterDelay","ok":true,"value":"loop-8-1778695704901\n"}
{"at":"2026-05-13T18:08:26.992Z","label":"attempt.8.process.list.afterDelay","ok":true}

Interpretation of this control run:

  • The failure is not deterministic.
  • The evidence does not support "all reads immediately after mount fail."
  • The same drive and template can work immediately after mount in a fresh sandbox.
  • This makes the issue look intermittent or dependent on sandbox instance/readiness/state, rather than a simple required propagation delay.

Local captured log:

  • personalV0/augment/server/drive-loop-probe-2026-05-13.log
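For reference, the JSON lines in the excerpt above have the shape a small wrapper like the following would produce. This is a sketch, not the probe script itself: the helper name probe, the error field, and stamping the timestamp when the operation settles are assumptions.

```typescript
// One log line per probed operation, mirroring the fields seen in the excerpt.
type ProbeRecord = {
  at: string;
  label: string;
  ok: boolean;
  value?: unknown;
  error?: string; // assumed field for failures; the excerpt shows only successes
};

// Runs one operation, records success or failure, and never throws,
// so a failing call does not abort the remaining attempts.
async function probe<T>(
  label: string,
  op: () => Promise<T>,
  log: (line: string) => void = console.log,
): Promise<ProbeRecord> {
  let record: ProbeRecord;
  try {
    const value = await op();
    const at = new Date().toISOString();
    record = value === undefined ? { at, label, ok: true } : { at, label, ok: true, value };
  } catch (err) {
    const at = new Date().toISOString();
    const error = err instanceof Error ? `${err.name}: ${err.message}` : String(err);
    record = { at, label, ok: false, error };
  }
  log(JSON.stringify(record));
  return record;
}
```

A loop would then call, for example, `await probe('attempt.1.fs.read.immediate', () => sandbox.fs.read(path))` for each step, where sandbox and path come from the surrounding script.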

Follow-Up Probe: Existing Sandboxes After Standby

Run time: 2026-05-13T18:10Z approximate.

Command output captured in:

  • personalV0/augment/server/drive-old-sandbox-probe-2026-05-13.log

This probe reused three already-created incident sandboxes and performed five write/read/list rounds on each.

Summary:

open-cowork-drive-incident-loop-20260513
  ok=16 fail=2
  FAIL 2026-05-13T18:10:22.090Z drives.list.initial: TypeError: fetch failed
  FAIL 2026-05-13T18:10:22.137Z process.pwd: TypeError: fetch failed

open-cowork-drive-incident-20260513
  ok=18 fail=0

open-cowork-drive-incident-20260513-py
  ok=17 fail=1
  FAIL 2026-05-13T18:10:36.998Z process.pwd: TypeError: fetch failed
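
A sketch of how a per-sandbox summary in this format could be derived from probe records, plus a check for the "failures only before the first success" pattern. The record type and both function names are hypothetical, and the check assumes the records are in call order.

```typescript
// Hypothetical record shape: one entry per probed call, in call order.
type SandboxProbeRecord = {
  sandbox: string;
  at: string;
  label: string;
  ok: boolean;
  error?: string;
};

// Builds the "name / ok=N fail=M / FAIL ..." text block for each sandbox.
function summarize(records: SandboxProbeRecord[]): string {
  const bySandbox = new Map<string, SandboxProbeRecord[]>();
  for (const r of records) {
    const list = bySandbox.get(r.sandbox) ?? [];
    list.push(r);
    bySandbox.set(r.sandbox, list);
  }

  const blocks: string[] = [];
  bySandbox.forEach((list, sandbox) => {
    const failures = list.filter((r) => !r.ok);
    const lines = [sandbox, `  ok=${list.length - failures.length} fail=${failures.length}`];
    for (const f of failures) {
      lines.push(`  FAIL ${f.at} ${f.label}: ${f.error ?? 'unknown error'}`);
    }
    blocks.push(lines.join('\n'));
  });
  return blocks.join('\n\n');
}

// True when no call fails after the first successful call, including the case
// of no failures at all. False when nothing ever succeeded.
function noFailuresAfterFirstSuccess(list: SandboxProbeRecord[]): boolean {
  const firstOk = list.findIndex((r) => r.ok);
  if (firstOk === -1) return false;
  return list.slice(firstOk).every((r) => r.ok);
}
```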

Interpretation:

  • This does not look like "old sandboxes always fail."
  • It also does not look like "new sandboxes always work."
  • The strongest pattern from this probe is first-call flakiness after an existing sandbox is in STANDBY and then reused.
  • Once the sandbox completes one operation successfully, subsequent write/read/list calls usually succeed within the same short window.
  • This pattern matches the frontend symptom: opening the file viewer or refreshing the tree can produce a transient HTTP 400/backend error, but a later refresh may work.
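
If the first-call pattern holds, a client-side workaround could retry the first sandbox call a few times before surfacing an error. The following is a minimal sketch under that assumption, not a fix for the underlying issue. The name withWarmupRetry, the attempt count, and the delays are hypothetical. Only errors whose text contains "fetch failed" are treated as transient by default.

```typescript
type RetryOptions = {
  attempts?: number;    // total tries, including the first (assumed default: 3)
  baseDelayMs?: number; // first backoff delay, doubled on each retry
  isTransient?: (err: unknown) => boolean;
};

// Default classifier: matches the connection error text seen in the TypeScript repro.
const looksLikeConnectionFailure = (err: unknown): boolean =>
  err instanceof Error && /fetch failed/.test(`${err.name}: ${err.message}`);

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries op with exponential backoff while the error looks transient.
// Non-transient errors and the final failure are rethrown unchanged.
async function withWarmupRetry<T>(op: () => Promise<T>, options: RetryOptions = {}): Promise<T> {
  const { attempts = 3, baseDelayMs = 250, isTransient = looksLikeConnectionFailure } = options;
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (!isTransient(err) || attempt === attempts) break;
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
  throw lastError;
}
```

Usage would look like `await withWarmupRetry(() => sandbox.drives.list())`. Retrying is only safe for idempotent calls such as list and read. A retried write or exec may run twice if the first attempt reached the sandbox before the connection failed.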

Repro Run 2: Direct TypeScript SDK

Run time: 2026-05-13T18:02Z approximate.

Sandbox intentionally left for Blaxel inspection:

  • Sandbox name: open-cowork-drive-incident-20260513
  • Workspace: openclawguy
  • Region: us-was-1
  • Image/template: template-guardian
  • Drive: open-cowork-agent-drive
  • Mount path: /workspace

Command shape:

import 'dotenv/config';
import { SandboxInstance, DriveInstance } from '@blaxel/core';

const sandbox = await SandboxInstance.createIfNotExists({
  name: 'open-cowork-drive-incident-20260513',
  image: process.env.BL_SANDBOX_TEMPLATE || 'blaxel/base-image:latest',
  memory: 2048,
  region: 'us-was-1',
});

await DriveInstance.createIfNotExists({
  name: 'open-cowork-agent-drive',
  region: 'us-was-1',
  displayName: 'OpenCowork Agent Drive',
});

await sandbox.drives.list();
await sandbox.drives.mount({
  driveName: 'open-cowork-agent-drive',
  mountPath: '/workspace',
  drivePath: '/',
});
await sandbox.process.exec({
  command: "mkdir -p /workspace && printf 'incident repro\\n' > /workspace/incident.txt && sync && ls -la /workspace",
  waitForCompletion: true,
  workingDir: '/',
});
await sandbox.fs.read('/workspace/incident.txt');

Observed output:

--- config ---
{
  "workspace": "openclawguy",
  "sandboxName": "open-cowork-drive-incident-20260513",
  "driveName": "open-cowork-agent-drive",
  "region": "us-was-1",
  "mountPath": "/workspace",
  "drivePath": "/",
  "image": "template-guardian"
}

--- drives.list.before.error ---
TypeError: fetch failed
    at async SandboxDrive.list (.../@blaxel/core/dist/esm/sandbox/drive/drive.js:55:26)

--- drives.mount.error ---
TypeError: fetch failed
    at async SandboxDrive.mount (.../@blaxel/core/dist/esm/sandbox/drive/drive.js:17:26)

[drives.list] raw response: {"mounts":[]}

--- drives.list.after ---
[]

--- fatal.error ---
TypeError: fetch failed
    at async SandboxProcess.exec (.../@blaxel/core/dist/esm/sandbox/process/process.js:111:47)

Local captured log:

  • personalV0/augment/server/sdk-drive-incident-2026-05-13.log
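
With Node's built-in fetch, a TypeError: fetch failed normally carries the underlying network error (for example a connection refusal or DNS failure) on its cause property. Whether @blaxel/core preserves that cause when the error reaches the caller is an assumption we have not verified. A small helper like the hypothetical describeErrorChain below could be used in the repro script to log it when present.

```typescript
type CauseInfo = { name: string; message: string; code?: string };

// Walks err -> err.cause -> ... and collects name/message/code at each level.
// maxDepth guards against cyclic or very deep cause chains.
function describeErrorChain(err: unknown, maxDepth = 5): CauseInfo[] {
  const chain: CauseInfo[] = [];
  let current: unknown = err;
  for (let depth = 0; depth < maxDepth && current instanceof Error; depth++) {
    const withExtras = current as Error & { code?: unknown; cause?: unknown };
    const info: CauseInfo = { name: withExtras.name, message: withExtras.message };
    // Node system errors expose a string code such as ECONNREFUSED.
    if (typeof withExtras.code === 'string') info.code = withExtras.code;
    chain.push(info);
    current = withExtras.cause;
  }
  return chain;
}
```

Logging `JSON.stringify(describeErrorChain(err))` in each catch block would show whether the failure is a refused connection, a reset, a timeout, or a DNS problem, which would help narrow down question 5 in the request below.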

Repro Run 3: Direct Python SDK

Run time: 2026-05-13T18:03Z approximate.

Sandbox intentionally left for Blaxel inspection:

  • Sandbox name: open-cowork-drive-incident-20260513-py
  • Workspace: openclawguy
  • Region: us-was-1
  • Image/template: template-guardian
  • Drive: open-cowork-agent-drive
  • Mount path: /workspace

Command shape:

import asyncio

from blaxel.core import SandboxInstance
from blaxel.core.drive import DriveInstance


async def main():
    # Create (or reuse) the incident sandbox.
    sandbox = await SandboxInstance.create_if_not_exists({
        "name": "open-cowork-drive-incident-20260513-py",
        "image": "template-guardian",
        "memory": 2048,
        "region": "us-was-1",
    })

    # Create (or reuse) the shared Agent Drive.
    await DriveInstance.create_if_not_exists({
        "name": "open-cowork-agent-drive",
        "region": "us-was-1",
        "display_name": "OpenCowork Agent Drive",
    })

    # List current mounts, then mount the drive at /workspace.
    await sandbox.drives.list()
    await sandbox.drives.mount(
        drive_name="open-cowork-agent-drive",
        mount_path="/workspace",
        drive_path="/",
    )

    # Write a file through the shell, then read it back through the filesystem API.
    await sandbox.process.exec({
        "command": "mkdir -p /workspace && printf 'python incident repro\\n' > /workspace/python-incident.txt && sync",
        "wait_for_completion": True,
        "working_dir": "/",
    })
    await sandbox.fs.read("/workspace/python-incident.txt")


# The calls use await, so they must run inside an event loop.
asyncio.run(main())

Observed output:

--- config ---
{
  "workspace": "openclawguy",
  "sandboxName": "open-cowork-drive-incident-20260513-py",
  "driveName": "open-cowork-agent-drive",
  "region": "us-was-1",
  "mountPath": "/workspace",
  "drivePath": "/",
  "image": "template-guardian"
}

--- drives.list.before.error ---
httpx.ConnectError
  File "...site-packages\\blaxel\\core\\sandbox\\default\\drive.py", line 72, in list
    response = await client.get("/drives/mount")

--- drives.mount.error ---
httpx.ConnectError
  File "...site-packages\\blaxel\\core\\sandbox\\default\\drive.py", line 40, in mount
    response = await client.post("/drives/mount", json=payload)

--- process.exec.write.error ---
httpx.ConnectError
  File "...site-packages\\blaxel\\core\\sandbox\\default\\process.py", line 252, in exec
    response = await client.post("/process", json=process.to_dict())

--- sandbox.fs.read.error ---
httpx.ConnectError
  File "...site-packages\\blaxel\\core\\sandbox\\default\\filesystem.py", line 156, in read
    response = await client.get(f"/filesystem/{path}")

Local captured log:

  • personalV0/augment/server/python-sdk-drive-incident-2026-05-13.log

Expected Behavior

  • sandbox.drives.list() should reliably return mounted drives.
  • sandbox.drives.mount() should mount open-cowork-agent-drive at /workspace or return a typed API error.
  • sandbox.process.exec() should execute inside the sandbox once SandboxInstance.createIfNotExists returns a sandbox.
  • Files written under /workspace should be visible through both:
    • shell/process reads inside the sandbox
    • filesystem/list/read APIs used by the SDK and our HTTP routes

Actual Behavior

  • TypeScript SDK intermittently returns TypeError: fetch failed for sandbox drive/process APIs.
  • Python SDK returns httpx.ConnectError for sandbox drive/process/filesystem APIs.
  • In the OpenCowork HTTP route repro, drives.list reports the drive mounted, but a file tree written into /workspace is not visible via the drive listing API.

Interpretation

  • This looks like a flaky sandbox/drive API or sandbox reachability issue, not a deterministic OpenCowork-only validation problem.
  • A fresh control sandbox at 2026-05-13T18:08Z succeeded on immediate post-mount write/read/list operations for eight attempts, so the issue is not simply "read immediately after mount always fails."
  • A follow-up probe at 2026-05-13T18:10Z saw first-call failures after sandbox standby/resume, followed by successful operations. That is the strongest current lead.
  • Both SDKs fail at the sandbox API boundary in the direct SDK repros (Repro Runs 2 and 3), before any OpenCowork-specific file parsing or preview logic runs.
  • The Python stack traces show failures calling sandbox API endpoints:
    • GET /drives/mount
    • POST /drives/mount
    • POST /process
    • GET /filesystem/{path}
  • The TypeScript stack traces show the same classes of failures in @blaxel/core.
  • The mounted drive is reported by Blaxel in one repro, but drive file visibility is inconsistent afterward.
  • OpenCowork should improve its own error mapping so SDK/sandbox connection failures are not returned as generic HTTP 400, but that mapping does not explain the underlying SDK connection failures.

Request for Blaxel

Please inspect logs/metrics for:

  • Workspace: openclawguy
  • Sandbox: open-cowork-drive-incident-20260513
  • Sandbox: open-cowork-drive-incident-20260513-py
  • Sandbox: open-cowork-drive-incident-loop-20260513
  • Session/sandbox prefix from HTTP repro: 78e63172 / augment-78e63172...
  • Drive: open-cowork-agent-drive
  • Time window: 2026-05-13T18:00:00Z to 2026-05-13T18:11:00Z (the standby/resume probe failures were logged at 18:10:22Z and 18:10:36Z)

Questions:

  1. Are the sandbox internal API endpoints failing to come up or becoming unreachable after sandbox creation/resume?
  2. Are Agent Drive mounts succeeding but not surfacing a consistent filesystem view at /workspace?
  3. Is template-guardian missing something required for sandbox API/drive support, or is this happening below the template layer?
  4. Is there an account/quota/rate-limit condition that would cause fetch failed/httpx.ConnectError instead of a typed Blaxel API error?
  5. Can the SDKs expose the underlying URL/status/error body for these sandbox API connection failures?
