Intermittent sandbox/Agent Drive API failures after standby/resume

Raw logs and full repro artifacts are in this secret Gist: https://gist.github.com/vivek100/f3ebec62813042ec63bacb24e855e4f8

# Blaxel Incident Report: Agent Drive File API Failures Through Sandbox

## Summary

OpenCowork is seeing intermittent failures when accessing an Agent Drive through Blaxel sandboxes. The same flows sometimes work and sometimes fail. The failures appear in three paths:

- OpenCowork HTTP API routes that write/list/read files through a mounted sandbox drive.
- TypeScript SDK calls against `@blaxel/core`.
- Python SDK calls against `blaxel`.

In one app-level route test, Blaxel reports the drive mounted, but files written under `/workspace` are not visible through the drive listing route. In direct SDK repros, sandbox sub-APIs such as `drives.list`, `drives.mount`, `process.exec`, and `fs.read` fail with connection/fetch errors.

Important nuance: the OpenCowork frontend often surfaces these as HTTP `400` responses because our Express drive routes currently catch sandbox/SDK errors and return `res.status(400).json({ error: err.message })`. A frontend `400` in this area should therefore be read as "the backend drive operation failed" unless the response body is one of our explicit validation errors such as `Invalid drive path`, `No files provided`, `Too many files`, `Missing dataBase64`, or `exceeds 5MB upload limit`.
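The distinction above can be sketched as a small classifier for frontend or log-triage code. This is a minimal sketch, assuming the response body has the `{ error: string }` shape produced by the route described above; the name `classifyDrive400` and the substring matching are illustrative, not existing OpenCowork code.

```ts
// Validation messages listed in this report. Anything else on a 400 is
// treated as a backend/sandbox failure. Substring matching is an assumption.
const VALIDATION_MARKERS: readonly string[] = [
  'Invalid drive path',
  'No files provided',
  'Too many files',
  'Missing dataBase64',
  'exceeds 5MB upload limit',
];

type Drive400Kind = 'validation' | 'backend-failure';

// Hypothetical helper: classify the JSON body of an HTTP 400 from a drive route.
function classifyDrive400(body: { error?: string }): Drive400Kind {
  const message = body.error ?? '';
  return VALIDATION_MARKERS.some((marker) => message.includes(marker))
    ? 'validation'
    : 'backend-failure';
}
```

With this split, a body such as `{ "error": "fetch failed" }` is counted as a backend failure even though the status code is `400`.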

Please inspect Blaxel logs for the sandbox names below around the UTC timestamps listed.

## Environment

- Blaxel workspace: `openclawguy`
- Sandbox image/template: `template-guardian`
- Agent Drive name/id: `open-cowork-agent-drive`
- Agent Drive display name: `OpenCowork Agent Drive`
- Agent Drive region: `us-was-1`
- Drive mount path: `/workspace`
- Drive path: `/`
- OpenCowork agent id: `code-agent`
- Session-scoped harness id format: `augment-agent-<sessionId>`
- Session-scoped sandbox name format: `augment-<sessionId prefix>`
- Node version from failing app test: `v22.13.1`
- TypeScript SDK package: `@blaxel/core@0.2.79`
- Python SDK package path in stack trace: `blaxel.core.sandbox.default.*`

No API keys are included in this report.

## Repro Run 1: OpenCowork HTTP API Through Mounted Sandbox

Run time: `2026-05-13T18:00Z` approximate.

Command:

```powershell
cd <open-cowork>/personalV0/augment/server
npm run test:e2e:drive-api
```

Observed run identifiers:

- Session id prefix: `78e63172`
- Inferred sandbox name prefix: `augment-78e63172`
- Drive: `open-cowork-agent-drive`
- Mount returned by `drives.list`: `{"driveName":"open-cowork-agent-drive","mountPath":"/workspace","drivePath":"/"}`

Relevant output:

```text
> @augment/server@0.1.0 test:e2e:drive-api
> tsx src/tests/e2e-drive-api.ts

[drives.list] raw response: {"mounts":[]}
ok: session created - 78e63172
ok: workspace registered
ok: drive registered - open-cowork-agent-drive
ok: drive endpoint returns id - open-cowork-agent-drive
ok: drive endpoint returns mount path - /workspace
[drives.list] raw response: {"mounts":[{"driveName":"open-cowork-agent-drive","mountPath":"/workspace","drivePath":"/"}]}

Error: root lists demo directory failed
    at assertCheck (...\src\tests\e2e-drive-api.ts:20:11)
    at <anonymous> (...\src\tests\e2e-drive-api.ts:95:3)
```

What the test does:

1. Creates a session with `workspaceProvider=sandbox`, `sandboxProvider=blaxel`.
2. Provisions a real Blaxel sandbox and mounts `open-cowork-agent-drive` at `/workspace`.
3. Uses `sandbox.process.exec` to run:

```bash
mkdir -p /workspace/demo/subdir &&
printf "hello from agent drive\n" > /workspace/demo/hello.txt &&
printf "nested file\n" > /workspace/demo/subdir/nested.txt
```

4. Calls the OpenCowork HTTP route `GET /api/sessions/:id/drive/files?path=/`.

Expected:

- Root listing includes `/demo`.

Actual:

- Blaxel mount list reports the drive mounted.
- The API does not see `/demo`, so the test fails at `root lists demo directory`.

Local captured log:

- `personalV0/augment/server/drive-api-failure-2026-05-13.log`

## Control Run: Same Drive Can Work Immediately After Mount

Run time: `2026-05-13T18:08Z` approximate.

Sandbox intentionally left for Blaxel inspection:

- Sandbox name: `open-cowork-drive-incident-loop-20260513`
- Workspace: `openclawguy`
- Region: `us-was-1`
- Image/template: `template-guardian`
- Drive: `open-cowork-agent-drive`
- Mount path: `/workspace`

This run created a fresh sandbox, mounted the same drive, then performed eight repeated write/read/list attempts. All eight attempts passed, both immediately after the write and after a 1.5-second delay.

Relevant output excerpt:

```text
{"at":"2026-05-13T18:08:09.128Z","label":"drives.mount","ok":true,"value":{"success":true,"message":"Drive mounted successfully","driveName":"open-cowork-agent-drive","mountPath":"/workspace","drivePath":"/"}}
{"at":"2026-05-13T18:08:09.214Z","label":"drives.list.afterMount","ok":true,"value":[{"driveName":"open-cowork-agent-drive","mountPath":"/workspace","drivePath":"/"}]}
{"at":"2026-05-13T18:08:09.407Z","label":"attempt.1.process.write","ok":true}
{"at":"2026-05-13T18:08:09.643Z","label":"attempt.1.fs.read.immediate","ok":true,"value":"loop-1-1778695689214\n"}
{"at":"2026-05-13T18:08:09.910Z","label":"attempt.1.process.list.immediate","ok":true}
...
{"at":"2026-05-13T18:08:26.852Z","label":"attempt.8.fs.read.afterDelay","ok":true,"value":"loop-8-1778695704901\n"}
{"at":"2026-05-13T18:08:26.992Z","label":"attempt.8.process.list.afterDelay","ok":true}
```

Interpretation of this control run:

- The failure is not deterministic.
- The evidence does not support "all reads immediately after mount fail."
- The same drive and template can work immediately after mount in a fresh sandbox.
- This makes the issue look intermittent or dependent on sandbox instance/readiness/state, rather than a simple required propagation delay.

Local captured log:

- `personalV0/augment/server/drive-loop-probe-2026-05-13.log`
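The probe output above is one JSON object per line with `at`, `label`, and `ok` fields. A minimal sketch of tallying such a log follows, assuming that shape holds for every entry and that failed entries carry `ok: false`; `summarizeProbe` is a hypothetical helper, not part of the probe script.

```ts
interface ProbeEntry {
  at: string;
  label: string;
  ok: boolean;
}

interface ProbeSummary {
  ok: number;
  fail: number;
  failedLabels: string[];
}

// Hypothetical helper: tally ok/fail counts from JSON-lines probe output.
// Lines that are not JSON objects (for example an elided "...") are skipped.
function summarizeProbe(log: string): ProbeSummary {
  const summary: ProbeSummary = { ok: 0, fail: 0, failedLabels: [] };
  for (const line of log.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed.startsWith('{')) continue;
    let entry: ProbeEntry;
    try {
      entry = JSON.parse(trimmed) as ProbeEntry;
    } catch {
      continue; // tolerate truncated or malformed lines
    }
    if (entry.ok) {
      summary.ok += 1;
    } else {
      summary.fail += 1;
      summary.failedLabels.push(`${entry.at} ${entry.label}`);
    }
  }
  return summary;
}
```

Running this per sandbox log gives ok/fail counts and the timestamps of failed calls, which makes it easier to see whether failures cluster at the start of a run.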

## Follow-Up Probe: Existing Sandboxes After Standby

Run time: `2026-05-13T18:10Z` approximate.

Command output captured in:

- `personalV0/augment/server/drive-old-sandbox-probe-2026-05-13.log`

This probe reused three already-created incident sandboxes and performed five write/read/list rounds on each.

Summary:

```text
open-cowork-drive-incident-loop-20260513
  ok=16 fail=2
  FAIL 2026-05-13T18:10:22.090Z drives.list.initial: TypeError: fetch failed
  FAIL 2026-05-13T18:10:22.137Z process.pwd: TypeError: fetch failed

open-cowork-drive-incident-20260513
  ok=18 fail=0

open-cowork-drive-incident-20260513-py
  ok=17 fail=1
  FAIL 2026-05-13T18:10:36.998Z process.pwd: TypeError: fetch failed
```

Interpretation:

- This does not look like "old sandboxes always fail."
- It also does not look like "new sandboxes always work."
- The strongest pattern from this probe is first-call flakiness when an existing sandbox has gone into `STANDBY` and is then reused.
- Once the sandbox accepts a successful operation, subsequent write/read/list calls usually succeed in the same short window.
- This pattern matches the frontend symptom: opening the file viewer or refreshing the tree can produce a transient HTTP `400`/backend error, but a later refresh may work.
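If first-call flakiness after standby is the cause, one client-side mitigation is to retry the first sandbox call with a short backoff. The following is a minimal sketch, not a Blaxel SDK feature: `withRetry`, the attempt count, and the delay values are assumptions, and it retries on any thrown error rather than inspecting the error type.

```ts
// Hypothetical helper: retry an async operation a few times with linear backoff.
async function withRetry<T>(
  operation: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 250,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt += 1) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < attempts) {
        // Wait a little longer after each failure before trying again.
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * attempt));
      }
    }
  }
  // All attempts failed: surface the last error to the caller.
  throw lastError;
}
```

A call such as `await withRetry(() => sandbox.drives.list())` would absorb a single transient failure. This only masks the symptom on our side and does not explain why the first call fails.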

## Repro Run 2: Direct TypeScript SDK

Run time: `2026-05-13T18:02Z` approximate.

Sandbox intentionally left for Blaxel inspection:

- Sandbox name: `open-cowork-drive-incident-20260513`
- Workspace: `openclawguy`
- Region: `us-was-1`
- Image/template: `template-guardian`
- Drive: `open-cowork-agent-drive`
- Mount path: `/workspace`

Command shape:

```js
import 'dotenv/config';
import { SandboxInstance, DriveInstance } from '@blaxel/core';

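// The observed config below reports "template-guardian", so BL_SANDBOX_TEMPLATE
// was presumably set to that value for this run rather than the fallback image.
const sandbox = await SandboxInstance.createIfNotExists({
  name: 'open-cowork-drive-incident-20260513',
  image: process.env.BL_SANDBOX_TEMPLATE || 'blaxel/base-image:latest',
  memory: 2048,
  region: 'us-was-1',
});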

await DriveInstance.createIfNotExists({
  name: 'open-cowork-agent-drive',
  region: 'us-was-1',
  displayName: 'OpenCowork Agent Drive',
});

await sandbox.drives.list();
await sandbox.drives.mount({
  driveName: 'open-cowork-agent-drive',
  mountPath: '/workspace',
  drivePath: '/',
});
await sandbox.process.exec({
  command: "mkdir -p /workspace && printf 'incident repro\\n' > /workspace/incident.txt && sync && ls -la /workspace",
  waitForCompletion: true,
  workingDir: '/',
});
await sandbox.fs.read('/workspace/incident.txt');
```

Observed output:

```text
--- config ---
{
  "workspace": "openclawguy",
  "sandboxName": "open-cowork-drive-incident-20260513",
  "driveName": "open-cowork-agent-drive",
  "region": "us-was-1",
  "mountPath": "/workspace",
  "drivePath": "/",
  "image": "template-guardian"
}

--- drives.list.before.error ---
TypeError: fetch failed
    at async SandboxDrive.list (.../@blaxel/core/dist/esm/sandbox/drive/drive.js:55:26)

--- drives.mount.error ---
TypeError: fetch failed
    at async SandboxDrive.mount (.../@blaxel/core/dist/esm/sandbox/drive/drive.js:17:26)

[drives.list] raw response: {"mounts":[]}

--- drives.list.after ---
[]

--- fatal.error ---
TypeError: fetch failed
    at async SandboxProcess.exec (.../@blaxel/core/dist/esm/sandbox/process/process.js:111:47)
```

Local captured log:

- `personalV0/augment/server/sdk-drive-incident-2026-05-13.log`

## Repro Run 3: Direct Python SDK

Run time: `2026-05-13T18:03Z` approximate.

Sandbox intentionally left for Blaxel inspection:

- Sandbox name: `open-cowork-drive-incident-20260513-py`
- Workspace: `openclawguy`
- Region: `us-was-1`
- Image/template: `template-guardian`
- Drive: `open-cowork-agent-drive`
- Mount path: `/workspace`

Command shape:

```python
import asyncio

from blaxel.core import SandboxInstance
from blaxel.core.drive import DriveInstance


async def main():
    # Create (or reuse) the incident sandbox and the shared Agent Drive.
    sandbox = await SandboxInstance.create_if_not_exists({
        "name": "open-cowork-drive-incident-20260513-py",
        "image": "template-guardian",
        "memory": 2048,
        "region": "us-was-1",
    })

    drive = await DriveInstance.create_if_not_exists({
        "name": "open-cowork-agent-drive",
        "region": "us-was-1",
        "display_name": "OpenCowork Agent Drive",
    })

    # List mounts, mount the drive, write a file, then read it back.
    await sandbox.drives.list()
    await sandbox.drives.mount(
        drive_name="open-cowork-agent-drive",
        mount_path="/workspace",
        drive_path="/",
    )
    await sandbox.process.exec({
        "command": "mkdir -p /workspace && printf 'python incident repro\\n' > /workspace/python-incident.txt && sync",
        "wait_for_completion": True,
        "working_dir": "/",
    })
    await sandbox.fs.read("/workspace/python-incident.txt")


asyncio.run(main())
```

Observed output:

```text
--- config ---
{
  "workspace": "openclawguy",
  "sandboxName": "open-cowork-drive-incident-20260513-py",
  "driveName": "open-cowork-agent-drive",
  "region": "us-was-1",
  "mountPath": "/workspace",
  "drivePath": "/",
  "image": "template-guardian"
}

--- drives.list.before.error ---
httpx.ConnectError
  File "...site-packages\\blaxel\\core\\sandbox\\default\\drive.py", line 72, in list
    response = await client.get("/drives/mount")

--- drives.mount.error ---
httpx.ConnectError
  File "...site-packages\\blaxel\\core\\sandbox\\default\\drive.py", line 40, in mount
    response = await client.post("/drives/mount", json=payload)

--- process.exec.write.error ---
httpx.ConnectError
  File "...site-packages\\blaxel\\core\\sandbox\\default\\process.py", line 252, in exec
    response = await client.post("/process", json=process.to_dict())

--- sandbox.fs.read.error ---
httpx.ConnectError
  File "...site-packages\\blaxel\\core\\sandbox\\default\\filesystem.py", line 156, in read
    response = await client.get(f"/filesystem/{path}")
```

Local captured log:

- `personalV0/augment/server/python-sdk-drive-incident-2026-05-13.log`

## Expected Behavior

- `sandbox.drives.list()` should reliably return mounted drives.
- `sandbox.drives.mount()` should mount `open-cowork-agent-drive` at `/workspace` or return a typed API error.
- `sandbox.process.exec()` should execute inside the sandbox once `SandboxInstance.createIfNotExists` returns a sandbox.
- Files written under `/workspace` should be visible through both:
  - shell/process reads inside the sandbox
  - filesystem/list/read APIs used by the SDK and our HTTP routes

## Actual Behavior

- TypeScript SDK intermittently returns `TypeError: fetch failed` for sandbox drive/process APIs.
- Python SDK returns `httpx.ConnectError` for sandbox drive/process/filesystem APIs.
- In the OpenCowork HTTP route repro, `drives.list` reports the drive mounted, but a file tree written into `/workspace` is not visible via the drive listing API.

## Interpretation

- This looks like a flaky sandbox/drive API or sandbox reachability issue, not a deterministic OpenCowork-only validation problem.
- A fresh control sandbox at `2026-05-13T18:08Z` succeeded on immediate post-mount write/read/list operations for eight attempts, so the issue is not simply "read immediately after mount always fails."
- A follow-up probe at `2026-05-13T18:10Z` saw first-call failures after sandbox standby/resume, followed by successful operations. That is the strongest current lead.
- Both SDKs fail at the sandbox API boundary in the direct SDK repros (Repro Runs 2 and 3), before any OpenCowork-specific file parsing or preview logic runs.
- The Python stack traces show failures calling sandbox API endpoints:
  - `GET /drives/mount`
  - `POST /drives/mount`
  - `POST /process`
  - `GET /filesystem/{path}`
- The TypeScript stack traces show the same classes of failures in `@blaxel/core`.
- The mounted drive is reported by Blaxel in one repro, but drive file visibility is inconsistent afterward.
- OpenCowork should improve its own error mapping so SDK/sandbox connection failures are not returned as generic HTTP `400`, but that mapping does not explain the underlying SDK connection failures.
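The last point could be handled with a small status-mapping step in the drive routes' catch blocks. This is a minimal sketch, assuming connection failures surface as `TypeError: fetch failed` as in the TypeScript repro; the name `statusForDriveError` and the choice of `503` are illustrative, not current OpenCowork behavior.

```ts
// Hypothetical helper: pick an HTTP status for an error caught in a drive route.
// Connection-level failures (as seen in the TypeScript repro) map to 503 so the
// frontend can tell them apart from request validation errors. Everything else
// keeps the current 400 so existing behavior is otherwise unchanged.
function statusForDriveError(err: unknown): number {
  const isFetchFailure =
    err instanceof TypeError && err.message === 'fetch failed';
  return isFetchFailure ? 503 : 400;
}
```

In a route this would replace the hard-coded status, for example `res.status(statusForDriveError(err)).json({ error: err.message })`.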

## Request for Blaxel

Please inspect logs/metrics for:

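- Workspace: `openclawguy`
- Sandbox: `open-cowork-drive-incident-20260513`
- Sandbox: `open-cowork-drive-incident-20260513-py`
- Sandbox: `open-cowork-drive-incident-loop-20260513`
- Session/sandbox prefix from HTTP repro: `78e63172` / `augment-78e63172...`
- Drive: `open-cowork-agent-drive`
- Time window: `2026-05-13T18:00:00Z` to `2026-05-13T18:11:00Z` (repro runs at about `18:00Z` to `18:03Z`, control run at about `18:08Z`, standby probe failures at `18:10:22Z` and `18:10:36Z`)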

Questions:

1. Are the sandbox internal API endpoints failing to come up or becoming unreachable after sandbox creation/resume?
2. Are Agent Drive mounts succeeding but not surfacing a consistent filesystem view at `/workspace`?
3. Is `template-guardian` missing something required for sandbox API/drive support, or is this happening below the template layer?
4. Is there an account/quota/rate-limit condition that would cause `fetch failed`/`httpx.ConnectError` instead of a typed Blaxel API error?
5. Can the SDKs expose the underlying URL/status/error body for these sandbox API connection failures?
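On question 5, some detail may already be recoverable on the Node side. Node's built-in `fetch` attaches the underlying network error to the `cause` property of the `TypeError: fetch failed` it throws. The sketch below flattens that chain for logging, assuming the SDK uses the built-in `fetch`, as the error message suggests. `describeFetchError` is a hypothetical helper, and fields such as `code` are treated as optional because their presence depends on the failure.

```ts
type ErrorWithDetails = Error & { code?: unknown; cause?: unknown };

// Hypothetical helper: flatten an error and its `cause` chain into one log line.
function describeFetchError(err: unknown): string {
  const parts: string[] = [];
  let current: unknown = err;
  // Walk at most a few levels to avoid looping on a cyclic cause chain.
  for (let depth = 0; depth < 5 && current instanceof Error; depth += 1) {
    const details = current as ErrorWithDetails;
    const suffix = typeof details.code === 'string' ? ` [${details.code}]` : '';
    parts.push(`${details.name}: ${details.message}${suffix}`);
    current = details.cause;
  }
  return parts.join(' <- ');
}
```

Logging this string in our catch blocks would show whether a failure is, for example, a refused connection, a DNS error, or a timeout, which may help narrow down questions 1 and 4.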


