Merged
24 changes: 9 additions & 15 deletions METHODOLOGY.md
@@ -51,7 +51,7 @@ const start = performance.now();
const sandbox = await compute.sandbox.create();

// 3. Execute a trivial command to confirm interactivity
-await sandbox.runCommand('echo "benchmark"');
+await sandbox.runCommand('node -v');

// 4. Stop timer
const ttiMs = performance.now() - start;
@@ -60,13 +60,13 @@ const ttiMs = performance.now() - start;
await sandbox.destroy();
```

-### Why `echo "benchmark"`?
+### Why `node -v`?

We use a minimal command to isolate sandbox startup time from command complexity. The command:
- Has negligible execution time
- Requires no file system access
- Produces deterministic output
-- Validates the full request/response cycle
+- Confirms the Node.js runtime is available and functional

## Test Modes

@@ -178,12 +178,9 @@ For each provider, we report:

| Metric | Description |
|--------|-------------|
-| **Min** | Fastest iteration (best case) |
-| **Max** | Slowest iteration (worst case) |
| **Median** | Middle value (typical case) |
| **P95** | 95th percentile (tail latency) |
| **P99** | 99th percentile (extreme tail) |
-| **Average** | Arithmetic mean |
| **Success Rate** | Iterations completed without error |

We emphasize **median** as the primary metric because it's robust to outliers and represents the typical developer experience.
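For illustration, the percentile metrics above can be computed with a nearest-rank rule. This is a sketch only; the repository's actual interpolation method is not shown in this diff.

```javascript
// Nearest-rank percentile: p in [0, 100] over an ascending-sorted array.
// The exact interpolation rule is an assumption, not the repo's code.
function percentile(sorted, p) {
  if (sorted.length === 0) throw new Error("no samples");
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

const times = [132, 101, 148, 119, 125].sort((a, b) => a - b);
console.log(percentile(times, 50)); // → 125 (median, the typical case)
console.log(percentile(times, 95)); // → 148 (tail latency)
```

Note the numeric comparator in `sort`: JavaScript's default sort is lexicographic and would misorder timing values.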
@@ -204,15 +201,15 @@ A 200ms median scores 98. A 4,000ms median scores 60. Anything at or above 10s s

The **timingScore** is a weighted sum of individual metric scores. The **successRate** (0–1) acts as a linear multiplier — a provider with 50% success has its score halved.

+Before computing timing statistics, the bottom 5% and top 5% of successful iteration times are trimmed to reduce the influence of outliers caused by transient network issues or cold-start anomalies. Min and max values are still computed from the full dataset for display purposes but are not used in scoring.
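A minimal sketch of that trimming step follows. The 5% bounds come from the text above; the rounding behavior (`Math.floor`) is an assumption.

```javascript
// Drop the fastest 5% and slowest 5% of successful iteration times
// before computing median/P95/P99. Flooring the cut size is an assumption.
function trimOutliers(timesMs, fraction = 0.05) {
  const sorted = [...timesMs].sort((a, b) => a - b);
  const cut = Math.floor(sorted.length * fraction);
  return sorted.slice(cut, sorted.length - cut);
}

// With 100 iterations, the 5 fastest and 5 slowest are discarded.
const hundred = Array.from({ length: 100 }, (_, i) => i + 1);
console.log(trimOutliers(hundred).length); // → 90
```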

**Timing weights** (sum to 1.0):

| Metric | Weight | Rationale |
|--------|--------|-----------|
-| Median | 0.50 | Primary signal — typical developer experience |
-| P95 | 0.20 | Tail latency — consistency matters |
-| Max | 0.15 | Worst-case exposure |
-| P99 | 0.10 | Extreme tail |
-| Min | 0.05 | Best-case capability |
+| Median | 0.60 | Primary signal — typical developer experience |
+| P95 | 0.25 | Tail latency — consistency matters |
+| P99 | 0.15 | Extreme tail — worst-case exposure |

**Why multiplicative?** A provider with lower than 100% success rate shouldn't rank above a provider with 100% success and a slightly slower median. The multiplicative penalty ensures reliability is non-negotiable — a provider must be both fast *and* reliable to score well.
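Putting the ceiling formula, the weights, and the multiplicative penalty together, a sketch of the scoring might look like this (function and variable names are mine, not necessarily the repository's; the weights are the new values from this diff):

```javascript
const CEILING_MS = 10_000;

// score = 100 × (1 − value / 10,000ms), clamped at 0 for values ≥ 10s.
function metricScore(valueMs) {
  return Math.max(0, 100 * (1 - valueMs / CEILING_MS));
}

// Weighted timing score, multiplied by the success rate (0–1) so that
// reliability scales the whole result rather than being a small addend.
function compositeScore(medianMs, p95Ms, p99Ms, successRate) {
  const timing =
    0.60 * metricScore(medianMs) +
    0.25 * metricScore(p95Ms) +
    0.15 * metricScore(p99Ms);
  return timing * successRate;
}

console.log(metricScore(200));                   // ≈ 98
console.log(compositeScore(200, 400, 800, 1.0)); // ≈ 96.6
console.log(compositeScore(200, 400, 800, 0.5)); // ≈ 48.3, halved by reliability
```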

@@ -283,12 +280,9 @@ Each test mode generates its own SVG visualization: `sequential_tti.svg`, `stagg
],
"summary": {
"ttiMs": {
"min": 100.0,
"max": 150.0,
"median": 125.0,
"p95": 140.0,
"p99": 148.0,
"avg": 124.5
"p99": 148.0
}
},
"compositeScore": 96.85,
6 changes: 3 additions & 3 deletions README.md
@@ -20,7 +20,7 @@ API Request → Provisioning → Boot → Ready → First Command
└───────────────────── TTI ─────────────────────┘
```

-Each benchmark creates a fresh sandbox, runs `echo "benchmark"`, and records wall-clock time. 100 iterations per provider, every day, fully automated.
+Each benchmark creates a fresh sandbox, runs `node -v`, and records wall-clock time. 100 iterations per provider, every day, fully automated.

**Powered by ComputeSDK** — We use [ComputeSDK](https://github.com/computesdk/computesdk), a multi-provider SDK, to test all sandbox providers with the same code. One API, multiple providers, fair comparison. Interested in multi-provider failover, sandbox packing, and warm pooling? [Check out ComputeSDK](https://github.com/computesdk/computesdk).

@@ -30,7 +30,7 @@ Each benchmark creates a fresh sandbox, runs `echo "benchmark"`, and records wal

## Methodology

-Each benchmark creates a fresh sandbox, runs `echo "benchmark"`, and records wall-clock time. We run three test modes daily:
+Each benchmark creates a fresh sandbox, runs `node -v`, and records wall-clock time. We run three test modes daily:

**Sequential** — Sandboxes are created one at a time. Each is created, tested, and destroyed before the next begins. 100 iterations per provider. This is the baseline — isolated cold-start performance with no contention.
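The sequential mode can be sketched as a simple create → test → destroy loop. The `compute` object below is a stand-in for ComputeSDK's client (method names assumed from the METHODOLOGY.md snippet, not verified against the SDK):

```javascript
// Stub provider so the sketch is self-contained; a real run would use
// ComputeSDK's actual client instead.
const compute = {
  sandbox: {
    create: async () => ({
      runCommand: async (cmd) => ({ stdout: "v20.0.0" }),
      destroy: async () => {},
    }),
  },
};

// One sandbox at a time: created, tested, and destroyed before the next.
async function sequentialRun(iterations) {
  const times = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    const sandbox = await compute.sandbox.create();
    await sandbox.runCommand("node -v"); // confirm interactivity
    times.push(performance.now() - start); // TTI excludes teardown
    await sandbox.destroy();
  }
  return times;
}

sequentialRun(3).then((t) => console.log(t.length)); // → 3
```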

Expand All @@ -42,7 +42,7 @@ For each provider we report min, max, median, P95, P99, and average TTI, plus a

### Composite Score

-Each timing metric is scored against a fixed 10-second ceiling: `score = 100 × (1 − value / 10,000ms)`. A 200ms median scores 98; anything ≥10s scores 0. These individual scores are combined with weighted emphasis on median (50%), P95 (20%), max (15%), P99 (10%), and min (5%), then multiplied by the provider's success rate (0–1). A provider with 90% success has its score reduced by 10% — reliability is non-negotiable.
+Before computing timing statistics, the bottom 5% and top 5% of successful iterations are trimmed to reduce outlier influence from transient network issues or cold-start anomalies. Each timing metric is then scored against a fixed 10-second ceiling: `score = 100 × (1 − value / 10,000ms)`. A 200ms median scores 98; anything ≥10s scores 0. These individual scores are combined with weighted emphasis on median (60%), P95 (25%), and P99 (15%), then multiplied by the provider's success rate (0–1). A provider with 90% success has its score reduced by 10% — reliability is non-negotiable.

All tests run on GitHub Actions at 00:00 UTC daily. Providers are tested using ComputeSDK — no gateway or proxy layer.
