Merged
24 changes: 9 additions & 15 deletions METHODOLOGY.md
@@ -51,7 +51,7 @@ const start = performance.now();
const sandbox = await compute.sandbox.create();

// 3. Execute a trivial command to confirm interactivity
-await sandbox.runCommand('echo "benchmark"');
+await sandbox.runCommand('node -v');

// 4. Stop timer
const ttiMs = performance.now() - start;
@@ -60,13 +60,13 @@ const ttiMs = performance.now() - start;
await sandbox.destroy();
```

-### Why `echo "benchmark"`?
+### Why `node -v`?

We use a minimal command to isolate sandbox startup time from command complexity. The command:
- Has negligible execution time
- Requires no file system access
- Produces deterministic output
-- Validates the full request/response cycle
+- Confirms the Node.js runtime is available and functional

## Test Modes

@@ -178,12 +178,9 @@ For each provider, we report:

| Metric | Description |
|--------|-------------|
-| **Min** | Fastest iteration (best case) |
-| **Max** | Slowest iteration (worst case) |
| **Median** | Middle value (typical case) |
| **P95** | 95th percentile (tail latency) |
| **P99** | 99th percentile (extreme tail) |
-| **Average** | Arithmetic mean |
| **Success Rate** | Iterations completed without error |

We emphasize **median** as the primary metric because it's robust to outliers and represents the typical developer experience.
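For illustration, the percentile metrics above can be computed with a nearest-rank rule. This is a sketch only; the repository's actual interpolation method is not shown in this diff.

```javascript
// Nearest-rank percentile: p in [0, 100] over an ascending-sorted array.
// The exact interpolation rule is an assumption, not the repo's code.
function percentile(sorted, p) {
  if (sorted.length === 0) throw new Error("no samples");
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

const times = [132, 101, 148, 119, 125].sort((a, b) => a - b);
console.log(percentile(times, 50)); // → 125 (median, the typical case)
console.log(percentile(times, 95)); // → 148 (tail latency)
```

Note the numeric comparator in `sort`: JavaScript's default sort is lexicographic and would misorder timing values.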
@@ -204,15 +201,15 @@ A 200ms median scores 98. A 4,000ms median scores 60. Anything at or above 10s s

The **timingScore** is a weighted sum of individual metric scores. The **successRate** (0–1) acts as a linear multiplier — a provider with 50% success has its score halved.

+Before computing timing statistics, the bottom 5% and top 5% of successful iteration times are trimmed to reduce the influence of outliers caused by transient network issues or cold-start anomalies. Min and max values are still computed from the full dataset for display purposes but are not used in scoring.
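A minimal sketch of that trimming step follows. The 5% bounds come from the text above; the rounding behavior (`Math.floor`) is an assumption.

```javascript
// Drop the fastest 5% and slowest 5% of successful iteration times
// before computing median/P95/P99. Flooring the cut size is an assumption.
function trimOutliers(timesMs, fraction = 0.05) {
  const sorted = [...timesMs].sort((a, b) => a - b);
  const cut = Math.floor(sorted.length * fraction);
  return sorted.slice(cut, sorted.length - cut);
}

// With 100 iterations, the 5 fastest and 5 slowest are discarded.
const hundred = Array.from({ length: 100 }, (_, i) => i + 1);
console.log(trimOutliers(hundred).length); // → 90
```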

**Timing weights** (sum to 1.0):

| Metric | Weight | Rationale |
|--------|--------|-----------|
-| Median | 0.50 | Primary signal — typical developer experience |
-| P95 | 0.20 | Tail latency — consistency matters |
-| Max | 0.15 | Worst-case exposure |
-| P99 | 0.10 | Extreme tail |
-| Min | 0.05 | Best-case capability |
+| Median | 0.60 | Primary signal — typical developer experience |
+| P95 | 0.25 | Tail latency — consistency matters |
+| P99 | 0.15 | Extreme tail — worst-case exposure |

**Why multiplicative?** A provider with lower than 100% success rate shouldn't rank above a provider with 100% success and a slightly slower median. The multiplicative penalty ensures reliability is non-negotiable — a provider must be both fast *and* reliable to score well.
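Putting the ceiling formula, the weights, and the multiplicative penalty together, a sketch of the scoring might look like this (function and variable names are mine, not necessarily the repository's; the weights are the new values from this diff):

```javascript
const CEILING_MS = 10_000;

// score = 100 × (1 − value / 10,000ms), clamped at 0 for values ≥ 10s.
function metricScore(valueMs) {
  return Math.max(0, 100 * (1 - valueMs / CEILING_MS));
}

// Weighted timing score, multiplied by the success rate (0–1) so that
// reliability scales the whole result rather than being a small addend.
function compositeScore(medianMs, p95Ms, p99Ms, successRate) {
  const timing =
    0.60 * metricScore(medianMs) +
    0.25 * metricScore(p95Ms) +
    0.15 * metricScore(p99Ms);
  return timing * successRate;
}

console.log(metricScore(200));                   // ≈ 98
console.log(compositeScore(200, 400, 800, 1.0)); // ≈ 96.6
console.log(compositeScore(200, 400, 800, 0.5)); // ≈ 48.3, halved by reliability
```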

@@ -283,12 +280,9 @@ Each test mode generates its own SVG visualization: `sequential_tti.svg`, `stagg
],
"summary": {
"ttiMs": {
"min": 100.0,
"max": 150.0,
"median": 125.0,
"p95": 140.0,
"p99": 148.0,
"avg": 124.5
"p99": 148.0
}
},
"compositeScore": 96.85,
6 changes: 3 additions & 3 deletions README.md
@@ -20,7 +20,7 @@ API Request → Provisioning → Boot → Ready → First Command
└───────────────────── TTI ─────────────────────┘
```

-Each benchmark creates a fresh sandbox, runs `echo "benchmark"`, and records wall-clock time. 100 iterations per provider, every day, fully automated.
+Each benchmark creates a fresh sandbox, runs `node -v`, and records wall-clock time. 100 iterations per provider, every day, fully automated.

**Powered by ComputeSDK** — We use [ComputeSDK](https://github.com/computesdk/computesdk), a multi-provider SDK, to test all sandbox providers with the same code. One API, multiple providers, fair comparison. Interested in multi-provider failover, sandbox packing, and warm pooling? [Check out ComputeSDK](https://github.com/computesdk/computesdk).

@@ -30,7 +30,7 @@ Each benchmark creates a fresh sandbox, runs `echo "benchmark"`, and records wal

## Methodology

-Each benchmark creates a fresh sandbox, runs `echo "benchmark"`, and records wall-clock time. We run three test modes daily:
+Each benchmark creates a fresh sandbox, runs `node -v`, and records wall-clock time. We run three test modes daily:

**Sequential** — Sandboxes are created one at a time. Each is created, tested, and destroyed before the next begins. 100 iterations per provider. This is the baseline — isolated cold-start performance with no contention.
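The sequential mode can be sketched as a simple create → test → destroy loop. The `compute` object below is a stand-in for ComputeSDK's client (method names assumed from the METHODOLOGY.md snippet, not verified against the SDK):

```javascript
// Stub provider so the sketch is self-contained; a real run would use
// ComputeSDK's actual client instead.
const compute = {
  sandbox: {
    create: async () => ({
      runCommand: async (cmd) => ({ stdout: "v20.0.0" }),
      destroy: async () => {},
    }),
  },
};

// One sandbox at a time: created, tested, and destroyed before the next.
async function sequentialRun(iterations) {
  const times = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    const sandbox = await compute.sandbox.create();
    await sandbox.runCommand("node -v"); // confirm interactivity
    times.push(performance.now() - start); // TTI excludes teardown
    await sandbox.destroy();
  }
  return times;
}

sequentialRun(3).then((t) => console.log(t.length)); // → 3
```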

Expand All @@ -42,7 +42,7 @@ For each provider we report min, max, median, P95, P99, and average TTI, plus a

### Composite Score

-Each timing metric is scored against a fixed 10-second ceiling: `score = 100 × (1 − value / 10,000ms)`. A 200ms median scores 98; anything ≥10s scores 0. These individual scores are combined with weighted emphasis on median (50%), P95 (20%), max (15%), P99 (10%), and min (5%), then multiplied by the provider's success rate (0–1). A provider with 90% success has its score reduced by 10% — reliability is non-negotiable.
+Before computing timing statistics, the bottom 5% and top 5% of successful iterations are trimmed to reduce outlier influence from transient network issues or cold-start anomalies. Each timing metric is then scored against a fixed 10-second ceiling: `score = 100 × (1 − value / 10,000ms)`. A 200ms median scores 98; anything ≥10s scores 0. These individual scores are combined with weighted emphasis on median (60%), P95 (25%), and P99 (15%), then multiplied by the provider's success rate (0–1). A provider with 90% success has its score reduced by 10% — reliability is non-negotiable.

All tests run on GitHub Actions at 00:00 UTC daily. Providers are tested using ComputeSDK — no gateway or proxy layer.
