univeros · tonydspaniard · May 28, 2026 · May 28, 2026
diff --git a/benchmarks/tokens-to-ship/README.md b/benchmarks/tokens-to-ship/README.md
@@ -0,0 +1,78 @@
+# Tokens-to-Ship benchmark harness
+
+Measures the agent tokens, turns, and wall-clock time needed to take the frozen
+[`task.md`](task.md) from a cold prompt to a passing acceptance suite, on two
+arms: **Altair** vs. a **conventional baseline**.
+
+Read [`docs/benchmarks/tokens-to-ship.md`](../../docs/benchmarks/tokens-to-ship.md)
+for the methodology and the honesty guardrails. This README is the operational
+"how to run it."
+
+## Layout
+
+```
+benchmarks/tokens-to-ship/
+├── task.md                     # the frozen task + acceptance criteria (the contract)
+├── README.md                   # this file
+├── score.php                   # aggregates a usage log -> results/results.json + table
+├── acceptance/                 # the external suite both arms must pass (you provide)
+├── arms/
+│   ├── altair/                 # starting fixture for the Altair arm
+│   └── baseline/               # starting fixture for the baseline arm
+└── results/
+    ├── usage-log.sample.json   # synthetic example so the scorer runs out-of-the-box
+    └── results.json            # generated by score.php (real runs go here)
+```
+
+## Protocol (summary)
+
+1. **Prepare both arms identically.** `composer install`, DB up, etc. — done
+   *before* measurement starts and excluded from the token count.
+2. **Run the agent** on each arm with the same model, settings, and tool budget.
+   Preferred: drive it with the Claude Agent SDK so usage is captured per turn.
+   Fallback: run interactively and export the transcript.
+3. **Stop** each run when `acceptance/` passes (the harness decides, not the
+   agent). Record the run as one entry in `results/usage-log.json`.
+4. **Repeat** N ≥ 5 times per arm. Cold start each time.
+5. **Score:** `php score.php results/usage-log.json`.
+
+## Usage-log format
+
+`results/usage-log.json` is a JSON array of run records. One record per run:
+
+```json
+{
+  "arm": "altair",
+  "run": 1,
+  "model": "claude-opus-4-7",
+  "input_tokens": 8200,
+  "output_tokens": 1400,
+  "cache_read_tokens": 0,
+  "turns": 6,
+  "tool_calls": 9,
+  "file_reads": 2,
+  "wallclock_ms": 41000,
+  "acceptance_pass": true
+}
+```
+
+- `input_tokens` / `output_tokens`: summed across every model response in the run.
+- `cache_read_tokens`: reported separately so caching flatters neither arm.
+- `acceptance_pass`: whether the frozen suite passed on this completed attempt
+  (feeds pass@1).
+- The arm named `altair` is the reference; each ratio is the other arm's median
+  over the `altair` median, i.e. `baseline / altair` (higher = bigger Altair advantage).
+
+## Run the scorer
+
+```bash
+php score.php                         # reads results/usage-log.json
+php score.php results/usage-log.sample.json   # wiring test with synthetic data
+```
+
+It prints a per-arm table and writes `results/results.json` (medians, min/max,
+pass@1, and the comparison ratios) for charting, overwriting any earlier copy.
+
+> **The sample log is synthetic** — illustrative numbers to prove the pipeline
+> works. Never publish or chart the sample. Real results only, with the raw log
+> and the `task.md` commit SHA attached.
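Review note on the usage-log format: `summarizeArm()` in `score.php` falls back to `0` for any missing numeric field (`?? 0`), so a malformed record would pull an arm's medians down without any warning. Below is a minimal sketch of a pre-scoring check, assuming the field list in the README is complete. `validateRunRecord` is a hypothetical helper and is not part of this PR.

```php
<?php

/**
 * Hypothetical helper (not part of this PR): returns a list of problems
 * found in one usage-log record, or an empty list when it looks well-formed.
 *
 * @param array<string, mixed> $record
 * @return list<string>
 */
function validateRunRecord(array $record): array
{
    $problems = [];

    // String fields named in the README's usage-log format.
    foreach (['arm', 'model'] as $field) {
        if (!isset($record[$field]) || !is_string($record[$field]) || $record[$field] === '') {
            $problems[] = "$field must be a non-empty string";
        }
    }

    // Counters must be present and be non-negative integers.
    $counters = ['run', 'input_tokens', 'output_tokens', 'cache_read_tokens',
        'turns', 'tool_calls', 'file_reads', 'wallclock_ms'];
    foreach ($counters as $field) {
        if (!isset($record[$field]) || !is_int($record[$field]) || $record[$field] < 0) {
            $problems[] = "$field must be a non-negative integer";
        }
    }

    // The scorer only counts a strict boolean true toward pass@1.
    if (!is_bool($record['acceptance_pass'] ?? null)) {
        $problems[] = 'acceptance_pass must be a boolean';
    }

    return $problems;
}
```

One possible use is to run this over every decoded record before aggregation and abort on the first non-empty result, so that a typo in a field name cannot flatter either arm.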
diff --git a/benchmarks/tokens-to-ship/results/usage-log.sample.json b/benchmarks/tokens-to-ship/results/usage-log.sample.json
@@ -0,0 +1,13 @@
+[
+  { "arm": "altair", "run": 1, "model": "claude-opus-4-7", "input_tokens": 8200, "output_tokens": 1400, "cache_read_tokens": 0, "turns": 6, "tool_calls": 9, "file_reads": 2, "wallclock_ms": 41000, "acceptance_pass": true },
+  { "arm": "altair", "run": 2, "model": "claude-opus-4-7", "input_tokens": 7900, "output_tokens": 1250, "cache_read_tokens": 0, "turns": 5, "tool_calls": 8, "file_reads": 1, "wallclock_ms": 38500, "acceptance_pass": true },
+  { "arm": "altair", "run": 3, "model": "claude-opus-4-7", "input_tokens": 9100, "output_tokens": 1600, "cache_read_tokens": 0, "turns": 7, "tool_calls": 11, "file_reads": 3, "wallclock_ms": 47000, "acceptance_pass": true },
+  { "arm": "altair", "run": 4, "model": "claude-opus-4-7", "input_tokens": 8400, "output_tokens": 1350, "cache_read_tokens": 0, "turns": 6, "tool_calls": 9, "file_reads": 2, "wallclock_ms": 42500, "acceptance_pass": true },
+  { "arm": "altair", "run": 5, "model": "claude-opus-4-7", "input_tokens": 8000, "output_tokens": 1300, "cache_read_tokens": 0, "turns": 6, "tool_calls": 8, "file_reads": 2, "wallclock_ms": 40000, "acceptance_pass": true },
+
+  { "arm": "baseline", "run": 1, "model": "claude-opus-4-7", "input_tokens": 71000, "output_tokens": 9800, "cache_read_tokens": 0, "turns": 24, "tool_calls": 48, "file_reads": 21, "wallclock_ms": 312000, "acceptance_pass": true },
+  { "arm": "baseline", "run": 2, "model": "claude-opus-4-7", "input_tokens": 82000, "output_tokens": 11200, "cache_read_tokens": 0, "turns": 28, "tool_calls": 57, "file_reads": 26, "wallclock_ms": 351000, "acceptance_pass": true },
+  { "arm": "baseline", "run": 3, "model": "claude-opus-4-7", "input_tokens": 68000, "output_tokens": 9100, "cache_read_tokens": 0, "turns": 22, "tool_calls": 44, "file_reads": 19, "wallclock_ms": 289000, "acceptance_pass": false },
+  { "arm": "baseline", "run": 4, "model": "claude-opus-4-7", "input_tokens": 90000, "output_tokens": 12500, "cache_read_tokens": 0, "turns": 31, "tool_calls": 62, "file_reads": 29, "wallclock_ms": 378000, "acceptance_pass": true },
+  { "arm": "baseline", "run": 5, "model": "claude-opus-4-7", "input_tokens": 76000, "output_tokens": 10400, "cache_read_tokens": 0, "turns": 26, "tool_calls": 52, "file_reads": 24, "wallclock_ms": 333000, "acceptance_pass": true }
+]
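Review note: as a wiring check, the headline token numbers the scorer reports for this synthetic sample can be reproduced by hand. The sketch below recomputes the total-token medians (input plus output, as in `totalTokens()`) and their ratio. `medianOf` is a local stand-in for the scorer's `median()`, and the figures are the synthetic ones from the sample, not results.

```php
<?php

// Local stand-in for score.php's median(); assumes a non-empty list.
function medianOf(array $numbers): float
{
    sort($numbers);
    $count = count($numbers);
    $mid = intdiv($count, 2);

    return $count % 2 === 1
        ? (float) $numbers[$mid]
        : ($numbers[$mid - 1] + $numbers[$mid]) / 2;
}

// input_tokens + output_tokens per run, copied from the sample log.
$altair = [8200 + 1400, 7900 + 1250, 9100 + 1600, 8400 + 1350, 8000 + 1300];
$baseline = [71000 + 9800, 82000 + 11200, 68000 + 9100, 90000 + 12500, 76000 + 10400];

$altairMedian = medianOf($altair);                   // 9600.0
$baselineMedian = medianOf($baseline);               // 86400.0
$ratio = round($baselineMedian / $altairMedian, 2);  // 9.0

printf("altair %d, baseline %d, ratio %sx\n", $altairMedian, $baselineMedian, $ratio);
```

Baseline run 3 has `acceptance_pass: false` yet still contributes to the baseline medians, because `summarizeArm()` aggregates every record. Only pass@1 (4 of 5 runs, i.e. 0.8) reflects the failure.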
diff --git a/benchmarks/tokens-to-ship/score.php b/benchmarks/tokens-to-ship/score.php
@@ -0,0 +1,203 @@
+<?php
+
+declare(strict_types=1);
+
+/*
+ * This file is part of the univeros/framework
+ *
+ * For the full copyright and license information, please view
+ * the LICENSE file that was distributed with this source code.
+ */
+
+/*
+ * Tokens-to-Ship scorer.
+ *
+ * Aggregates a usage log (see README.md) into per-arm medians + spread, pass@1,
+ * and baseline/altair comparison ratios. Writes results/results.json and prints
+ * a table. Pure reporting: it never runs the agent or the acceptance suite.
+ *
+ *   php score.php [path/to/usage-log.json]
+ */
+
+const REFERENCE_ARM = 'altair';
+const METRICS = ['total_tokens', 'turns', 'tool_calls', 'file_reads', 'wallclock_ms'];
+
+/**
+ * @param list<float|int> $numbers
+ */
+function median(array $numbers): float
+{
+    if ($numbers === []) {
+        return 0.0;
+    }
+
+    sort($numbers);
+    $count = \count($numbers);
+    $mid = intdiv($count, 2);
+
+    return $count % 2 === 1
+        ? (float) $numbers[$mid]
+        : (((float) $numbers[$mid - 1]) + ((float) $numbers[$mid])) / 2;
+}
+
+/**
+ * @param array<string, mixed> $record
+ */
+function totalTokens(array $record): int
+{
+    return (int) ($record['input_tokens'] ?? 0) + (int) ($record['output_tokens'] ?? 0);
+}
+
+/**
+ * @param  list<array<string, mixed>> $records
+ * @return array<string, list<array<string, mixed>>>
+ */
+function groupByArm(array $records): array
+{
+    $grouped = [];
+    foreach ($records as $record) {
+        $arm = (string) ($record['arm'] ?? 'unknown');
+        $grouped[$arm][] = $record;
+    }
+    ksort($grouped);
+
+    return $grouped;
+}
+
+/**
+ * @param  list<array<string, mixed>> $runs
+ * @return array<string, mixed>
+ */
+function summarizeArm(string $arm, array $runs): array
+{
+    $series = array_fill_keys(METRICS, []);
+    $passed = 0;
+
+    foreach ($runs as $run) {
+        $series['total_tokens'][] = totalTokens($run);
+        $series['turns'][] = (int) ($run['turns'] ?? 0);
+        $series['tool_calls'][] = (int) ($run['tool_calls'] ?? 0);
+        $series['file_reads'][] = (int) ($run['file_reads'] ?? 0);
+        $series['wallclock_ms'][] = (int) ($run['wallclock_ms'] ?? 0);
+        $passed += ($run['acceptance_pass'] ?? false) === true ? 1 : 0;
+    }
+
+    $summary = ['arm' => $arm, 'runs' => \count($runs)];
+    foreach (METRICS as $metric) {
+        $summary[$metric] = [
+            'median' => median($series[$metric]),
+            'min' => $series[$metric] === [] ? 0 : min($series[$metric]),
+            'max' => $series[$metric] === [] ? 0 : max($series[$metric]),
+        ];
+    }
+    $summary['pass_at_1'] = $runs === [] ? 0.0 : round($passed / \count($runs), 3);
+
+    return $summary;
+}
+
+/**
+ * @param  array<string, array<string, mixed>> $summaries
+ * @return array<string, array<string, float>>
+ */
+function comparisons(array $summaries): array
+{
+    if (!isset($summaries[REFERENCE_ARM])) {
+        return [];
+    }
+
+    $reference = $summaries[REFERENCE_ARM];
+    $comparison = [];
+    foreach ($summaries as $arm => $summary) {
+        if ($arm === REFERENCE_ARM) {
+            continue;
+        }
+        foreach (METRICS as $metric) {
+            $refMedian = (float) $reference[$metric]['median'];
+            $comparison[$arm][$metric] = $refMedian > 0.0
+                ? round(((float) $summary[$metric]['median']) / $refMedian, 2)
+                : 0.0;
+        }
+    }
+
+    return $comparison;
+}
+
+function fail(string $message): never
+{
+    fwrite(STDERR, $message . PHP_EOL);
+    exit(1);
+}
+
+// --- Load ------------------------------------------------------------------
+
+$path = $argv[1] ?? __DIR__ . '/results/usage-log.json';
+if (!is_file($path)) {
+    fail(\sprintf("Usage log '%s' not found. Try: php score.php results/usage-log.sample.json", $path));
+}
+
+$decoded = json_decode((string) file_get_contents($path), true);
+
+/** @var list<array<string, mixed>> $records */
+$records = \is_array($decoded) ? array_values(array_filter($decoded, '\is_array')) : [];
+if ($records === []) {
+    fail(\sprintf("Usage log '%s' is empty or not a JSON array of run records.", $path));
+}
+
+// --- Aggregate -------------------------------------------------------------
+
+$summaries = [];
+foreach (groupByArm($records) as $arm => $runs) {
+    $summaries[$arm] = summarizeArm($arm, $runs);
+}
+
+$results = [
+    'source' => basename($path),
+    'reference_arm' => REFERENCE_ARM,
+    'arms' => array_values($summaries),
+    'comparison' => comparisons($summaries),
+    'note' => 'comparison = baseline median / altair median (higher means a larger Altair advantage)',
+];
+
+$outputDir = __DIR__ . '/results';
+if (!is_dir($outputDir)) {
+    mkdir($outputDir, 0o755, true);
+}
+file_put_contents(
+    $outputDir . '/results.json',
+    json_encode($results, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES) . PHP_EOL,
+);
+
+// --- Report ----------------------------------------------------------------
+
+printf("%-12s %6s %14s %8s %11s %13s %9s%s", 'arm', 'runs', 'tokens(med)', 'turns', 'toolcalls', 'wallclock(s)', 'pass@1', PHP_EOL);
+printf("%s%s", str_repeat('-', 78), PHP_EOL);
+foreach ($summaries as $summary) {
+    printf(
+        "%-12s %6d %14s %8s %11s %13s %9s%s",
+        $summary['arm'],
+        $summary['runs'],
+        number_format($summary['total_tokens']['median']),
+        number_format($summary['turns']['median']),
+        number_format($summary['tool_calls']['median']),
+        number_format($summary['wallclock_ms']['median'] / 1000, 1),
+        number_format($summary['pass_at_1'] * 100, 0) . '%',
+        PHP_EOL,
+    );
+}
+
+$comparison = $results['comparison'];
+if ($comparison !== []) {
+    printf("%sComparison vs '%s' (x = how many times more the baseline spends):%s", PHP_EOL, REFERENCE_ARM, PHP_EOL);
+    foreach ($comparison as $arm => $ratios) {
+        printf(
+            "  %s: %sx tokens, %sx turns, %sx wallclock%s",
+            $arm,
+            $ratios['total_tokens'],
+            $ratios['turns'],
+            $ratios['wallclock_ms'],
+            PHP_EOL,
+        );
+    }
+}
+
+printf("%sWrote %s%s", PHP_EOL, $outputDir . '/results.json', PHP_EOL);
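Review note: `comparisons()` reports `0.0` when the reference median is zero, which reads as "the other arm spent nothing" rather than "undefined". This can happen in practice, for example if the reference arm's median `file_reads` is 0. Below is a hedged alternative for the ratio step, assuming consumers of `results.json` can handle `null`. `safeRatio` is a hypothetical helper, not code from this PR.

```php
<?php

/**
 * Hypothetical variant of the ratio step in comparisons(): returns null
 * instead of 0.0 when the reference median cannot serve as a denominator.
 */
function safeRatio(float $armMedian, float $referenceMedian): ?float
{
    if ($referenceMedian <= 0.0) {
        return null; // undefined ratio; json_encode() emits this as null
    }

    return round($armMedian / $referenceMedian, 2);
}
```

If this route were taken, the report loop would also need to print a placeholder for `null` instead of formatting it as a number.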
diff --git a/benchmarks/tokens-to-ship/task.md b/benchmarks/tokens-to-ship/task.md
@@ -0,0 +1,56 @@
+# Frozen task — Posts API
+
+> This file is the contract. It does not change between runs or arms. Any edit
+> invalidates every prior result. Pin the commit SHA of this file in the report.
+
+## Prompt given to the agent (verbatim)
+
+> Add a **Posts** REST resource to the application with these endpoints:
+>
+> - `POST   /posts`        — create a post
+> - `GET    /posts`        — list posts
+> - `GET    /posts/{id}`   — fetch one post by id
+> - `PUT    /posts/{id}`   — update a post
+> - `DELETE /posts/{id}`   — delete a post
+>
+> A post has: `id` (string, server-assigned), `title` (string, 1–120 chars,
+> required), `body` (string, required), `published` (boolean, default false),
+> `createdAt` (ISO-8601 timestamp, server-assigned).
+>
+> Requirements:
+> 1. Input validation: `title` and `body` required on create; `title` length
+>    1–120; validation failure returns HTTP 422 with per-field errors.
+> 2. Persistence: posts are stored in a database (entity + migration +
+>    repository). Use the project's standard persistence approach.
+> 3. An OpenAPI 3.1 description of all five endpoints.
+> 4. A typed **TypeScript** client for the resource.
+> 5. Tests covering: create happy-path, create validation failure (422),
+>    get-by-id found + not-found (404), and list.
+>
+> Stop when the acceptance suite passes.
+
+## Acceptance criteria (checked by the harness, not the agent)
+
+A run is **complete** only when an external, frozen suite passes. The suite is
+identical for both arms and asserts behavior through the HTTP layer, so
+implementation details are free to differ:
+
+- [ ] `POST /posts` with valid body → `201`, returns the created post incl.
+      `id` and `createdAt`.
+- [ ] `POST /posts` missing `title` → `422`, body has `errors.title`.
+- [ ] `POST /posts` with 121-char `title` → `422`.
+- [ ] `GET /posts` → `200`, array including created posts.
+- [ ] `GET /posts/{id}` for an existing id → `200`, the post.
+- [ ] `GET /posts/{id}` for an unknown id → `404`.
+- [ ] `PUT /posts/{id}` updates fields → `200`, reflects changes.
+- [ ] `DELETE /posts/{id}` → `204`; subsequent `GET` → `404`.
+- [ ] The emitted OpenAPI document is valid 3.1 and contains all five operations.
+- [ ] The emitted TypeScript client **compiles** under `tsc --strict` with zero
+      errors.
+- [ ] The project's own test suite for the feature passes.
+
+## Out of scope (do not build, either arm)
+
+Auth, pagination, sorting, soft-deletes, rate-limiting, a Python client. Keep the
+task identical and minimal so the measurement is about *plumbing cost*, not
+scope creep.
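Review note: to make requirement 1 and the two `422` acceptance checks concrete, here is a minimal sketch of the create-time rules as a pure function. `validateCreatePost` and its error messages are hypothetical. Neither arm is required to structure validation this way, and the suite only asserts behavior through the HTTP layer.

```php
<?php

/**
 * Hypothetical sketch of the create-time rules in the prompt; returns
 * per-field errors (empty array = valid).
 *
 * @param array<string, mixed> $input decoded JSON request body
 * @return array<string, string>
 */
function validateCreatePost(array $input): array
{
    $errors = [];

    $title = $input['title'] ?? null;
    if (!is_string($title)) {
        $errors['title'] = 'title is required';
    } else {
        // Count characters rather than bytes when mbstring is available.
        $length = function_exists('mb_strlen') ? mb_strlen($title, 'UTF-8') : strlen($title);
        if ($length < 1 || $length > 120) {
            $errors['title'] = 'title must be 1 to 120 characters';
        }
    }

    // Assumption: "required" means present and a string; the task sets
    // no length bounds for body.
    if (!is_string($input['body'] ?? null)) {
        $errors['body'] = 'body is required';
    }

    return $errors;
}

// A non-empty result would map to HTTP 422 with a body such as {"errors": {...}}.
```

The boundary cases worth checking are a 120-character title (valid) and a 121-character title (invalid), which match the acceptance checklist.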