Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
78 changes: 78 additions & 0 deletions benchmarks/tokens-to-ship/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
# Tokens-to-Ship benchmark harness

Measures the agent tokens, turns, and wallclock to take the frozen
[`task.md`](task.md) from a cold prompt to a passing acceptance suite, on two
arms: **Altair** vs. a **conventional baseline**.

Read [`docs/benchmarks/tokens-to-ship.md`](../../docs/benchmarks/tokens-to-ship.md)
for the methodology and the honesty guardrails. This README is the operational
"how to run it."

## Layout

```
benchmarks/tokens-to-ship/
├── task.md # the frozen task + acceptance criteria (the contract)
├── README.md # this file
├── score.php # aggregates a usage log -> results/results.json + table
├── acceptance/ # the external suite both arms must pass (you provide)
├── arms/
│ ├── altair/ # starting fixture for the Altair arm
│ └── baseline/ # starting fixture for the baseline arm
└── results/
├── usage-log.sample.json # synthetic example so the scorer runs out-of-the-box
└── results.json # generated by score.php (real runs go here)
```

## Protocol (summary)

1. **Prepare both arms identically.** `composer install`, DB up, etc. — done
*before* measurement starts and excluded from the token count.
2. **Run the agent** on each arm with the same model, settings, and tool budget.
Preferred: drive it with the Claude Agent SDK so usage is captured per turn.
Fallback: run interactively and export the transcript.
3. **Stop** each run when `acceptance/` passes (the harness decides, not the
agent). Record the run as one entry in `results/usage-log.json`.
4. **Repeat** N ≥ 5 times per arm. Cold start each time.
5. **Score:** `php score.php results/usage-log.json`.

## Usage-log format

`results/usage-log.json` is a JSON array of run records. One record per run:

```json
{
"arm": "altair",
"run": 1,
"model": "claude-opus-4-7",
"input_tokens": 8200,
"output_tokens": 1400,
"cache_read_tokens": 0,
"turns": 6,
"tool_calls": 9,
"file_reads": 2,
"wallclock_ms": 41000,
"acceptance_pass": true
}
```

- `input_tokens` / `output_tokens`: summed across every model response in the run.
- `cache_read_tokens`: reported separately so caching flatters neither arm.
- `acceptance_pass`: whether the frozen suite passed on this completed attempt
(feeds pass@1).
- The arm named `altair` is treated as the reference; ratios are
`baseline / altair` (higher = bigger Altair advantage).

## Run the scorer

```bash
php score.php # reads results/usage-log.json
php score.php results/usage-log.sample.json # wiring test with synthetic data
```

It prints a per-arm table and writes `results/results.json` (medians, spread,
pass@1, and the comparison ratios) for charting.

> **The sample log is synthetic** — illustrative numbers to prove the pipeline
> works. Never publish or chart the sample. Real results only, with the raw log
> and the `task.md` commit SHA attached.
13 changes: 13 additions & 0 deletions benchmarks/tokens-to-ship/results/usage-log.sample.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
[
{ "arm": "altair", "run": 1, "model": "claude-opus-4-7", "input_tokens": 8200, "output_tokens": 1400, "cache_read_tokens": 0, "turns": 6, "tool_calls": 9, "file_reads": 2, "wallclock_ms": 41000, "acceptance_pass": true },
{ "arm": "altair", "run": 2, "model": "claude-opus-4-7", "input_tokens": 7900, "output_tokens": 1250, "cache_read_tokens": 0, "turns": 5, "tool_calls": 8, "file_reads": 1, "wallclock_ms": 38500, "acceptance_pass": true },
{ "arm": "altair", "run": 3, "model": "claude-opus-4-7", "input_tokens": 9100, "output_tokens": 1600, "cache_read_tokens": 0, "turns": 7, "tool_calls": 11, "file_reads": 3, "wallclock_ms": 47000, "acceptance_pass": true },
{ "arm": "altair", "run": 4, "model": "claude-opus-4-7", "input_tokens": 8400, "output_tokens": 1350, "cache_read_tokens": 0, "turns": 6, "tool_calls": 9, "file_reads": 2, "wallclock_ms": 42500, "acceptance_pass": true },
{ "arm": "altair", "run": 5, "model": "claude-opus-4-7", "input_tokens": 8000, "output_tokens": 1300, "cache_read_tokens": 0, "turns": 6, "tool_calls": 8, "file_reads": 2, "wallclock_ms": 40000, "acceptance_pass": true },

{ "arm": "baseline", "run": 1, "model": "claude-opus-4-7", "input_tokens": 71000, "output_tokens": 9800, "cache_read_tokens": 0, "turns": 24, "tool_calls": 48, "file_reads": 21, "wallclock_ms": 312000, "acceptance_pass": true },
{ "arm": "baseline", "run": 2, "model": "claude-opus-4-7", "input_tokens": 82000, "output_tokens": 11200, "cache_read_tokens": 0, "turns": 28, "tool_calls": 57, "file_reads": 26, "wallclock_ms": 351000, "acceptance_pass": true },
{ "arm": "baseline", "run": 3, "model": "claude-opus-4-7", "input_tokens": 68000, "output_tokens": 9100, "cache_read_tokens": 0, "turns": 22, "tool_calls": 44, "file_reads": 19, "wallclock_ms": 289000, "acceptance_pass": false },
{ "arm": "baseline", "run": 4, "model": "claude-opus-4-7", "input_tokens": 90000, "output_tokens": 12500, "cache_read_tokens": 0, "turns": 31, "tool_calls": 62, "file_reads": 29, "wallclock_ms": 378000, "acceptance_pass": true },
{ "arm": "baseline", "run": 5, "model": "claude-opus-4-7", "input_tokens": 76000, "output_tokens": 10400, "cache_read_tokens": 0, "turns": 26, "tool_calls": 52, "file_reads": 24, "wallclock_ms": 333000, "acceptance_pass": true }
]
203 changes: 203 additions & 0 deletions benchmarks/tokens-to-ship/score.php
Original file line number Diff line number Diff line change
@@ -0,0 +1,203 @@
<?php

declare(strict_types=1);

/*
* This file is part of the univeros/framework
*
* For the full copyright and license information, please view
* the LICENSE file that was distributed with this source code.
*/

/*
* Tokens-to-Ship scorer.
*
* Aggregates a usage log (see README.md) into per-arm medians + spread, pass@1,
* and baseline/altair comparison ratios. Writes results/results.json and prints
* a table. Pure reporting: it never runs the agent or the acceptance suite.
*
* php score.php [path/to/usage-log.json]
*/

const REFERENCE_ARM = 'altair';
const METRICS = ['total_tokens', 'turns', 'tool_calls', 'file_reads', 'wallclock_ms'];

/**
* @param list<float|int> $numbers
*/
function median(array $numbers): float
{
if ($numbers === []) {
return 0.0;
}

sort($numbers);
$count = \count($numbers);
$mid = intdiv($count, 2);

return $count % 2 === 1
? (float) $numbers[$mid]
: (((float) $numbers[$mid - 1]) + ((float) $numbers[$mid])) / 2;
}

/**
* @param array<string, mixed> $record
*/
function totalTokens(array $record): int
{
return (int) ($record['input_tokens'] ?? 0) + (int) ($record['output_tokens'] ?? 0);
}

/**
* @param list<array<string, mixed>> $records
* @return array<string, list<array<string, mixed>>>
*/
function groupByArm(array $records): array
{
$grouped = [];
foreach ($records as $record) {
$arm = (string) ($record['arm'] ?? 'unknown');
$grouped[$arm][] = $record;
}
ksort($grouped);

return $grouped;
}

/**
* @param list<array<string, mixed>> $runs
* @return array<string, mixed>
*/
function summarizeArm(string $arm, array $runs): array
{
$series = array_fill_keys(METRICS, []);
$passed = 0;

foreach ($runs as $run) {
$series['total_tokens'][] = totalTokens($run);
$series['turns'][] = (int) ($run['turns'] ?? 0);
$series['tool_calls'][] = (int) ($run['tool_calls'] ?? 0);
$series['file_reads'][] = (int) ($run['file_reads'] ?? 0);
$series['wallclock_ms'][] = (int) ($run['wallclock_ms'] ?? 0);
$passed += ($run['acceptance_pass'] ?? false) === true ? 1 : 0;
}

$summary = ['arm' => $arm, 'runs' => \count($runs)];
foreach (METRICS as $metric) {
$summary[$metric] = [
'median' => median($series[$metric]),
'min' => $series[$metric] === [] ? 0 : min($series[$metric]),
'max' => $series[$metric] === [] ? 0 : max($series[$metric]),
];
}
$summary['pass_at_1'] = $runs === [] ? 0.0 : round($passed / \count($runs), 3);

return $summary;
}

/**
* @param array<string, array<string, mixed>> $summaries
* @return array<string, array<string, float>>
*/
function comparisons(array $summaries): array
{
if (!isset($summaries[REFERENCE_ARM])) {
return [];
}

$reference = $summaries[REFERENCE_ARM];
$comparison = [];
foreach ($summaries as $arm => $summary) {
if ($arm === REFERENCE_ARM) {
continue;
}
foreach (METRICS as $metric) {
$refMedian = (float) $reference[$metric]['median'];
$comparison[$arm][$metric] = $refMedian > 0.0
? round(((float) $summary[$metric]['median']) / $refMedian, 2)
: 0.0;
}
}

return $comparison;
}

function fail(string $message): never
{
fwrite(STDERR, $message . PHP_EOL);
exit(1);
}

// --- Load ------------------------------------------------------------------

$path = $argv[1] ?? __DIR__ . '/results/usage-log.json';
if (!is_file($path)) {
fail(\sprintf("Usage log '%s' not found. Try: php score.php results/usage-log.sample.json", $path));
}

$decoded = json_decode((string) file_get_contents($path), true);
if (!\is_array($decoded) || $decoded === []) {
fail(\sprintf("Usage log '%s' is empty or not a JSON array of run records.", $path));
}

/** @var list<array<string, mixed>> $records */
$records = array_values(array_filter($decoded, '\is_array'));

// --- Aggregate -------------------------------------------------------------

$summaries = [];
foreach (groupByArm($records) as $arm => $runs) {
$summaries[$arm] = summarizeArm($arm, $runs);
}

$results = [
'source' => basename($path),
'reference_arm' => REFERENCE_ARM,
'arms' => array_values($summaries),
'comparison' => comparisons($summaries),
'note' => 'comparison = baseline median / altair median (higher means a larger Altair advantage)',
];

$outputDir = __DIR__ . '/results';
if (!is_dir($outputDir)) {
mkdir($outputDir, 0o755, true);
}
file_put_contents(
$outputDir . '/results.json',
json_encode($results, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES) . PHP_EOL,
);

// --- Report ----------------------------------------------------------------

printf("%-12s %6s %14s %8s %11s %13s %9s%s", 'arm', 'runs', 'tokens(med)', 'turns', 'toolcalls', 'wallclock(s)', 'pass@1', PHP_EOL);
printf("%s%s", str_repeat('-', 78), PHP_EOL);
foreach ($summaries as $summary) {
printf(
"%-12s %6d %14s %8s %11s %13s %9s%s",
$summary['arm'],
$summary['runs'],
number_format($summary['total_tokens']['median']),
number_format($summary['turns']['median']),
number_format($summary['tool_calls']['median']),
number_format($summary['wallclock_ms']['median'] / 1000, 1),
number_format($summary['pass_at_1'] * 100, 0) . '%',
PHP_EOL,
);
}

$comparison = $results['comparison'];
if ($comparison !== []) {
printf("%sComparison vs '%s' (x = how many times more the baseline spends):%s", PHP_EOL, REFERENCE_ARM, PHP_EOL);
foreach ($comparison as $arm => $ratios) {
printf(
" %s: %sx tokens, %sx turns, %sx wallclock%s",
$arm,
$ratios['total_tokens'],
$ratios['turns'],
$ratios['wallclock_ms'],
PHP_EOL,
);
}
}

printf("%sWrote %s%s", PHP_EOL, $outputDir . '/results.json', PHP_EOL);
56 changes: 56 additions & 0 deletions benchmarks/tokens-to-ship/task.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
# Frozen task — Posts API

> This file is the contract. It does not change between runs or arms. Any edit
> invalidates every prior result. Pin the commit SHA of this file in the report.

## Prompt given to the agent (verbatim)

> Add a **Posts** REST resource to the application with these endpoints:
>
> - `POST /posts` — create a post
> - `GET /posts` — list posts
> - `GET /posts/{id}` — fetch one post by id
> - `PUT /posts/{id}` — update a post
> - `DELETE /posts/{id}` — delete a post
>
> A post has: `id` (string, server-assigned), `title` (string, 1–120 chars,
> required), `body` (string, required), `published` (boolean, default false),
> `createdAt` (ISO-8601 timestamp, server-assigned).
>
> Requirements:
> 1. Input validation: `title` and `body` required on create; `title` length
> 1–120; validation failure returns HTTP 422 with per-field errors.
> 2. Persistence: posts are stored in a database (entity + migration +
> repository). Use the project's standard persistence approach.
> 3. An OpenAPI 3.1 description of all five endpoints.
> 4. A typed **TypeScript** client for the resource.
> 5. Tests covering: create happy-path, create validation failure (422),
> get-by-id found + not-found (404), and list.
>
> Stop when the acceptance suite passes.

## Acceptance criteria (checked by the harness, not the agent)

A run is **complete** only when an external, frozen suite passes. The suite is
identical for both arms and asserts behavior through the HTTP layer, so
implementation details are free to differ:

- [ ] `POST /posts` with valid body → `201`, returns the created post incl.
`id` and `createdAt`.
- [ ] `POST /posts` missing `title` → `422`, body has `errors.title`.
- [ ] `POST /posts` with 121-char `title` → `422`.
- [ ] `GET /posts` → `200`, array including created posts.
- [ ] `GET /posts/{id}` for an existing id → `200`, the post.
- [ ] `GET /posts/{id}` for an unknown id → `404`.
- [ ] `PUT /posts/{id}` updates fields → `200`, reflects changes.
- [ ] `DELETE /posts/{id}` → `204`; subsequent `GET` → `404`.
- [ ] The emitted OpenAPI document is valid 3.1 and contains all five operations.
- [ ] The emitted TypeScript client **compiles** under `tsc --strict` with zero
errors.
- [ ] The project's own test suite for the feature passes.

## Out of scope (do not build, either arm)

Auth, pagination, sorting, soft-deletes, rate-limiting, a Python client. Keep the
task identical and minimal so the measurement is about *plumbing cost*, not
scope creep.
Loading
Loading