From 39434a0819d51db5de37d21fd7346b30a845a8c0 Mon Sep 17 00:00:00 2001
From: Damon Blais <damon.blais@gmail.com>
Date: Thu, 28 May 2026 21:05:39 -0700
Subject: [PATCH] docs(spec): score incomplete runs + minimal-capability test
 tier (refs #23)

Adds a design specification covering failure-reason taxonomy (9-code
enum extending the existing failure_reason/error fields), completeness-
weighted partial scoring formula with worked example, a 12-task
minimal-capability ('dumb model') floor tier using deterministic scorers,
and the result.json / manifest.json schema additions the bakeoff-results
site needs to render failure reasons, partial scores, and incomplete/failed
badges. All new fields are additive; bakeoff-results/v1 compatibility
is preserved.
---
 specs/index.md                                |   1 +
 .../spec.md                                   | 484 ++++++++++++++++++
 2 files changed, 485 insertions(+)
 create mode 100644 specs/score-incomplete-and-dumb-model-tier/spec.md

diff --git a/specs/index.md b/specs/index.md
index 9d81502..7381fab 100644
--- a/specs/index.md
+++ b/specs/index.md
@@ -6,6 +6,7 @@ Implementation specs for the bakeoff benchmark.
 
 | Spec | Status | Summary |
 |------|--------|---------|
+| [score-incomplete-and-dumb-model-tier](score-incomplete-and-dumb-model-tier/spec.md) | Active | Failure-reason taxonomy, completeness-weighted partial scoring, and a minimal-capability floor tier for weak models. |
 
 ## Done
 
diff --git a/specs/score-incomplete-and-dumb-model-tier/spec.md b/specs/score-incomplete-and-dumb-model-tier/spec.md
new file mode 100644
index 0000000..6b7ddce
--- /dev/null
+++ b/specs/score-incomplete-and-dumb-model-tier/spec.md
@@ -0,0 +1,484 @@
+# Score Incomplete Runs And Minimal-Capability Test Tier
+
+Status: Active
+
+Refs: Rethunk-AI/bakeoff#23 (harness), Rethunk-AI/bakeoff-results#9 (origin), Rethunk-AI/bakeoff-results#24 (display side)
+
+## Problem
+
+Models that do not complete the full benchmark suite produce no meaningful output
+today. Their result records carry a free-text `error` string and a `null` score;
+the bundle has no run-level status field; and the manifest carries no aggregate
+signal the results site can use to badge or rank those runs. The site therefore
+has nothing to display for weak or failing models, making the bakeoff a pass/fail
+gate rather than an educational reference.
+
+Three concrete gaps to close:
+
+1. **Failure-reason capture.** The `error` field in records is unstructured; the
+   SQL `failure_reason` column in `run_model_metrics` and the `error_detail`
+   column in `run_queue` are free text. There is no taxonomy, so the site cannot
+   group or filter by failure mode.
+
+2. **Relative scoring.** Incomplete cells have `score: null`. A model that
+   attempted 6 of 10 tasks ranks the same (invisible) as one that refused every
+   prompt.
+
+3. **Minimal-capability floor.** There is no dedicated "dumb model" suite. A
+   model that can do basic arithmetic and follow one-word instructions might still
+   be useful; the current suite does not expose that.
+
+## Goals
+
+- Define a controlled failure-reason taxonomy that supplements the free-text
+  `error` field in result records, maps to the SQL `failure_reason` column, and
+  is preserved in result bundles.
+- Define a completeness-weighted partial-score formula that produces a numeric
+  rank for every model, including those that fail all cells.
+- Specify a minimal-capability test tier (`dumb_model`) — a fixed task list, its
+  scorers, and its reporting path — that runs independently of the main suite and
+  is always included in the bundle.
+- Define the exact `result.json` and `manifest.json` additions so
+  `bakeoff-results` can render failure reasons, partial scores, and an
+  `incomplete`/`failed` state badge without parsing the full result payload for
+  index rendering.
+
+## Non-Goals
+
+- This spec does not implement the changes. No harness code is modified here.
+- It does not change the judge subsystem or pairwise evaluation logic.
+- It does not add retries, resume logic, or queue management (covered in
+  `benchmark-resume-partial-rerun`).
+- It does not define the display-side rendering (Rethunk-AI/bakeoff-results#24
+  owns that); it only defines the schema contract those renderers depend on.
+- It does not modify `SCHEMA_VERSION` to `bakeoff-results/v2`; all additions are
+  additive and optional, preserving backward compatibility with existing bundles.
+
+## Design
+
+### 1. Failure-Reason Taxonomy
+
+Supplement the unstructured `error` string in result records with two fields:
+`failure_code` (controlled enum string) and `failure_detail` (optional free
+text). The `error` field is **retained for backward compatibility**, and its
+value is duplicated into `failure_detail` whenever a code is known. New writers
+must populate `failure_code`; old readers that only inspect `error` still work.
+
+#### Taxonomy enum (`failure_code`)
+
+| Code | Meaning |
+|------|---------|
+| `timeout` | Model did not respond within the configured `timeout_s`. |
+| `refusal` | Model responded but explicitly declined to answer (safety, topic rejection, etc.). The response text is non-empty but contains a refusal marker. |
+| `malformed_output` | Response was received but could not be parsed or scored (e.g., expected JSON but got prose; expected one-word answer but got a paragraph). |
+| `oom` | Out-of-memory signal observed during inference (VRAM or RAM exhaustion; loader error containing OOM markers). |
+| `load_failure` | Model could not be loaded or swapped in at all (binary crash, missing file, incompatible quantization). |
+| `capability_gap` | Heuristic scorer returned 0.0 AND judge (when available) returned 1/5 or "wrong"; aggregated across ≥ 50% of that model's cells, indicating systematic inability rather than single-cell failure. Applied at post-processing time, not per-call. |
+| `infra_error` | Runner-side infrastructure failure unrelated to the model (proxy crash, network error to server, runner OOM from OS). |
+| `cancelled` | Cell was explicitly cancelled (e.g., operator interrupt, `run_queue.status = CANCELLED`). |
+| `unknown` | Exception was caught but does not match any above pattern. Use as a last resort; the full exception string goes into `failure_detail`. |
+
+Detection rules for each code are heuristic (pattern-matching on exception type,
+message, and HTTP status from the proxy). The implementation must document its
+detection regexes in `bench/metrics.py` or a new `bench/failure.py`. The spec
+only mandates the taxonomy.
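+
+As a rough illustration of such heuristic detection, the sketch below maps an
+exception string to a taxonomy code. The `FailureCode` enum mirrors the table
+above; the `classify` helper and its two regexes are hypothetical placeholders,
+not the patterns the harness is required to use.
+
+```python
+import re
+from enum import Enum
+
+
+class FailureCode(str, Enum):
+    """The nine taxonomy values from the table above."""
+    TIMEOUT = "timeout"
+    REFUSAL = "refusal"
+    MALFORMED_OUTPUT = "malformed_output"
+    OOM = "oom"
+    LOAD_FAILURE = "load_failure"
+    CAPABILITY_GAP = "capability_gap"
+    INFRA_ERROR = "infra_error"
+    CANCELLED = "cancelled"
+    UNKNOWN = "unknown"
+
+
+# Hypothetical patterns, checked in order; a real list would cover more codes.
+_PATTERNS = [
+    (re.compile(r"timed out|ReadTimeout", re.IGNORECASE), FailureCode.TIMEOUT),
+    (re.compile(r"out of memory|\bOOM\b", re.IGNORECASE), FailureCode.OOM),
+]
+
+
+def classify(error: str) -> FailureCode:
+    """Return the first matching code, falling back to UNKNOWN."""
+    for pattern, code in _PATTERNS:
+        if pattern.search(error):
+            return code
+    return FailureCode.UNKNOWN
+
+
+print(classify("httpx.ReadTimeout: timed out after 120s").value)  # timeout
+```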
+
+#### Record-level schema change
+
+Existing record (already emitted by `runner.py`):
+
+```json
+{
+  "task_id": "t0000",
+  "prompt_id": "p0",
+  "model_id": "qwen3-8b-q4_k_m",
+  "error": "httpx.ReadTimeout: timed out after 120s"
+}
+```
+
+Extended record (new fields additive; `error` preserved):
+
+```json
+{
+  "task_id": "t0000",
+  "prompt_id": "p0",
+  "model_id": "qwen3-8b-q4_k_m",
+  "error": "httpx.ReadTimeout: timed out after 120s",
+  "failure_code": "timeout",
+  "failure_detail": "httpx.ReadTimeout: timed out after 120s"
+}
+```
+
+Successful records carry `"failure_code": null, "failure_detail": null` (or omit
+the keys entirely — readers must treat absent as null).
+
+The SQL `run_model_metrics.failure_reason TEXT` column is updated to hold the
+`failure_code` enum value (not the full detail). A separate
+`run_model_metrics.failure_detail TEXT` column is added. The
+`run_queue.error_detail` column is unchanged (it remains a free-text exception
+log).
+
+### 2. Partial Score Formula
+
+#### Motivation
+
+A completeness-weighted score penalises models that bail early. A model that
+attempted 6 of 10 tasks and scored 0.8 on those 6 should rank below a model
+that attempted all 10 and scored 0.6, because the former model's effective
+capability across the full matrix is unknown and possibly zero on the skipped
+tasks.
+
+#### Definitions
+
+For a given model `m` across a run with `C` total cells (task × prompt
+combinations):
+
+- `C` = total cells in the matrix (constant per run; known before any model is
+  evaluated).
+- `A(m)` = number of cells attempted by model `m` (i.e., a response was
+  received, even if it scored 0).
+- `S(m)` = sum of per-cell scores for model `m` over attempted cells. Each cell
+  score is the `quality_heuristic` value (float in [0, 1]) when available, else
+  the judge score normalized to [0, 1] via `(judge_score - 1) / 4.0` (for the
+  1–5 rubric), else 0.0 for cells where the error is a hard failure.
+- `completeness(m)` = `A(m) / C` (float in [0, 1]).
+
+**Partial score formula:**
+
+```
+partial_score(m) = S(m) / C
+```
+
+This is equivalent to treating every unattempted cell as scoring 0.0, which is
+the completeness-weighted formulation. Dividing by `C` (not `A(m)`) is the key
+decision: it ensures a model that completed half the matrix and scored perfectly
+on those cells (partial_score = 0.5) ranks below a model that completed the full
+matrix with a 0.6 average (partial_score = 0.6).
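+
+The same quantity can be factored as completeness times the mean score over
+attempted cells, which is why the formulation is called completeness-weighted.
+Here `\bar{s}(m) = S(m) / A(m)` is notation introduced only for this
+illustration and is defined when `A(m) > 0`:
+
+```latex
+\mathrm{partial\_score}(m)
+  = \frac{S(m)}{C}
+  = \frac{A(m)}{C} \cdot \frac{S(m)}{A(m)}
+  = \mathrm{completeness}(m) \cdot \bar{s}(m)
+```
+
+For example, a model with completeness 0.60 and a mean attempted-cell score of
+0.90 gets 0.60 × 0.90 = 0.54.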
+
+**Worked example:**
+
+| Model | C | A(m) | S(m) | partial_score | completeness |
+|-------|---|------|------|---------------|--------------|
+| strong-model | 10 | 10 | 8.2 | 0.82 | 1.00 |
+| middling-model | 10 | 10 | 5.5 | 0.55 | 1.00 |
+| partial-model | 10 | 6 | 5.4 | 0.54 | 0.60 |
+| weak-model | 10 | 4 | 2.0 | 0.20 | 0.40 |
+| failing-model | 10 | 0 | 0.0 | 0.00 | 0.00 |
+
+`partial-model` (attempted 6, `partial_score` 0.54) ranks below `middling-model`
+(attempted all 10, `partial_score` 0.55) despite having a higher
+per-attempted-cell average (0.90 vs 0.55). This is intentional: the harness
+cannot know how the partial model would have scored on the remaining 4 cells,
+and absent data is treated as failure.
+
+#### Run-level status
+
+Each model within a run is assigned one of three statuses based on completeness:
+
+| Status | Condition |
+|--------|-----------|
+| `complete` | `completeness(m) == 1.0` (all cells attempted and no hard load failure). |
+| `incomplete` | `0.0 < completeness(m) < 1.0` (at least one cell attempted, at least one missed). |
+| `failed` | `completeness(m) == 0.0` (no cells attempted; load failure or all errors). |
+
+The **run-level status** is the worst status across all models:
+- Any model `failed` → run status `failed`.
+- Else any model `incomplete` → run status `incomplete`.
+- Else run status `complete`.
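+
+The rollup rules above can be sketched as follows. This is a minimal
+illustration, not harness code: `ModelRollup`, `rollup`, and `run_status` are
+hypothetical names, and treating a run with no models as `complete` is an
+assumption the spec does not state.
+
+```python
+from dataclasses import dataclass
+
+# Ordering used to pick the "worst" status across models.
+_SEVERITY = {"complete": 0, "incomplete": 1, "failed": 2}
+
+
+@dataclass
+class ModelRollup:
+    model_id: str
+    cells_total: int
+    cells_attempted: int
+    completeness: float
+    partial_score: float
+    status: str
+
+
+def rollup(model_id, attempted_scores, cells_total):
+    """Aggregate one model; attempted_scores has one score in [0, 1] per attempted cell."""
+    attempted = len(attempted_scores)
+    if attempted == 0:
+        status = "failed"
+    elif attempted < cells_total:
+        status = "incomplete"
+    else:
+        status = "complete"
+    return ModelRollup(
+        model_id=model_id,
+        cells_total=cells_total,
+        cells_attempted=attempted,
+        completeness=attempted / cells_total,
+        # Dividing by cells_total scores every unattempted cell as 0.0.
+        partial_score=sum(attempted_scores) / cells_total,
+        status=status,
+    )
+
+
+def run_status(rollups):
+    """Worst status across all models (empty run -> "complete", an assumption)."""
+    return max((r.status for r in rollups), key=_SEVERITY.__getitem__, default="complete")
+
+
+partial = rollup("partial-model", [0.9] * 6, cells_total=10)
+strong = rollup("strong-model", [0.82] * 10, cells_total=10)
+print(partial.status, round(partial.partial_score, 2))  # incomplete 0.54
+print(run_status([partial, strong]))  # incomplete
+```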
+
+#### Per-model rollup in `result.json`
+
+A new top-level `model_scores` list is added to the result payload. Each entry
+is a per-model aggregate computed by the runner post-hoc (after all records are
+collected). Existing `records` are not modified beyond the `failure_code` field.
+
+```json
+{
+  "model_scores": [
+    {
+      "model_id": "strong-model",
+      "status": "complete",
+      "cells_total": 10,
+      "cells_attempted": 10,
+      "cells_failed": 0,
+      "completeness": 1.0,
+      "partial_score": 0.82,
+      "dominant_failure_code": null
+    },
+    {
+      "model_id": "partial-model",
+      "status": "incomplete",
+      "cells_total": 10,
+      "cells_attempted": 6,
+      "cells_failed": 4,
+      "completeness": 0.60,
+      "partial_score": 0.54,
+      "dominant_failure_code": "timeout"
+    },
+    {
+      "model_id": "failing-model",
+      "status": "failed",
+      "cells_total": 10,
+      "cells_attempted": 0,
+      "cells_failed": 10,
+      "completeness": 0.0,
+      "partial_score": 0.0,
+      "dominant_failure_code": "load_failure"
+    }
+  ]
+}
+```
+
+`dominant_failure_code` is the most-frequent `failure_code` among that model's
+failed cells (`null` when no cells failed). It gives the site a single badge
+reason without requiring it to scan the full `records` array.
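+
+One way to compute this field is sketched below. The helper name is
+hypothetical, and the tie-breaking behaviour (the code seen first wins, which
+is how `Counter.most_common` orders equal counts) is an assumption; the spec
+does not define tie-breaking.
+
+```python
+from collections import Counter
+
+
+def dominant_failure_code(failure_codes):
+    """failure_codes: one model's per-cell `failure_code` values (None on success)."""
+    counts = Counter(code for code in failure_codes if code is not None)
+    if not counts:
+        return None  # no failed cells
+    return counts.most_common(1)[0][0]
+
+
+codes = ["timeout", None, "timeout", "refusal", None, "timeout"]
+print(dominant_failure_code(codes))  # timeout
+print(dominant_failure_code([None, None]))  # None
+```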
+
+#### Run-level status in `result.json`
+
+A `run_status` field is added to the top-level payload alongside `run_id` and
+`timestamp`. Enum values: `"complete"`, `"incomplete"`, `"failed"`. Absent in
+old bundles; readers must treat absent as `"complete"` for backward
+compatibility.
+
+```json
+{
+  "run_id": "amd-8060s-2026-05-28T12:00:00Z",
+  "timestamp": "2026-05-28T12:00:00Z",
+  "run_status": "incomplete",
+  ...
+}
+```
+
+### 3. Minimal-Capability ("Dumb Model") Test Tier
+
+#### Rationale
+
+The main task suite uses `judge` scorers for code and summarization, which
+require a coherent response the judge can evaluate. A model too weak to pass the
+main suite may still respond coherently to trivial tasks. The floor tier uses
+only deterministic scorers (`exact`, `contains`), so it can score a model that
+is itself the judge model (avoiding a circular dependency) as well as models
+that the judge would rate as 1/5 anyway.
+
+The tier runs as a separate phase **before** the main matrix, so a model that
+crashes the loader during the main phase still has a floor score if it booted at
+all. If the loader fails for all models (run status `failed`), the tier yields
+all-zero floor scores with `failure_code: load_failure`.
+
+#### Tier identifier
+
+Tasks in this tier carry `"tier": "dumb_model"`. The main suite carries
+`"tier": "main"` (absent in old tasks; readers treat absent as `"main"`). The
+`task_categories` table gains a row:
+
+```sql
+INSERT INTO task_categories (name, description) VALUES
+    ('dumb_model', 'Minimal-capability floor suite: basic arithmetic, short summarization, instruction-following. Deterministic scorers only; no judge required.');
+```
+
+#### Task list (fixed, versioned)
+
+The floor suite is a fixed set of 12 tasks, version-pinned by `natural_key_hash`
+(same mechanism as main tasks). They are **not** seeded from `dataset.generate()`
+because their prompts must not change between runs (reproducibility requirement).
+They live in a new file `datasets/dumb_model_tasks.jsonl` committed to the repo.
+
+| # | Domain | Prompt | Expected | Scorer |
+|---|--------|--------|----------|--------|
+| 1 | `arithmetic` | `What is 2 + 2? Answer with one number.` | `4` | `exact` |
+| 2 | `arithmetic` | `What is 10 - 3? Answer with one number.` | `7` | `exact` |
+| 3 | `arithmetic` | `What is 6 × 7? Answer with one number.` | `42` | `exact` |
+| 4 | `arithmetic` | `What is 100 ÷ 4? Answer with one number.` | `25` | `exact` |
+| 5 | `instruction` | `Reply with exactly one word: the color of the sky. One word only.` | `blue` | `exact` |
+| 6 | `instruction` | `Count to 3 and list only the numbers, separated by commas.` | `1, 2, 3` | `contains` |
+| 7 | `instruction` | `Reply YES if you can understand this sentence, NO otherwise.` | `YES` | `exact` |
+| 8 | `instruction` | `What is the opposite of hot? Answer with one word.` | `cold` | `exact` |
+| 9 | `summarize` | `Summarize in 5 words or fewer: The cat sat on the mat.` | `cat sat on mat` | `contains` |
+| 10 | `summarize` | `In one word, what animal says "woof"?` | `dog` | `exact` |
+| 11 | `qa` | `What is the capital of France? Answer with one word.` | `Paris` | `exact` |
+| 12 | `qa` | `What color do you get when you mix red and blue? Answer with one word.` | `purple` | `contains` |
+
+Notes:
+- Tasks 5 and 7 test instruction-following (single-word / yes-no constraint
+  compliance). The scorer is `exact` after case-folding and stripping whitespace.
+- Tasks 6, 9, and 12 use `contains` because surrounding text is acceptable
+  (e.g., "The answer is purple." contains "purple").
+- All 12 use a single prompt variant (no prompt rotation) to keep the tier cheap
+  and reproducible.
+- Expected values are case-folded before comparison by the `exact` scorer
+  (existing behaviour in `metrics.score_heuristic`).
+
+#### Scoring the floor tier
+
+Each of the 12 cells scores 0.0 or 1.0 (binary, no partial credit within a
+cell). The floor score for a model is:
+
+```
+floor_score(m) = (number of dumb_model cells scored 1.0) / 12
+```
+
+Floor score is **not** blended into `partial_score`. They are separate fields
+because they measure different things: `partial_score` measures relative
+performance on the main suite; `floor_score` measures minimal capability. A
+model that passes all 12 floor tasks but fails the main suite entirely would have
+`partial_score: 0.0, floor_score: 1.0`.
+
+Floor records are included in the `records` array alongside main-suite records,
+identified by `"tier": "dumb_model"` on each record.
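+
+A minimal sketch of the floor scoring described above follows. The scorer
+functions are simplified stand-ins, not the existing `metrics.score_heuristic`;
+case-folding inside `contains` and representing a failed cell as a `None`
+response are assumptions made for the example, and the denominator is
+`len(cells)` so the short sample works (the real tier always has 12 cells).
+
+```python
+def _normalize(text):
+    return text.strip().casefold()
+
+
+def score_exact(response, expected):
+    return 1.0 if _normalize(response) == _normalize(expected) else 0.0
+
+
+def score_contains(response, expected):
+    return 1.0 if _normalize(expected) in _normalize(response) else 0.0
+
+
+SCORERS = {"exact": score_exact, "contains": score_contains}
+
+
+def floor_score(cells):
+    """cells: (scorer_name, expected, response) triples; response is None on failure."""
+    passed = 0
+    for scorer_name, expected, response in cells:
+        if response is not None and SCORERS[scorer_name](response, expected) == 1.0:
+            passed += 1
+    return passed, passed / len(cells)
+
+
+cells = [
+    ("exact", "4", " 4\n"),
+    ("exact", "Paris", "paris"),
+    ("contains", "purple", "The answer is purple."),
+    ("exact", "blue", None),  # e.g. a timed-out cell
+]
+print(floor_score(cells))  # (3, 0.75)
+```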
+
+#### Floor score in `model_scores`
+
+The `model_scores` entry gains three fields:
+
+```json
+{
+  "model_id": "failing-model",
+  "status": "failed",
+  "partial_score": 0.0,
+  "floor_score": 0.33,
+  "floor_cells_passed": 4,
+  "floor_cells_total": 12
+}
+```
+
+`floor_score` is `null` when the floor tier was not run (e.g., operator
+explicitly disabled it via a new config flag `dumb_model_tier.enabled: false`).
+The default is enabled.
+
+### 4. Bundle and Manifest Schema Additions
+
+#### `result.json` — new and changed fields
+
+All new fields are **additive and optional**. `validate_result_payload()` in
+`bench/publish.py` does not make them required for `bakeoff-results/v1`
+compatibility; a new validation level (`--strict`) may flag their absence as a
+warning (not an error).
+
+| Field path | Type | Description |
+|------------|------|-------------|
+| `run_status` | `"complete" \| "incomplete" \| "failed"` | Worst model status across the run. New top-level field. |
+| `records[*].failure_code` | `string \| null` | Controlled taxonomy code (see §1). Null on success. |
+| `records[*].failure_detail` | `string \| null` | Optional free-text detail (mirrors legacy `error`). |
+| `records[*].tier` | `"main" \| "dumb_model"` | Which task suite this record belongs to. Absent = `"main"`. |
+| `model_scores` | `array` | Per-model aggregate list (see §2 + §3). New top-level field. |
+| `model_scores[*].model_id` | `string` | Model identifier matching `config.models[*].id`. |
+| `model_scores[*].status` | `"complete" \| "incomplete" \| "failed"` | Per-model completeness status. |
+| `model_scores[*].cells_total` | `integer` | Total main-suite cells for this model. |
+| `model_scores[*].cells_attempted` | `integer` | Cells where a response was received (even if score 0). |
+| `model_scores[*].cells_failed` | `integer` | Cells with a non-null `failure_code`. |
+| `model_scores[*].completeness` | `float [0,1]` | `cells_attempted / cells_total`. |
+| `model_scores[*].partial_score` | `float [0,1]` | Completeness-weighted score (see §2 formula). |
+| `model_scores[*].dominant_failure_code` | `string \| null` | Most frequent `failure_code` among failed cells. |
+| `model_scores[*].floor_score` | `float [0,1] \| null` | Floor tier score (passed cells / 12). Null if tier not run. |
+| `model_scores[*].floor_cells_passed` | `integer \| null` | Number of dumb_model cells that scored 1.0. |
+| `model_scores[*].floor_cells_total` | `integer \| null` | Always 12 when floor tier ran; null otherwise. |
+
+#### `manifest.json` — new fields
+
+The manifest is the lightweight index entry the results site reads without
+parsing the full `result.json`. It needs enough data to render the badge and
+rank in the index page.
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `run_status` | `"complete" \| "incomplete" \| "failed" \| null` | Mirrors `result.json` top-level field. Null in old manifests. |
+| `model_scores_summary` | `array \| null` | Compact per-model summary for the index. Null in old manifests. |
+
+`model_scores_summary` entry shape (minimal, for fast index rendering):
+
+```json
+{
+  "model_id": "partial-model",
+  "status": "incomplete",
+  "partial_score": 0.54,
+  "floor_score": 0.67,
+  "dominant_failure_code": "timeout"
+}
+```
+
+The full `model_scores` array (with cell counts and completeness) stays in
+`result.json`. The summary is a projection of only the fields needed for ranking
+and badging.
+
+`_build_manifest()` in `bench/publish.py` must be updated to populate these two
+fields from the result payload. The `validate_bundle()` function should accept
+both old manifests (fields absent → treated as null) and new ones.
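+
+The projection could look roughly like the sketch below. It is not the actual
+`_build_manifest()` code: `summarize_model_scores` and `SUMMARY_FIELDS` are
+hypothetical names, and the function returns `None` for old payloads that have
+no `model_scores`, matching the "null in old manifests" rule above.
+
+```python
+SUMMARY_FIELDS = (
+    "model_id",
+    "status",
+    "partial_score",
+    "floor_score",
+    "dominant_failure_code",
+)
+
+
+def summarize_model_scores(result):
+    """Project result["model_scores"] down to the fields the index page needs."""
+    scores = result.get("model_scores")
+    if scores is None:
+        return None  # old payload without per-model rollups
+    # Missing keys (e.g. floor_score when the tier was disabled) become None.
+    return [{field: entry.get(field) for field in SUMMARY_FIELDS} for entry in scores]
+
+
+result = {
+    "run_status": "incomplete",
+    "model_scores": [
+        {"model_id": "partial-model", "status": "incomplete", "cells_total": 10,
+         "partial_score": 0.54, "floor_score": 0.67, "dominant_failure_code": "timeout"},
+    ],
+}
+manifest_fields = {
+    "run_status": result.get("run_status"),
+    "model_scores_summary": summarize_model_scores(result),
+}
+print(manifest_fields["model_scores_summary"][0]["partial_score"])  # 0.54
+print(summarize_model_scores({}))  # None
+```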
+
+#### SQL additions
+
+Three changes to `schema/schema.sql`:
+
+1. Add `failure_detail TEXT` column to `run_model_metrics`:
+
+```sql
+-- Extend run_model_metrics (Rethunk-AI/bakeoff#23)
+ALTER TABLE run_model_metrics
+    ADD COLUMN IF NOT EXISTS failure_detail TEXT;
+-- (failure_reason already exists; its values now conform to the taxonomy enum)
+```
+
+2. Add `dumb_model` row to `task_categories` (seed, idempotent):
+
+```sql
+INSERT INTO task_categories (name, description)
+    VALUES ('dumb_model', 'Minimal-capability floor suite: deterministic scorers only.')
+    ON CONFLICT (name) DO NOTHING;
+```
+
+3. Add `run_status` column to `runs`:
+
+```sql
+ALTER TABLE runs
+    ADD COLUMN IF NOT EXISTS run_status TEXT
+        CHECK (run_status IN ('complete', 'incomplete', 'failed'));
+```
+
+### 5. Downstream Contract (bakeoff-results)
+
+This section lists precisely which fields `bakeoff-results` (display side,
+Rethunk-AI/bakeoff-results#24) depends on, keyed by use case.
+
+| Use case | Field(s) consumed |
+|----------|-------------------|
+| Index page: badge run as complete / incomplete / failed | `manifest.json → run_status` |
+| Index page: rank runs / models by score (including partial) | `manifest.json → model_scores_summary[*].partial_score` |
+| Index page: show failure-mode label | `manifest.json → model_scores_summary[*].dominant_failure_code` |
+| Detail page: per-model status chip | `result.json → model_scores[*].status` |
+| Detail page: completeness progress bar | `result.json → model_scores[*].completeness`, `cells_attempted`, `cells_total` |
+| Detail page: per-model floor badge | `result.json → model_scores[*].floor_score`, `floor_cells_passed`, `floor_cells_total` |
+| Detail page: per-cell failure tooltip | `result.json → records[*].failure_code`, `records[*].failure_detail` |
+| Filter/group by failure type | `manifest.json → model_scores_summary[*].dominant_failure_code` |
+| Floor-tier table | `result.json → records` filtered by `tier == "dumb_model"` |
+
+The results site must treat all new fields as **optional**. Absent fields (old
+bundles) fall back to: `run_status → "complete"`, `model_scores_summary → []`,
+`floor_score → null`, `failure_code → null`.
+
+bakeoff-results#21 (result-state badge) and bakeoff-results#24 (display side)
+both depend on `run_status` and `model_scores_summary` being present in the
+manifest; the schema defined here is the upstream authority.
+
+## Acceptance Criteria
+
+- The `failure_code` enum is documented in `bench/` (a module docstring or a new
+  `bench/failure.py`) and maps to the nine taxonomy values defined in §1.
+- Every failed cell in `result.json` carries a non-null `failure_code`.
+- Successful cells carry `failure_code: null` (or omit the field).
+- `partial_score` is computed using the completeness-weighted formula in §2 for
+  every model, including those with zero attempted cells.
+- `floor_score` is computed for every model when the dumb_model tier is enabled.
+- The 12 dumb_model tasks are committed to `datasets/dumb_model_tasks.jsonl` and
+  their content never changes without a version bump (to preserve reproducibility
+  across runs from different dates).
+- `run_status` appears in both `result.json` and `manifest.json`.
+- `model_scores` appears in `result.json`; `model_scores_summary` appears in
+  `manifest.json`.
+- `validate_result_payload()` does not reject old bundles that lack the new
+  fields.
+- `validate_bundle()` does not reject old manifests that lack `run_status` and
+  `model_scores_summary`.
+- A new `--strict` validation flag (or equivalent) warns (not errors) when new
+  fields are absent.
+- Tests cover: taxonomy code detection for each code, partial_score formula with
+  worked example values matching §2, floor tier scoring (all-pass, all-fail,
+  mixed), manifest projection, and backward-compat with old payloads.