docs(spec): score incomplete runs + minimal-capability test tier by AlbinoGeek · Pull Request #24 · Rethunk-AI/bakeoff

AlbinoGeek · 2026-05-29T04:06:24Z

Summary

Adds specs/score-incomplete-and-dumb-model-tier/spec.md — a concrete design spec covering three capabilities the harness must add (refs Score incomplete runs + add a minimal-capability ('dumb model') test tier #23).
Registers the spec in specs/index.md under Active.
No harness code is modified; this is a design-only PR.

What the spec covers

1. Failure-reason taxonomy — replaces the free-text error field with a controlled failure_code enum (9 values: timeout, refusal, malformed_output, oom, load_failure, capability_gap, infra_error, cancelled, unknown) plus an optional failure_detail string. Maps to the existing run_model_metrics.failure_reason SQL column and the existing error field (preserved for backward compat).

2. Completeness-weighted partial scoring — defines partial_score(m) = S(m) / C (sum of per-cell scores divided by total cells, treating unattempted cells as 0.0). Includes a worked example showing why completeness-weighting ranks a bailing model below one that limped through. Introduces model_scores in result.json (per-model aggregate) and model_scores_summary in manifest.json (lightweight index projection).

3. Minimal-capability floor tier — a fixed 12-task dumb_model suite (arithmetic, instruction-following, short QA, summarization) using only deterministic scorers (exact/contains), no judge required. Floor score is separate from partial score and reported as floor_score in model_scores.

4. Schema additions — all new fields are additive/optional; bakeoff-results/v1 compatibility preserved. Exact field table, types, JSON examples, and a downstream contract section cross-referencing bakeoff-results#9 and bakeoff-results#24.
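As an illustration of item 1, the controlled `failure_code` enum could be modelled roughly as follows. This is a minimal sketch, not the spec's implementation: the nine values come from the description above, while `FailureCode`, `CellFailure`, and `parse_failure_code` are hypothetical names, and the fallback to `unknown` for unrecognized input is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureCode(str, Enum):
    """The nine failure_code values listed in the PR description."""

    TIMEOUT = "timeout"
    REFUSAL = "refusal"
    MALFORMED_OUTPUT = "malformed_output"
    OOM = "oom"
    LOAD_FAILURE = "load_failure"
    CAPABILITY_GAP = "capability_gap"
    INFRA_ERROR = "infra_error"
    CANCELLED = "cancelled"
    UNKNOWN = "unknown"


@dataclass
class CellFailure:
    # Hypothetical container; the spec's actual field layout may differ.
    failure_code: FailureCode
    failure_detail: Optional[str] = None  # optional free-text detail
    error: Optional[str] = None  # legacy field, kept for backward compat


def parse_failure_code(raw: Optional[str]) -> FailureCode:
    """Map a stored string to a FailureCode.

    Assumption: missing or unrecognized values degrade to `unknown`
    instead of raising; the spec may choose stricter behaviour.
    """
    if raw is None:
        return FailureCode.UNKNOWN
    try:
        return FailureCode(raw.strip().lower())
    except ValueError:
        return FailureCode.UNKNOWN
```

Subclassing `str` keeps the values directly JSON-serializable, which seems to fit the additive `result.json` fields, though the spec does not state how the enum is represented in code.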

Test plan

Review spec for completeness against issue Score incomplete runs + add a minimal-capability ('dumb model') test tier #23 requirements.
Verify taxonomy codes cover all observable failure modes in current runs.
Check partial score formula with the worked example in §2.
Confirm the 12 dumb-model tasks are unambiguous and deterministically scorable.
Verify the downstream contract section matches what bakeoff-results#24 needs.
Implementation PR (separate) will add tests for each acceptance criterion.

Advances #23 (design spec; implementation tracked separately).

…#23) Adds a design specification covering failure-reason taxonomy (9-code enum extending the existing failure_reason/error fields), completeness-weighted partial scoring formula with worked example, a 12-task minimal-capability ('dumb model') floor tier using deterministic scorers, and the result.json / manifest.json schema additions the bakeoff-results site needs to render failure reasons, partial scores, and incomplete/failed badges. All new fields are additive; bakeoff-results/v1 compatibility is preserved.

Copilot

Pull request overview

This documentation-only PR adds an active design spec for scoring incomplete benchmark runs and introducing a minimal-capability “dumb model” tier, addressing the harness-side requirements from issue #23.

Changes:

Defines a controlled failure-code taxonomy and additive result/manifest schema fields.
Specifies completeness-weighted partial scoring and per-model/run status aggregation.
Proposes a fixed 12-task deterministic floor tier for minimal model capability checks.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `specs/score-incomplete-and-dumb-model-tier/spec.md` | Adds the full design spec for failure reasons, partial scoring, floor-tier tasks, schema additions, and acceptance criteria. |
| `specs/index.md` | Registers the new spec under Active specs. |

+| 3 | `arithmetic` | `What is 6 × 7? Answer with one number.` | `42` | `exact` |
+| 4 | `arithmetic` | `What is 100 ÷ 4? Answer with one number.` | `25` | `exact` |
+| 5 | `instruction` | `Reply with exactly one word: the color of the sky. One word only.` | `blue` | `exact` |
+| 6 | `instruction` | `Count to 3 and list only the numbers, separated by commas.` | `1, 2, 3` | `contains` |


+| 6 | `instruction` | `Count to 3 and list only the numbers, separated by commas.` | `1, 2, 3` | `contains` |
+| 7 | `instruction` | `Reply YES if you can understand this sentence, NO otherwise.` | `YES` | `exact` |
+| 8 | `instruction` | `What is the opposite of hot? Answer with one word.` | `cold` | `exact` |
+| 9 | `summarize` | `Summarize in 5 words or fewer: The cat sat on the mat.` | `cat sat on mat` | `contains` |


+
+#### SQL additions
+
+Two changes to `schema/schema.sql`:


+The main task suite uses `judge` scorers for code and summarization, which
+require a coherent response the judge can evaluate. A model too weak to pass the
+main suite may still respond coherently to trivial tasks. The floor tier uses
+only deterministic scorers (`exact`, `contains`) so it can score models that
+produce the judge target model (and so avoids circular dependency) and models
+that the judge would rate as 1/5 anyway.
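The two deterministic scorers named in this hunk, `exact` and `contains`, might look something like the sketch below. The function names and the normalization step (lowercasing and collapsing whitespace) are assumptions for illustration; the quoted diff does not say how the harness normalizes responses.

```python
def _normalize(text: str) -> str:
    # Assumption: scoring ignores case and leading/trailing/repeated whitespace.
    return " ".join(text.lower().split())


def score_exact(response: str, expected: str) -> float:
    """Return 1.0 when the normalized response equals the expected string."""
    return 1.0 if _normalize(response) == _normalize(expected) else 0.0


def score_contains(response: str, expected: str) -> float:
    """Return 1.0 when the normalized expected string occurs in the response."""
    return 1.0 if _normalize(expected) in _normalize(response) else 0.0


if __name__ == "__main__":
    print(score_exact(" Blue\n", "blue"))
    print(score_contains("Sure: 1, 2, 3", "1, 2, 3"))
    print(score_contains("1,2,3", "1, 2, 3"))
```

Under these assumed rules, a response of `1,2,3` would not match the expected `1, 2, 3`, so the test-plan item about tasks being unambiguous and deterministically scorable depends on whichever normalization the harness actually applies.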


+For a given model `m` across a run with `C` total cells (task × prompt
+combinations):
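The formula from the PR description, partial_score(m) = S(m) / C with unattempted cells counted as 0.0, can be sketched as follows. The function signature, the use of `None` for an unattempted cell, the zero-cell guard, and the example numbers are all assumptions for illustration and are not taken from the spec's worked example.

```python
from typing import Optional, Sequence


def partial_score(cell_scores: Sequence[Optional[float]]) -> float:
    """Compute partial_score(m) = S(m) / C.

    `cell_scores` holds one entry per cell in the run, so its length is C.
    None marks an unattempted cell and contributes 0.0 to S(m).
    """
    total_cells = len(cell_scores)
    if total_cells == 0:
        # Assumption: an empty run scores 0.0 rather than dividing by zero.
        return 0.0
    summed = sum(score if score is not None else 0.0 for score in cell_scores)
    return summed / total_cells


# Hypothetical models on a 10-cell run (per-cell scores assumed in [0, 1]).
bailer = [1.0, 1.0] + [None] * 8  # perfect on 2 cells, then gave up
limper = [0.5] * 10  # attempted every cell at half credit

if __name__ == "__main__":
    print(partial_score(bailer), partial_score(limper))
```

Averaging only over attempted cells would give the bailing model 1.0 and the limping model 0.5; dividing by all `C` cells reverses that order (0.2 versus 0.5), which is the ranking behaviour the description attributes to completeness-weighting.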


+      "cells_total": 10,
+      "cells_attempted": 6,
+      "cells_failed": 2,


Copilot AI review requested due to automatic review settings May 29, 2026 04:06

Copilot started reviewing on behalf of AlbinoGeek May 29, 2026 04:06

Copilot AI reviewed May 29, 2026

AlbinoGeek merged commit 3b75829 into main May 29, 2026
7 of 8 checks passed

AlbinoGeek deleted the spec/score-incomplete-and-dumb-model-tier branch May 29, 2026 04:10

AlbinoGeek merged 1 commit into main from spec/score-incomplete-and-dumb-model-tier
