docs(spec): score incomplete runs + minimal-capability test tier #24
Merged
…#23) Adds a design specification covering failure-reason taxonomy (9-code enum extending the existing failure_reason/error fields), completeness-weighted partial scoring formula with worked example, a 12-task minimal-capability ('dumb model') floor tier using deterministic scorers, and the result.json / manifest.json schema additions the bakeoff-results site needs to render failure reasons, partial scores, and incomplete/failed badges. All new fields are additive; bakeoff-results/v1 compatibility is preserved.
Pull request overview
This documentation-only PR adds an active design spec for scoring incomplete benchmark runs and introducing a minimal-capability “dumb model” tier, addressing the harness-side requirements from issue #23.
Changes:
- Defines a controlled failure-code taxonomy and additive result/manifest schema fields.
- Specifies completeness-weighted partial scoring and per-model/run status aggregation.
- Proposes a fixed 12-task deterministic floor tier for minimal model capability checks.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `specs/score-incomplete-and-dumb-model-tier/spec.md` | Adds the full design spec for failure reasons, partial scoring, floor-tier tasks, schema additions, and acceptance criteria. |
| `specs/index.md` | Registers the new spec under Active specs. |
| 3 | `arithmetic` | `What is 6 × 7? Answer with one number.` | `42` | `exact` |
| 4 | `arithmetic` | `What is 100 ÷ 4? Answer with one number.` | `25` | `exact` |
| 5 | `instruction` | `Reply with exactly one word: the color of the sky. One word only.` | `blue` | `exact` |
| 6 | `instruction` | `Count to 3 and list only the numbers, separated by commas.` | `1, 2, 3` | `contains` |
| 7 | `instruction` | `Reply YES if you can understand this sentence, NO otherwise.` | `YES` | `exact` |
| 8 | `instruction` | `What is the opposite of hot? Answer with one word.` | `cold` | `exact` |
| 9 | `summarize` | `Summarize in 5 words or fewer: The cat sat on the mat.` | `cat sat on mat` | `contains` |

#### SQL additions

Two changes to `schema/schema.sql`:
Comment on lines +256 to +261
The main task suite uses `judge` scorers for code and summarization, which
require a coherent response the judge can evaluate. A model too weak to pass the
main suite may still respond coherently to trivial tasks. The floor tier uses
only deterministic scorers (`exact`, `contains`) so it can score models that
produce the judge target model (and so avoids circular dependency) and models
that the judge would rate as 1/5 anyway.
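To make the deterministic-scorer idea concrete, here is a minimal sketch of what `exact` and `contains` scorers could look like. The function names, the 0.0/1.0 return values, and the normalization step (trim, lowercase, collapse whitespace) are assumptions for illustration only; the spec excerpt above does not say how the harness normalizes responses.

```python
def _normalize(text: str) -> str:
    """Assumed normalization: trim, lowercase, and collapse runs of whitespace."""
    return " ".join(text.strip().lower().split())


def score_exact(response: str, expected: str) -> float:
    """Hypothetical `exact` scorer: 1.0 only if the whole response matches."""
    return 1.0 if _normalize(response) == _normalize(expected) else 0.0


def score_contains(response: str, expected: str) -> float:
    """Hypothetical `contains` scorer: 1.0 if the expected text appears anywhere."""
    return 1.0 if _normalize(expected) in _normalize(response) else 0.0


if __name__ == "__main__":
    # Floor-tier task 3 (`exact`): extra words fail, surrounding whitespace does not.
    print(score_exact(" 42\n", "42"))
    print(score_exact("The answer is 42", "42"))
    # Floor-tier task 6 (`contains`): the expected list may be embedded in a sentence.
    print(score_contains("Sure: 1, 2, 3", "1, 2, 3"))
```

Neither function calls a judge model, so a scorer of this kind yields the same result for the same response on every run.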
Comment on lines +133 to +134
For a given model `m` across a run with `C` total cells (task × prompt
combinations):
Comment on lines +211 to +213
"cells_total": 10,
"cells_attempted": 6,
"cells_failed": 2,
Summary
- `specs/score-incomplete-and-dumb-model-tier/spec.md`: a concrete design spec covering three capabilities the harness must add (refs Score incomplete runs + add a minimal-capability ('dumb model') test tier #23).
- `specs/index.md`: registers the new spec under Active.

What the spec covers
1. Failure-reason taxonomy: replaces the free-text `error` field with a controlled `failure_code` enum (9 values: `timeout`, `refusal`, `malformed_output`, `oom`, `load_failure`, `capability_gap`, `infra_error`, `cancelled`, `unknown`) plus an optional `failure_detail` string. Maps to the existing `run_model_metrics.failure_reason` SQL column and the existing `error` field (preserved for backward compat).
2. Completeness-weighted partial scoring: defines `partial_score(m) = S(m) / C` (sum of per-cell scores divided by total cells, treating unattempted cells as 0.0). Includes a worked example showing why completeness-weighting ranks a bailing model below one that limped through. Introduces `model_scores` in `result.json` (per-model aggregate) and `model_scores_summary` in `manifest.json` (lightweight index projection).
3. Minimal-capability floor tier: a fixed 12-task `dumb_model` suite (arithmetic, instruction-following, short QA, summarization) using only deterministic scorers (`exact` / `contains`), no judge required. Floor score is separate from partial score and reported as `floor_score` in `model_scores`.
4. Schema additions: all new fields are additive/optional; `bakeoff-results/v1` compatibility preserved. Exact field table, types, JSON examples, and a downstream contract section cross-referencing bakeoff-results#9 and bakeoff-results#24.

Test plan
Advances #23 (design spec; implementation tracked separately).
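
As a rough illustration of the completeness-weighted formula `partial_score(m) = S(m) / C` from the description, the sketch below scores two made-up models. The function signature, the use of `None` for an unattempted cell, and the cell scores themselves are assumptions for illustration; they are not the spec's worked example.

```python
from typing import Optional, Sequence


def partial_score(cell_scores: Sequence[Optional[float]]) -> float:
    """Compute S(m) / C, counting unattempted cells (None) as 0.0."""
    total_cells = len(cell_scores)  # C: every cell in the run, attempted or not
    if total_cells == 0:
        raise ValueError("a run must contain at least one cell")
    score_sum = sum(s for s in cell_scores if s is not None)  # S(m)
    return score_sum / total_cells


# Hypothetical 10-cell run (numbers invented for this sketch).
bailer = [1.0, 1.0, 1.0] + [None] * 7  # perfect on 3 cells, then stopped
limper = [0.5] * 10                    # mediocre, but attempted every cell

if __name__ == "__main__":
    print(partial_score(bailer))  # 0.3
    print(partial_score(limper))  # 0.5
```

Averaging over attempted cells only would give the bailing model 1.0 and rank it first; dividing by all `C` cells is what places it below the model that completed the run.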