docs(spec): score incomplete runs + minimal-capability test tier #24

Merged: AlbinoGeek merged 1 commit into main from spec/score-incomplete-and-dumb-model-tier on May 29, 2026
@AlbinoGeek

Summary

What the spec covers

1. Failure-reason taxonomy — replaces the free-text error field with a controlled failure_code enum (9 values: timeout, refusal, malformed_output, oom, load_failure, capability_gap, infra_error, cancelled, unknown) plus an optional failure_detail string. Maps to the existing run_model_metrics.failure_reason SQL column and the existing error field (preserved for backward compat).

2. Completeness-weighted partial scoring — defines partial_score(m) = S(m) / C (sum of per-cell scores divided by total cells, treating unattempted cells as 0.0). Includes a worked example showing why completeness-weighting ranks a bailing model below one that limped through. Introduces model_scores in result.json (per-model aggregate) and model_scores_summary in manifest.json (lightweight index projection).

3. Minimal-capability floor tier — a fixed 12-task dumb_model suite (arithmetic, instruction-following, short QA, summarization) using only deterministic scorers (exact/contains), no judge required. Floor score is separate from partial score and reported as floor_score in model_scores.

4. Schema additions — all new fields are additive/optional; bakeoff-results/v1 compatibility preserved. Exact field table, types, JSON examples, and a downstream contract section cross-referencing bakeoff-results#9 and bakeoff-results#24.
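
The formula in item 2 can be sketched in Python as follows. This is a minimal illustration rather than the harness implementation: the function name mirrors the spec's `partial_score(m)`, but representing unattempted cells as `None` and the two example models are assumptions made here, not the worked example from the spec itself.

```python
from typing import Optional, Sequence


def partial_score(cell_scores: Sequence[Optional[float]]) -> float:
    """Return S(m) / C for one model.

    `cell_scores` holds one entry per cell in the run (C = len(cell_scores)).
    A value of None marks a cell the model never attempted; it contributes
    0.0 to the sum S(m), as the spec summary describes.
    """
    total_cells = len(cell_scores)
    if total_cells == 0:
        raise ValueError("a run must contain at least one cell")
    score_sum = sum(score for score in cell_scores if score is not None)
    return score_sum / total_cells


# Hypothetical data: a model that aces 3 cells and then bails out,
# versus one that finishes all 10 cells with mediocre scores.
bailing_model = [1.0, 1.0, 1.0] + [None] * 7
limping_model = [0.5] * 10

bailing = partial_score(bailing_model)
limping = partial_score(limping_model)
```

Averaging only over attempted cells would give the bailing model a perfect 1.0; dividing by the total cell count ranks it below the model that completed the run.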

Test plan

  • Review spec for completeness against the requirements of issue #23 (Score incomplete runs + add a minimal-capability ('dumb model') test tier).
  • Verify taxonomy codes cover all observable failure modes in current runs.
  • Check partial score formula with the worked example in §2.
  • Confirm the 12 dumb-model tasks are unambiguous and deterministically scorable.
  • Verify the downstream contract section matches what bakeoff-results#24 needs.
  • Implementation PR (separate) will add tests for each acceptance criterion.

Advances #23 (design spec; implementation tracked separately).

…#23)

Adds a design specification covering failure-reason taxonomy (9-code
enum extending the existing failure_reason/error fields), completeness-
weighted partial scoring formula with worked example, a 12-task
minimal-capability ('dumb model') floor tier using deterministic scorers,
and the result.json / manifest.json schema additions the bakeoff-results
site needs to render failure reasons, partial scores, and incomplete/failed
badges. All new fields are additive; bakeoff-results/v1 compatibility
is preserved.
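
The 9-value taxonomy named in the PR description could be modeled as a Python enum along these lines. This is a sketch under stated assumptions: the class name `FailureCode` and the helper `parse_failure_code` are hypothetical, and mapping unrecognized strings to `unknown` is an illustrative choice, not behavior confirmed by the spec.

```python
from enum import Enum
from typing import Optional


class FailureCode(str, Enum):
    """Controlled failure codes listed in the PR description."""

    TIMEOUT = "timeout"
    REFUSAL = "refusal"
    MALFORMED_OUTPUT = "malformed_output"
    OOM = "oom"
    LOAD_FAILURE = "load_failure"
    CAPABILITY_GAP = "capability_gap"
    INFRA_ERROR = "infra_error"
    CANCELLED = "cancelled"
    UNKNOWN = "unknown"


def parse_failure_code(raw: Optional[str]) -> Optional[FailureCode]:
    """Convert a stored string into a FailureCode.

    None means the cell did not fail. Unrecognized values fall back to
    UNKNOWN (an assumption made for this sketch).
    """
    if raw is None:
        return None
    try:
        return FailureCode(raw)
    except ValueError:
        return FailureCode.UNKNOWN
```

Subclassing `str` keeps the members JSON-serializable as plain strings, which fits the additive, optional fields the description calls for.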
Copilot AI review requested due to automatic review settings May 29, 2026 04:06

Copilot AI left a comment

Pull request overview

This documentation-only PR adds an active design spec for scoring incomplete benchmark runs and introducing a minimal-capability “dumb model” tier, addressing the harness-side requirements from issue #23.

Changes:

  • Defines a controlled failure-code taxonomy and additive result/manifest schema fields.
  • Specifies completeness-weighted partial scoring and per-model/run status aggregation.
  • Proposes a fixed 12-task deterministic floor tier for minimal model capability checks.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `specs/score-incomplete-and-dumb-model-tier/spec.md` | Adds the full design spec for failure reasons, partial scoring, floor-tier tasks, schema additions, and acceptance criteria. |
| `specs/index.md` | Registers the new spec under Active specs. |

| 3 | `arithmetic` | `What is 6 × 7? Answer with one number.` | `42` | `exact` |
| 4 | `arithmetic` | `What is 100 ÷ 4? Answer with one number.` | `25` | `exact` |
| 5 | `instruction` | `Reply with exactly one word: the color of the sky. One word only.` | `blue` | `exact` |
| 6 | `instruction` | `Count to 3 and list only the numbers, separated by commas.` | `1, 2, 3` | `contains` |
| 7 | `instruction` | `Reply YES if you can understand this sentence, NO otherwise.` | `YES` | `exact` |
| 8 | `instruction` | `What is the opposite of hot? Answer with one word.` | `cold` | `exact` |
| 9 | `summarize` | `Summarize in 5 words or fewer: The cat sat on the mat.` | `cat sat on mat` | `contains` |
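
To make the `exact` and `contains` scorer columns in the rows above concrete, here is a minimal Python sketch. The function names and the normalization step (collapsing whitespace and ignoring case) are assumptions for illustration; the excerpt shown here does not state how the harness normalizes responses.

```python
def _normalize(text: str) -> str:
    # Assumed normalization: collapse runs of whitespace and ignore case.
    return " ".join(text.split()).casefold()


def score_exact(response: str, expected: str) -> float:
    """1.0 if the normalized response equals the expected answer, else 0.0."""
    return 1.0 if _normalize(response) == _normalize(expected) else 0.0


def score_contains(response: str, expected: str) -> float:
    """1.0 if the expected answer appears inside the response, else 0.0."""
    return 1.0 if _normalize(expected) in _normalize(response) else 0.0


# Hypothetical dispatch table keyed by the scorer names used in the table.
SCORERS = {"exact": score_exact, "contains": score_contains}


def score_task(scorer: str, response: str, expected: str) -> float:
    return SCORERS[scorer](response, expected)


# Task 3 from the table: an exact numeric answer.
task3 = score_task("exact", "42", "42")
# Task 6: a chatty but correct response still passes `contains`.
task6_ok = score_task("contains", "Sure: 1, 2, 3", "1, 2, 3")
# Task 6 again: the same numbers without spaces do not match the target.
task6_tight = score_task("contains", "1,2,3", "1, 2, 3")
# Task 9: echoing the source sentence does not contain "cat sat on mat".
task9_echo = score_task("contains", "The cat sat on the mat.", "cat sat on mat")
```

The last two cases show why the test plan asks reviewers to confirm each task is unambiguous: a substring scorer is sensitive to spacing and filler words in otherwise reasonable answers.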

#### SQL additions

Two changes to `schema/schema.sql`:
Comment on lines +256 to +261
The main task suite uses `judge` scorers for code and summarization, which
require a coherent response the judge can evaluate. A model too weak to pass the
main suite may still respond coherently to trivial tasks. The floor tier uses
only deterministic scorers (`exact`, `contains`) so it can score models that
produce the judge target model (and so avoids circular dependency) and models
that the judge would rate as 1/5 anyway.
Comment on lines +133 to +134
For a given model `m` across a run with `C` total cells (task × prompt
combinations):
Comment on lines +211 to +213
"cells_total": 10,
"cells_attempted": 6,
"cells_failed": 2,
AlbinoGeek merged commit 3b75829 into main on May 29, 2026
7 of 8 checks passed
AlbinoGeek deleted the spec/score-incomplete-and-dumb-model-tier branch on May 29, 2026 04:10