TwinBench v1 accepts result submissions in artifact form. Each submission must include:
- benchmark version
- system name
- system version
- evaluation date
- scenario-level observations
- per-metric scores
- total score
- scenario coverage
- metric coverage
- evaluator notes
- caveats
Preferred submissions also include:
- links to result artifacts
- links to prompts or run manifests
- explanation of any scenario deviations
- notes on whether evaluation steps were automated or evaluator-mediated
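The required and preferred fields above can be sketched as a single artifact. This is a minimal illustration, assuming JSON serialization and these particular key names; the canonical structure is defined in LEADERBOARD.md, and all values here are hypothetical.

```python
import json

# Illustrative TwinBench v1 submission artifact. Key names and values
# are assumptions for this sketch; see LEADERBOARD.md for the canonical
# result structure.
submission = {
    "benchmark_version": "1.0",
    "system_name": "example-system",      # hypothetical system under test
    "system_version": "0.3.2",
    "evaluation_date": "2024-01-15",
    "scenario_observations": {"scenario_01": "completed without deviation"},
    "per_metric_scores": {"accuracy": 0.82, "robustness": 0.74},
    "total_score": 0.78,
    "scenario_coverage": "12/12 scenarios run",
    "metric_coverage": "all metrics reported",
    "evaluator_notes": "run was evaluator-mediated for scenario_07",
    "caveats": ["scenario_07 required a manual restart mid-run"],
    # Preferred extras:
    "artifact_links": ["https://example.com/results/run-42"],
    "run_manifest_links": ["https://example.com/manifests/run-42"],
    "scenario_deviations": [],
    "automation_notes": "all steps automated except scenario_07",
}

print(json.dumps(submission, indent=2))
```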
When preparing a submission:
- Use the canonical result structure shown in LEADERBOARD.md.
- Keep caveats explicit.
- Do not report benchmark totals without metric-level detail.
- If only part of the scenario set was run, disclose that clearly.
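The guidelines above lend themselves to a small pre-submission check. This is a hedged sketch, not part of TwinBench itself: the function name, field names, and the heuristic for detecting partial coverage are all assumptions.

```python
# Hypothetical pre-submission check enforcing two of the guidelines:
# (1) a benchmark total must come with metric-level detail, and
# (2) partial scenario coverage must be disclosed via caveats.
def check_submission(submission: dict) -> list[str]:
    problems = []
    # Guideline: do not report totals without metric-level detail.
    if "total_score" in submission and not submission.get("per_metric_scores"):
        problems.append("total score reported without per-metric scores")
    # Guideline: disclose partial scenario runs clearly. The string
    # match on "partial" is an illustrative heuristic only.
    coverage = submission.get("scenario_coverage", "")
    if "partial" in coverage.lower() and not submission.get("caveats"):
        problems.append("partial scenario coverage not disclosed in caveats")
    return problems

print(check_submission({"total_score": 0.8}))  # flags the missing metric detail
```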
TwinBench is intended to reward honest reporting over flattering reporting.