Submitting TwinBench Results

TwinBench v1 accepts result submissions in artifact form.

Minimum Submission Contents

  • benchmark version
  • system name
  • system version
  • evaluation date
  • scenario-level observations
  • per-metric scores
  • total score
  • scenario coverage
  • metric coverage
  • evaluator notes
  • caveats
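
As a rough illustration, the checklist above can be turned into a pre-submission check. This sketch is not the canonical result structure, which is defined in LEADERBOARD.md. The key names (such as `benchmark_version`) and the helper `missing_fields` are hypothetical, chosen only to mirror the list.

```python
# Hypothetical key names mirroring the minimum-contents list above;
# the canonical structure lives in LEADERBOARD.md and may differ.
REQUIRED_FIELDS = (
    "benchmark_version",
    "system_name",
    "system_version",
    "evaluation_date",
    "scenario_observations",
    "metric_scores",
    "total_score",
    "scenario_coverage",
    "metric_coverage",
    "evaluator_notes",
    "caveats",
)


def missing_fields(submission: dict) -> list[str]:
    """Return the required fields that are absent or None, in checklist order."""
    return [name for name in REQUIRED_FIELDS if submission.get(name) is None]
```

An empty list for `caveats` counts as present here, so a submission can state explicitly that it has none rather than omitting the field.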

Evidence Expectations

Preferred submissions also include:

  • links to result artifacts
  • links to prompts or run manifests
  • explanation of any scenario deviations
  • notes on whether evaluation steps were automated or evaluator-mediated

Submission Guidance

  • Use the canonical result structure shown in LEADERBOARD.md.
  • Keep caveats explicit.
  • Do not report benchmark totals without metric-level detail.
  • If only part of the scenario set was run, disclose that clearly.
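
Two of these guidance points lend themselves to an automated check. The following is a minimal sketch under stated assumptions: the keys `total_score`, `metric_scores`, `scenarios_run`, and `caveats` are hypothetical, and the size of the full scenario set is passed in as a parameter because it is not stated here.

```python
def guidance_problems(submission: dict, scenarios_total: int) -> list[str]:
    """Flag a total without metric detail and undisclosed partial coverage.

    All dictionary keys are illustrative assumptions, not the canonical
    structure from LEADERBOARD.md.
    """
    problems = []

    # A benchmark total should not be reported without metric-level detail.
    if submission.get("total_score") is not None and not submission.get("metric_scores"):
        problems.append("total score reported without metric-level detail")

    # If only part of the scenario set was run, the caveats should say so.
    scenarios_run = submission.get("scenarios_run", [])
    if len(scenarios_run) < scenarios_total and not submission.get("caveats"):
        problems.append("partial scenario coverage is not disclosed in caveats")

    return problems
```

A check like this can only confirm that a caveat exists, not that it is accurate, so it supplements evaluator review rather than replacing it.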

TwinBench is intended to reward honest reporting over flattering reporting.