<div class="eyebrow">The open benchmark for personal AI assistants</div>
<h1>TwinBench</h1>
<p class="lede">TwinBench measures whether an AI system can behave like a real personal AI assistant: remember, act, follow up, stay safe, and operate over time.</p>
+<p class="lede" style="max-width:42ch"><strong>Can your personal AI assistant beat TwinBench?</strong> Start with the reference result, then run the demo or benchmark your own system.</p>
<div class="actions">
<a class="button" href="results/{top['slug']}/index.html">See the Reference Result</a>
-<a class="button secondary" href="submit/index.html">Submit Your Assistant</a>
+<a class="button secondary" href="submit/index.html">Benchmark Your Assistant</a>
<p>The board is one place, but every result keeps its class, coverage, and evidence story. Trust comes from artifacts, not claims.</p>
+<p>The public board shows the current reference result and challenge-worthy artifacts. Historical and degraded runs stay available, but they do not dominate the first impression.</p>
<p><strong>Why it matters:</strong> {why_matters}</p>
+</div>
+<div class="panel">
+<div class="eyebrow">Evidence</div>
+<h2>How to read it</h2>
+<p>Use the coverage-adjusted verified score for public comparison, the verified raw score for direct measurement strength, and measured coverage to understand how much of the benchmark was truly exercised.</p>
<div class="card"><h3>Coverage matters</h3><p>The headline ranking number is coverage-adjusted verified score, not the most flattering number in the artifact.</p></div>
<div class="card"><h3>Trust over hype</h3><p>Unsupported surfaces, missing bootstrap, and partial measurement are reported explicitly instead of flattened into a false failure.</p></div>
</section>
+
+<section class="panel prose section">
+<h2>What the headline numbers mean</h2>
+<p><strong>Verified</strong> is what the run directly proved. <strong>Projected</strong> is the broader estimate with explicit assumptions. <strong>Measured coverage</strong> tells you how much of the benchmark was directly exercised.</p>
+<p>TwinBench uses the <strong>coverage-adjusted verified score</strong> for public ranking because it rewards both strength and honest measurement.</p>
+<h2>Why unavailable is not failure</h2>
+<p>Some systems do not expose the runtime surfaces needed for a fair direct measurement. TwinBench records that explicitly instead of pretending they cleanly failed a dimension.</p>
+</section>
""",
+depth=1,
)
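The ranking metric described above, the coverage-adjusted verified score, is referenced but never defined in this diff. As a minimal sketch only, assuming the adjustment simply scales the verified raw score by measured coverage (the real TwinBench formula may differ):

```python
def coverage_adjusted(verified_raw: float, measured_coverage: float) -> float:
    """Hypothetical coverage adjustment: scale the verified raw score (0-1)
    by measured coverage (0-1). Assumed formula, not taken from this diff."""
    if not (0.0 <= verified_raw <= 1.0 and 0.0 <= measured_coverage <= 1.0):
        raise ValueError("score and coverage must be in [0, 1]")
    return verified_raw * measured_coverage

# A flattering raw score with weak coverage ranks below a strong,
# deeply measured artifact:
shallow = coverage_adjusted(0.95, 0.40)  # strong-looking, barely exercised
deep = coverage_adjusted(0.80, 0.90)     # solid and deeply measured
assert deep > shallow
```

Under this assumed multiplicative form, the metric rewards both strength and honest measurement, matching the stated ranking rationale.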
@@ -463,10 +515,15 @@ def _render_faq() -> str:
<p>No. Nullalis is the current reference runtime because it produced the first strong public artifact.</p>
<h2>Can I run TwinBench quickly?</h2>
<p>Yes. Use the demo path from the repo or run against a native runtime with one command.</p>
+<h2>Why can some dimensions be unavailable?</h2>
+<p>Because some systems do not expose the runtime surfaces required for a fair direct measurement. TwinBench shows that honestly rather than hiding it.</p>
+<h2>Why does coverage matter?</h2>
+<p>Coverage shows how much of the benchmark was truly exercised. A flattering score with weak coverage should not outrank a strong, deeply measured artifact.</p>
<h2>What if my assistant only supports part of the benchmark?</h2>
<p>That is still useful. TwinBench prefers honest partial artifacts over fake comparability.</p>
</section>
""",
+depth=1,
)
@@ -494,6 +551,7 @@ def _render_submit() -> str:
<p><a href="https://github.com/ProjectNuggets/DTaaS-benchmark/issues/new?template=submit-results.md">Open a results submission</a></p>
<p class="lede">Use this page when you want a clean side-by-side view instead of two tabs and a screenshot.</p>
+<p class="lede">Use this page when you want a clean side-by-side view instead of two tabs and a screenshot. The reference runtime is preselected to make comparison fast.</p>
<p class="lede">Use this page when you want a clean side-by-side view instead of two tabs and a screenshot.</p>
+<p class="lede">Use this page when you want a clean side-by-side view instead of two tabs and a screenshot. The reference runtime is preselected to make comparison fast.</p>
</section>
<section class="panel prose">
<label for="left">Left</label>
-<select id="left"><option value='nullalis-live-2026-03-25-openended'>Nullalis local openended race (nullalis-live-2026-03-25-openended)</option><option value='nullalis-local-2026-03-24'>Nullalis local live (nullalis-local-2026-03-24)</option><option value='twinbench-demo-runtime'>TwinBench Demo Runtime (twinbench-demo-runtime)</option><option value='nullalis-live-2026-03-25'>Nullalis local live (nullalis-live-2026-03-25)</option><option value='nullalis-scale-probe'>Nullalis scale probe (nullalis-scale-probe)</option><option value='nullalis-targeted-2026-03-24'>Nullalis local live (nullalis-targeted-2026-03-24)</option></select>
+<select id="left"><option value='nullalis-live-2026-03-25-openended'>Nullalis Reference Runtime (nullalis-live-2026-03-25-openended)</option><option value='nullalis-local-2026-03-24'>Nullalis Baseline (nullalis-local-2026-03-24)</option><option value='twinbench-demo-runtime'>TwinBench Demo Runtime (twinbench-demo-runtime)</option><option value='nullalis-live-2026-03-25'>Nullalis Auth-Failed Run (nullalis-live-2026-03-25)</option><option value='nullalis-scale-probe'>Nullalis Scale Fairness Probe (nullalis-scale-probe)</option><option value='nullalis-targeted-2026-03-24'>Nullalis Degraded Recovery Run (nullalis-targeted-2026-03-24)</option></select>
<label for="right">Right</label>
-<select id="right"><option value='nullalis-live-2026-03-25-openended'>Nullalis local openended race (nullalis-live-2026-03-25-openended)</option><option value='nullalis-local-2026-03-24'>Nullalis local live (nullalis-local-2026-03-24)</option><option value='twinbench-demo-runtime'>TwinBench Demo Runtime (twinbench-demo-runtime)</option><option value='nullalis-live-2026-03-25'>Nullalis local live (nullalis-live-2026-03-25)</option><option value='nullalis-scale-probe'>Nullalis scale probe (nullalis-scale-probe)</option><option value='nullalis-targeted-2026-03-24'>Nullalis local live (nullalis-targeted-2026-03-24)</option></select>
+<select id="right"><option value='nullalis-live-2026-03-25-openended'>Nullalis Reference Runtime (nullalis-live-2026-03-25-openended)</option><option value='nullalis-local-2026-03-24'>Nullalis Baseline (nullalis-local-2026-03-24)</option><option value='twinbench-demo-runtime'>TwinBench Demo Runtime (twinbench-demo-runtime)</option><option value='nullalis-live-2026-03-25'>Nullalis Auth-Failed Run (nullalis-live-2026-03-25)</option><option value='nullalis-scale-probe'>Nullalis Scale Fairness Probe (nullalis-scale-probe)</option><option value='nullalis-targeted-2026-03-24'>Nullalis Degraded Recovery Run (nullalis-targeted-2026-03-24)</option></select>
@@ -30,6 +30,10 @@ <h2>Is this only for Nullalis?</h2>
<p>No. Nullalis is the current reference runtime because it produced the first strong public artifact.</p>
<h2>Can I run TwinBench quickly?</h2>
<p>Yes. Use the demo path from the repo or run against a native runtime with one command.</p>
+<h2>Why can some dimensions be unavailable?</h2>
+<p>Because some systems do not expose the runtime surfaces required for a fair direct measurement. TwinBench shows that honestly rather than hiding it.</p>
+<h2>Why does coverage matter?</h2>
+<p>Coverage shows how much of the benchmark was truly exercised. A flattering score with weak coverage should not outrank a strong, deeply measured artifact.</p>
<h2>What if my assistant only supports part of the benchmark?</h2>
<p>That is still useful. TwinBench prefers honest partial artifacts over fake comparability.</p>