Polish TwinBench for S-tier public launch

Muhra95 · Muhra95 · commit df360c6dc948 · 2026-03-25T09:51:27.000+01:00
diff --git a/Makefile b/Makefile
@@ -0,0 +1,16 @@
+PYTHON ?= python3.10
+URL ?= http://localhost:8080
+TOKEN ?=
+USER_ID ?= 1
+NAME ?= My Runtime
+
+.PHONY: preflight run run-nullalis
+
+preflight:
+	bash scripts/preflight.sh "$(URL)" "$(TOKEN)" "$(USER_ID)"
+
+run:
+	bash scripts/run_twinbench.sh "$(URL)" "$(TOKEN)" "$(NAME)" "$(USER_ID)"
+
+run-nullalis:
+	bash scripts/run_twinbench_nullalis_local.sh
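
Reviewer note: the `preflight` and `run` targets pass their variables positionally, and the order differs between the two scripts (`USER_ID` is third for preflight but fourth for run, after `NAME`). The Python sketch below mirrors the two recipes as argument lists so the mapping is easy to check. The function name `build_commands` is hypothetical and not part of this commit; the defaults are copied from the Makefile variables.

```python
import shlex


def build_commands(url="http://localhost:8080", token="", name="My Runtime", user_id="1"):
    """Mirror the Makefile recipes as argv lists (hypothetical helper, not in the repo)."""
    # preflight: URL, TOKEN, USER_ID
    preflight = ["bash", "scripts/preflight.sh", url, token, user_id]
    # run: URL, TOKEN, NAME, USER_ID (NAME comes before USER_ID)
    run = ["bash", "scripts/run_twinbench.sh", url, token, name, user_id]
    return {"preflight": preflight, "run": run}


if __name__ == "__main__":
    # Print each command in shell-quoted form for comparison with the recipes.
    for target, argv in build_commands(token="YOUR_TOKEN").items():
        print(target, "->", shlex.join(argv))
```

Because each variable is double-quoted in the recipes, an empty `TOKEN` is still passed as an empty positional argument, so the later arguments keep their positions.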
diff --git a/README.md b/README.md
@@ -22,11 +22,20 @@ Local Nullalis:
 python3.10 -m harness.runner --url http://127.0.0.1:3000 --token-from-nullalis-config --user-id 1 --name "Nullalis Local" --output results/nullalis-local.json --markdown results/nullalis-local.md --html results/nullalis-local.html
 ```
 
+Scripted shortcuts:
+
+```bash
+make preflight URL=http://localhost:8080 TOKEN=YOUR_TOKEN
+make run URL=http://localhost:8080 TOKEN=YOUR_TOKEN NAME="My Runtime"
+make run-nullalis
+```
+
 Quick links:
 - [Overview](docs/OVERVIEW.md)
 - [Why TwinBench](docs/WHY_TWINBENCH.md)
 - [Getting Started in 10 Minutes](docs/GETTING_STARTED.md)
 - [Run with Agents](docs/AGENT_RUN_GUIDE.md)
+- [Troubleshooting](docs/TROUBLESHOOTING.md)
 - [Run Profiles](docs/RUN_PROFILES.md)
 - [Compatibility Checklist](docs/COMPATIBILITY_CHECKLIST.md)
 - [Preflight Checklist](docs/PREFLIGHT.md)
diff --git a/assets/README.md b/assets/README.md
@@ -0,0 +1,7 @@
+# TwinBench Public Assets
+
+This folder contains lightweight visual assets for launch and social sharing.
+
+- `benchmark-score-card.svg`
+- `ten-dimensions.svg`
+- `what-current-benchmarks-miss.svg`
diff --git a/assets/benchmark-score-card.svg b/assets/benchmark-score-card.svg
@@ -0,0 +1,11 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="1200" height="630" viewBox="0 0 1200 630">
+  <rect width="1200" height="630" fill="#0f172a"/>
+  <rect x="48" y="48" width="1104" height="534" rx="28" fill="#111827" stroke="#334155"/>
+  <text x="80" y="132" fill="#e2e8f0" font-family="Arial, sans-serif" font-size="54" font-weight="700">TwinBench</text>
+  <text x="80" y="180" fill="#93c5fd" font-family="Arial, sans-serif" font-size="26">Open Benchmark for Personal AI Assistant Runtimes</text>
+  <text x="80" y="280" fill="#f8fafc" font-family="Arial, sans-serif" font-size="132" font-weight="700">75.9</text>
+  <text x="80" y="328" fill="#cbd5e1" font-family="Arial, sans-serif" font-size="28">Coverage-adjusted verified score</text>
+  <text x="80" y="405" fill="#22c55e" font-family="Arial, sans-serif" font-size="34" font-weight="700">Production-Grade</text>
+  <text x="80" y="490" fill="#cbd5e1" font-family="Arial, sans-serif" font-size="28">Remember. Act. Follow up. Stay safe. Operate over time.</text>
+  <text x="80" y="550" fill="#94a3b8" font-family="Arial, sans-serif" font-size="22">Published by Nova Nuggets</text>
+</svg>
diff --git a/assets/ten-dimensions.svg b/assets/ten-dimensions.svg
@@ -0,0 +1,17 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="1200" height="630" viewBox="0 0 1200 630">
+  <rect width="1200" height="630" fill="#f8fafc"/>
+  <text x="60" y="90" fill="#0f172a" font-family="Arial, sans-serif" font-size="46" font-weight="700">TwinBench: 10 Dimensions</text>
+  <text x="60" y="130" fill="#475569" font-family="Arial, sans-serif" font-size="24">Personal AI assistant runtimes are more than one-shot prompts.</text>
+  <g fill="#0f172a" font-family="Arial, sans-serif" font-size="26">
+    <text x="80" y="200">1. Autonomy Control</text>
+    <text x="80" y="245">2. Memory Persistence</text>
+    <text x="80" y="290">3. Functional Capability</text>
+    <text x="80" y="335">4. Autonomous Execution</text>
+    <text x="80" y="380">5. Cross-Channel Consistency</text>
+    <text x="640" y="200">6. Integration Breadth</text>
+    <text x="640" y="245">7. Security &amp; Privacy</text>
+    <text x="640" y="290">8. Scale &amp; Cost Efficiency</text>
+    <text x="640" y="335">9. Operational Resilience</text>
+    <text x="640" y="380">10. Latency Profile</text>
+  </g>
+</svg>
diff --git a/assets/what-current-benchmarks-miss.svg b/assets/what-current-benchmarks-miss.svg
@@ -0,0 +1,19 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="1200" height="630" viewBox="0 0 1200 630">
+  <rect width="1200" height="630" fill="#ffffff"/>
+  <text x="60" y="90" fill="#0f172a" font-family="Arial, sans-serif" font-size="44" font-weight="700">What Current Benchmarks Miss</text>
+  <text x="60" y="150" fill="#64748b" font-family="Arial, sans-serif" font-size="26"><tspan x="60" dy="0">Coding, task, and memory benchmarks matter.</tspan><tspan x="60" dy="34">They still do not define the runtime category behind personal AI assistants.</tspan></text>
+  <g font-family="Arial, sans-serif">
+    <rect x="60" y="220" width="500" height="280" rx="24" fill="#f1f5f9"/>
+    <text x="90" y="270" fill="#0f172a" font-size="30" font-weight="700">Current benchmark focus</text>
+    <text x="90" y="325" fill="#334155" font-size="24">• coding ability</text>
+    <text x="90" y="365" fill="#334155" font-size="24">• task completion</text>
+    <text x="90" y="405" fill="#334155" font-size="24">• isolated memory recall</text>
+    <text x="90" y="445" fill="#334155" font-size="24">• prompt-level output quality</text>
+    <rect x="640" y="220" width="500" height="280" rx="24" fill="#dbeafe"/>
+    <text x="670" y="270" fill="#0f172a" font-size="30" font-weight="700">TwinBench focus</text>
+    <text x="670" y="325" fill="#1e3a8a" font-size="24">• durable memory</text>
+    <text x="670" y="365" fill="#1e3a8a" font-size="24">• autonomous execution</text>
+    <text x="670" y="405" fill="#1e3a8a" font-size="24">• cross-channel coherence</text>
+    <text x="670" y="445" fill="#1e3a8a" font-size="24">• safety, scale, resilience</text>
+  </g>
+</svg>
diff --git a/docs/AGENT_RUN_GUIDE.md b/docs/AGENT_RUN_GUIDE.md
@@ -24,6 +24,22 @@ python3.10 -m harness.runner --url http://127.0.0.1:3000 --token-from-nullalis-c
 Run TwinBench against this runtime at URL X using token Y. First perform the preflight checks, then run the harness, save JSON, Markdown, and HTML artifacts, and summarize the verified score, projected score, measured coverage, dimension statuses, and any unavailable dimensions with reason codes.
 ```
 
+## Scripted Shortcuts
+
+```bash
+bash scripts/preflight.sh YOUR_URL YOUR_TOKEN 1
+bash scripts/run_twinbench.sh YOUR_URL YOUR_TOKEN "Your Runtime" 1
+bash scripts/run_twinbench_nullalis_local.sh
+```
+
+Or with `make`:
+
+```bash
+make preflight URL=YOUR_URL TOKEN=YOUR_TOKEN
+make run URL=YOUR_URL TOKEN=YOUR_TOKEN NAME="Your Runtime"
+make run-nullalis
+```
+
 ## Required Preflight Sequence
 
 1. Check `/health`
diff --git a/docs/FAIRNESS_EXAMPLES.md b/docs/FAIRNESS_EXAMPLES.md
@@ -0,0 +1,43 @@
+# Fairness Examples
+
+TwinBench is only useful if weak-looking results are interpreted correctly.
+
+## Example 1: Scale
+
+Bad interpretation:
+
+- “This runtime has poor scale.”
+
+Correct interpretation:
+
+- “This runtime did not expose a fair multi-user bootstrap path for this run, so multi-user scale readiness remained only partially measurable.”
+
+## Example 2: Missing Diagnostics
+
+Bad interpretation:
+
+- “The runtime has no operational introspection.”
+
+Correct interpretation:
+
+- “The benchmark could not access a machine-readable diagnostics surface, so introspection-backed checks were unavailable.”
+
+## Example 3: Same-User Contention
+
+Bad interpretation:
+
+- “The runtime failed concurrency.”
+
+Correct interpretation:
+
+- “The runtime serialized same-user work, which is a contention diagnostic rather than the primary multi-user scale claim.”
+
+## Example 4: Contract Mismatch
+
+Bad interpretation:
+
+- “The runtime scored badly against TwinBench.”
+
+Correct interpretation:
+
+- “The runtime was not directly contract-compatible, so the result should be treated as unsupported or adapter-needed rather than as evidence of weak capability.”
diff --git a/docs/GETTING_STARTED.md b/docs/GETTING_STARTED.md
@@ -25,6 +25,8 @@ Use [docs/PREFLIGHT.md](PREFLIGHT.md) if you want a structured check.
 
 ## Step 3: Run the benchmark
 
+If you want an agent to do this for you, use [AGENT_RUN_GUIDE.md](AGENT_RUN_GUIDE.md).
+
 Generic runtime:
 
 ```bash
@@ -80,3 +82,5 @@ Do not force it. Start with:
 - [docs/INTEGRATION_PATHS.md](INTEGRATION_PATHS.md)
 
 TwinBench is strongest when unsupported behavior is reported honestly rather than hidden.
+
+If something goes wrong, start with [TROUBLESHOOTING.md](TROUBLESHOOTING.md).
diff --git a/docs/GITHUB_POLISH.md b/docs/GITHUB_POLISH.md
@@ -0,0 +1,32 @@
+# GitHub Polish Checklist
+
+These settings live outside the repo files but should be updated before public launch.
+
+## Repository Description
+
+TwinBench: Open benchmark for personal AI assistant runtimes.
+
+## Suggested Topics
+
+- ai
+- benchmark
+- agents
+- personal-ai
+- ai-assistants
+- runtime
+- evaluation
+- memory
+- autonomous-agents
+
+## About Link
+
+- main repo URL
+- optionally Nova Nuggets homepage
+
+## Social Preview
+
+Use a benchmark card that includes:
+
+- TwinBench
+- Open Benchmark for Personal AI Assistant Runtimes
+- one short line: remember, act, follow up, stay safe, operate over time
diff --git a/docs/OVERVIEW.md b/docs/OVERVIEW.md
@@ -39,6 +39,7 @@ For the stronger launch rationale, read [WHY_TWINBENCH.md](WHY_TWINBENCH.md).
 - it publishes raw artifacts
 - it treats unsupported dimensions honestly
 - it is designed to be vendor-neutral and beatable in public
+- it is runnable by both human operators and coding agents
 
 ## What the Benchmark Produces
 
@@ -56,3 +57,4 @@ Each run produces:
 - preflight: [PREFLIGHT.md](PREFLIGHT.md)
 - public results: [RESULTS_INDEX.md](RESULTS_INDEX.md)
 - agent operators: [AGENT_RUN_GUIDE.md](AGENT_RUN_GUIDE.md)
+- launch rationale: [WHY_TWINBENCH.md](WHY_TWINBENCH.md)
diff --git a/docs/REFERENCE_RESULT_POLICY.md b/docs/REFERENCE_RESULT_POLICY.md
@@ -0,0 +1,39 @@
+# Reference Result Policy
+
+TwinBench uses several result classes.
+
+## Canonical Reference Result
+
+A checked-in, artifact-backed run that the repo uses as the main public example.
+
+Requirements:
+
+- public harness artifact
+- coherent JSON and report outputs
+- clear interpretation notes
+
+## Supporting Artifact
+
+A useful secondary artifact that explains a narrower point.
+
+Examples:
+
+- scale-only probe
+- degraded run
+- smoke run
+
+## Degraded Artifact
+
+A real run affected by runtime outage, upstream outage, or partial environment failure.
+
+These should not be hidden. They help prove the benchmark is honest.
+
+## External Estimate
+
+A non-ranked comparison row derived from public evidence, not from a verified live harness run.
+
+## Current Policy
+
+- keep the existing Nullalis full run as the canonical reference artifact
+- keep the scale-only probe as a supporting artifact
+- note publicly that scale interpretation in the canonical full artifact is conservative relative to later provisioning-aware fixes
diff --git a/docs/RESULTS_INDEX.md b/docs/RESULTS_INDEX.md
@@ -19,12 +19,14 @@ Headline:
 - Projected: `87.6`
 - Rating: `Production-Grade`
 - Result class: `reference runtime artifact`
+- Submit your runtime: [HOW_TO_SUBMIT.md](HOW_TO_SUBMIT.md)
 
 Notes:
 
 - strong core runtime behavior
 - earlier scale interpretation was conservative before provisioning-aware fixes
 - Nullalis is treated as the reference runtime, not the benchmark owner
+- scale interpretation is conservative relative to later provisioning-aware fixes
 
 ## How to Read a Result
 
@@ -43,6 +45,7 @@ Then check:
 These fields tell you whether a weak dimension was truly measured, only partially measured, unavailable, or blocked by environment or contract limitations.
 
 Use [ARTIFACT_SCHEMA.md](ARTIFACT_SCHEMA.md) for the field guide.
+See [REFERENCE_RESULT_POLICY.md](REFERENCE_RESULT_POLICY.md) for how TwinBench treats canonical, supporting, degraded, and external artifacts.
 
 ## Supporting Artifacts
 
diff --git a/docs/TROUBLESHOOTING.md b/docs/TROUBLESHOOTING.md
@@ -0,0 +1,90 @@
+# Troubleshooting
+
+This page covers the most common TwinBench failures.
+
+## 1. Auth Mismatch
+
+Symptoms:
+
+- `401 Unauthorized`
+- diagnostics unavailable
+- chat stream rejected immediately
+
+What it usually means:
+
+- wrong internal token
+- wrong auth header
+- local runtime token changed
+
+What to do:
+
+- verify `/internal/diagnostics` manually
+- verify `/api/v1/chat/stream` manually
+- for local Nullalis, prefer `--token-from-nullalis-config`
+
+## 2. Missing Diagnostics
+
+Symptoms:
+
+- `/internal/diagnostics` returns HTML, 404, or auth failure
+
+What it usually means:
+
+- runtime does not expose a machine-readable diagnostics surface
+- route exists in a control UI only
+
+What to do:
+
+- mark the dimension as unsupported or partially measurable
+- do not replace diagnostics with narrative claims
+
+## 3. Missing Bootstrap
+
+Symptoms:
+
+- multi-user scale requests fail with `unknown_user_id`
+- `/api/v1/users/provision` fails for benchmark users
+
+What it usually means:
+
+- tenant identity bootstrap is unavailable
+- benchmark users do not exist in the runtime’s identity layer
+
+What to do:
+
+- treat multi-user scale as unavailable or partially measured
+- do not interpret this as poor throughput by default
+
+## 4. Contract Mismatch
+
+Symptoms:
+
+- no `/api/v1/chat/stream`
+- no SSE response
+- control plane is WS-first or CLI-first
+
+What it usually means:
+
+- runtime is real, but not directly TwinBench-compatible yet
+
+What to do:
+
+- use [INTEGRATION_PATHS.md](INTEGRATION_PATHS.md)
+- classify it honestly as adapter-needed or partial
+
+## 5. Long-Running Turns
+
+Symptoms:
+
+- the benchmark appears stuck on one dimension
+- open-ended mode takes a very long time
+
+What it usually means:
+
+- the runtime is still working
+- or a turn is blocked and needs a bounded timeout policy
+
+What to do:
+
+- prefer bounded or adaptive timeouts for public runs
+- use open-ended mode only when “faster is better” is the explicit benchmark policy
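
Reviewer note: the symptom lists in sections 1 and 2 of the troubleshooting page can be read as a small decision table for a manual `/internal/diagnostics` check. The sketch below is one possible encoding, assuming only the symptoms named on that page (`401`, `404`, an HTML body). The function name and the returned labels are hypothetical and are not harness reason codes.

```python
def classify_diagnostics_response(status: int, content_type: str) -> str:
    """Map a /internal/diagnostics response to a troubleshooting bucket.

    The labels are illustrative assumptions, not TwinBench reason codes.
    """
    ctype = content_type.lower()
    if status == 401:
        # Section 1: wrong token, wrong auth header, or a changed local token.
        return "auth_mismatch"
    if status == 404 or "html" in ctype:
        # Section 2: no machine-readable surface, or the route is UI-only.
        return "missing_diagnostics"
    if status == 200 and "json" in ctype:
        return "ok"
    # Anything else needs a manual look before it is interpreted.
    return "unknown"


if __name__ == "__main__":
    samples = [(401, "application/json"), (200, "text/html"), (200, "application/json")]
    for status, ctype in samples:
        print(status, ctype, "->", classify_diagnostics_response(status, ctype))
```

A result of `missing_diagnostics` corresponds to marking the dimension unsupported or partially measurable, as the page advises, rather than scoring it as a failure.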
diff --git a/harness/report.py b/harness/report.py
diff --git a/results/nullalis-live-2026-03-25-openended.md b/results/nullalis-live-2026-03-25-openended.md
diff --git a/scripts/preflight.sh b/scripts/preflight.sh
diff --git a/scripts/run_twinbench.sh b/scripts/run_twinbench.sh
diff --git a/scripts/run_twinbench_nullalis_local.sh b/scripts/run_twinbench_nullalis_local.sh