Commit df360c6

Polish TwinBench for S-tier public launch
1 parent dcf341d commit df360c6

19 files changed

Lines changed: 540 additions & 16 deletions

Makefile

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+PYTHON ?= python3.10
+URL ?= http://localhost:8080
+TOKEN ?=
+USER_ID ?= 1
+NAME ?= My Runtime
+
+.PHONY: preflight run run-nullalis
+
+preflight:
+	bash scripts/preflight.sh "$(URL)" "$(TOKEN)" "$(USER_ID)"
+
+run:
+	bash scripts/run_twinbench.sh "$(URL)" "$(TOKEN)" "$(NAME)" "$(USER_ID)"
+
+run-nullalis:
+	bash scripts/run_twinbench_nullalis_local.sh
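
Note on the Makefile defaults: GNU make's `?=` assigns a value only when the variable is not already defined, so anything passed on the command line or through the environment takes precedence. As a rough illustration (not part of the commit), the hypothetical shell function below prints the command that `make run` would execute under the same defaults. The name `twinbench_run_cmd` is invented for this sketch, and `${VAR-default}` is used because, like `?=`, it leaves a variable that is set but empty untouched.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the commit: mirrors the Makefile's
# `?=` defaults and prints the command line that `make run` would execute.
twinbench_run_cmd() {
  # ${VAR-default} falls back only when VAR is unset, like make's `?=`.
  local url="${URL-http://localhost:8080}"
  local token="${TOKEN-}"
  local name="${NAME-My Runtime}"
  local user_id="${USER_ID-1}"
  printf 'bash scripts/run_twinbench.sh "%s" "%s" "%s" "%s"\n' \
    "$url" "$token" "$name" "$user_id"
}
```

To inspect the real expansion without running anything, `make -n run URL=... TOKEN=...` prints the recipe that make would execute.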

README.md

Lines changed: 9 additions & 0 deletions
@@ -22,11 +22,20 @@ Local Nullalis:
 python3.10 -m harness.runner --url http://127.0.0.1:3000 --token-from-nullalis-config --user-id 1 --name "Nullalis Local" --output results/nullalis-local.json --markdown results/nullalis-local.md --html results/nullalis-local.html
 ```
 
+Scripted shortcuts:
+
+```bash
+make preflight URL=http://localhost:8080 TOKEN=YOUR_TOKEN
+make run URL=http://localhost:8080 TOKEN=YOUR_TOKEN NAME="My Runtime"
+make run-nullalis
+```
+
 Quick links:
 - [Overview](docs/OVERVIEW.md)
 - [Why TwinBench](docs/WHY_TWINBENCH.md)
 - [Getting Started in 10 Minutes](docs/GETTING_STARTED.md)
 - [Run with Agents](docs/AGENT_RUN_GUIDE.md)
+- [Troubleshooting](docs/TROUBLESHOOTING.md)
 - [Run Profiles](docs/RUN_PROFILES.md)
 - [Compatibility Checklist](docs/COMPATIBILITY_CHECKLIST.md)
 - [Preflight Checklist](docs/PREFLIGHT.md)

assets/README.md

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+# TwinBench Public Assets
+
+This folder contains lightweight visual assets for launch and social sharing.
+
+- `benchmark-score-card.svg`
+- `ten-dimensions.svg`
+- `what-current-benchmarks-miss.svg`

assets/benchmark-score-card.svg

Lines changed: 11 additions & 0 deletions

assets/ten-dimensions.svg

Lines changed: 17 additions & 0 deletions
Lines changed: 19 additions & 0 deletions

docs/AGENT_RUN_GUIDE.md

Lines changed: 16 additions & 0 deletions
@@ -24,6 +24,22 @@ python3.10 -m harness.runner --url http://127.0.0.1:3000 --token-from-nullalis-c
 Run TwinBench against this runtime at URL X using token Y. First perform the preflight checks, then run the harness, save JSON, Markdown, and HTML artifacts, and summarize the verified score, projected score, measured coverage, dimension statuses, and any unavailable dimensions with reason codes.
 ```
 
+## Scripted Shortcuts
+
+```bash
+bash scripts/preflight.sh YOUR_URL YOUR_TOKEN 1
+bash scripts/run_twinbench.sh YOUR_URL YOUR_TOKEN "Your Runtime" 1
+bash scripts/run_twinbench_nullalis_local.sh
+```
+
+Or with `make`:
+
+```bash
+make preflight URL=YOUR_URL TOKEN=YOUR_TOKEN
+make run URL=YOUR_URL TOKEN=YOUR_TOKEN NAME="Your Runtime"
+make run-nullalis
+```
+
 ## Required Preflight Sequence
 
 1. Check `/health`
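
Note on the first preflight step: the hunk above only names the `/health` path. As a minimal sketch, assuming the endpoint answers a plain unauthenticated GET with a non-error status when the runtime is up (the commit does not show this), a manual check could look like the following. `health_url` and `check_health` are hypothetical helper names and are not taken from `scripts/preflight.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not taken from scripts/preflight.sh.

# Build the health URL, dropping one trailing slash from the base URL.
health_url() {
  printf '%s/health\n' "${1%/}"
}

# Succeed (exit 0) if the endpoint answers with an HTTP status below 400.
# Assumption: /health accepts an unauthenticated GET.
check_health() {
  curl -fsS --max-time 5 -o /dev/null "$(health_url "$1")"
}
```

A typical call would be `check_health http://localhost:8080 && echo "runtime is reachable"`. If the runtime requires the token even for `/health`, the request would need an authorization header, whose format this commit does not specify.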

docs/FAIRNESS_EXAMPLES.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+# Fairness Examples
+
+TwinBench is only useful if weak-looking results are interpreted correctly.
+
+## Example 1: Scale
+
+Bad interpretation:
+
+- “This runtime has poor scale.”
+
+Correct interpretation:
+
+- “This runtime did not expose fair multi-user bootstrap for this run, so multi-user scale readiness remained partially measurable.”
+
+## Example 2: Missing Diagnostics
+
+Bad interpretation:
+
+- “The runtime has no operational introspection.”
+
+Correct interpretation:
+
+- “The benchmark could not access a machine-readable diagnostics surface, so introspection-backed checks were unavailable.”
+
+## Example 3: Same-User Contention
+
+Bad interpretation:
+
+- “The runtime failed concurrency.”
+
+Correct interpretation:
+
+- “The runtime serialized same-user work, which is a contention diagnostic rather than the primary multi-user scale claim.”
+
+## Example 4: Contract Mismatch
+
+Bad interpretation:
+
+- “The runtime scored badly against TwinBench.”
+
+Correct interpretation:
+
+- “The runtime was not directly contract-compatible, so the result should be treated as unsupported or adapter-needed rather than weak capability.”

docs/GETTING_STARTED.md

Lines changed: 4 additions & 0 deletions
@@ -25,6 +25,8 @@ Use [docs/PREFLIGHT.md](PREFLIGHT.md) if you want a structured check.
 
 ## Step 3: Run the benchmark
 
+If you want an agent to do this for you, use [AGENT_RUN_GUIDE.md](AGENT_RUN_GUIDE.md).
+
 Generic runtime:
 
 ```bash
@@ -80,3 +82,5 @@ Do not force it. Start with:
 - [docs/INTEGRATION_PATHS.md](INTEGRATION_PATHS.md)
 
 TwinBench is strongest when unsupported behavior is reported honestly rather than hidden.
+
+If something goes wrong, start with [TROUBLESHOOTING.md](TROUBLESHOOTING.md).

docs/GITHUB_POLISH.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+# GitHub Polish Checklist
+
+These settings are outside repo files but should be updated for public launch.
+
+## Repository Description
+
+TwinBench: Open benchmark for personal AI assistant runtimes.
+
+## Suggested Topics
+
+- ai
+- benchmark
+- agents
+- personal-ai
+- ai-assistants
+- runtime
+- evaluation
+- memory
+- autonomous-agents
+
+## About Link
+
+- main repo URL
+- optionally Nova Nuggets homepage
+
+## Social Preview
+
+Use a benchmark card that includes:
+
+- TwinBench
+- Open Benchmark for Personal AI Assistant Runtimes
+- one short line: remember, act, follow up, stay safe, operate over time
