<p align="center">
<h1 align="center">DTaaS-Bench</h1>
<p align="center">
<strong>Open benchmark for persistent autonomous AI agent runtimes.</strong>
</p>
<p align="center">
<a href="SPECIFICATION.md">Specification</a> ·
<a href="results/nullalis-v0.2-report.md">Results</a> ·
<a href="CONTRIBUTING.md">Contribute</a> ·
<a href="https://novanuggets.com">Nova Nuggets</a>
</p>

**None of them measure what a Digital Twin runtime must do: remember durably, act autonomously, operate across multiple channels, enforce safety on background turns, and scale to thousands of users — all at once.**

DTaaS-Bench does. 10 dimensions. Two explicit composites (`verified` + `projected`). Open harness. Run it against your runtime today.

---

## Verified Leaderboard (March 2026)

Only runs produced by this harness, with raw artifacts attached, are ranked here.

| Runtime | Type | Verified (coverage-adjusted) | Coverage | Projected | Rating |
|---------|------|------------------------------|----------|-----------|--------|
| **[Nullalis v0.2 live artifact](results/nullalis-v0.2-report.md)** | Full runtime | **64.4** | **78.9%** | **79.8** | **Competitive** |

## External Comparison (Unverified, not ranked)

External runtime rows are reference estimates from public documentation and are intentionally excluded from the verified leaderboard.

| Runtime | Type | Status |
|---------|------|--------|
| [OpenClaw](https://github.com/openclaw) | Full runtime | Unverified external estimate |
| [NanoBot](https://github.com/hkuds/nanobot) | Lightweight runtime | Unverified external estimate |
| [Letta](https://github.com/letta-ai/letta) | Agent framework | Unverified external estimate |
| [Mem0](https://mem0.ai) | Memory platform | Unverified external estimate |

Submit a harness artifact to replace any estimate with a verified score.

## v0.2 Integrity Highlights

- Coverage-aware leaderboard scoring is now the default.
- Every run publishes both `verified` and `projected` composites.
- External estimates are explicitly non-ranked.
- CI now validates harness integrity and the v0.2 artifact schema on every PR; a minimal sketch of such a check follows this list.
- The latest live gateway run completed in 4392.9s (no per-chat timeout mode) and reports full measured timing metadata in the artifact.
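
To make the CI claim concrete, here is a minimal sketch of the kind of artifact check such a pipeline might run. It is illustrative only: the real validator ships with the harness, and the required-key list is assumed from the Scoring Integrity Contract later in this README.

```python
# Illustrative sketch only; the actual CI validator lives in the harness.
# Required keys are assumed from the Scoring Integrity Contract below.
import json
import sys

REQUIRED_KEYS = {
    "verified_composite_score",
    "projected_composite_score",
    "measured_coverage",
    "coverage_adjusted_verified_score",
}

def check_artifact(path: str) -> None:
    """Fail loudly if a v0.2 artifact is missing any contract field."""
    with open(path) as f:
        artifact = json.load(f)
    missing = REQUIRED_KEYS - artifact.keys()
    if missing:
        sys.exit(f"{path}: missing v0.2 fields: {sorted(missing)}")
    print(f"{path}: all v0.2 scoring fields present")

check_artifact("results/nullalis-v0.2.json")
```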

---

| Score | Rating | Meaning |
|-------|--------|---------|
| 25-39 | **Specialized** | Strong in specific dimensions, not full-stack |
| <25 | **Early Stage** | Research or proof-of-concept |

## Scoring Integrity Contract

Every run reports:
- `verified_composite_score` (measured components only)
- `projected_composite_score` (includes projection assumptions)
- `measured_coverage` (how much of the scoring weight was actually measured)
- `coverage_adjusted_verified_score` (used for tiering in v0.2)

The benchmark never promotes an unverified external estimate to leaderboard status.
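
As a worked example, the sketch below assumes the coverage adjustment is a straight multiplication of the verified composite by the measured coverage. Both the formula and the numbers are illustrative assumptions; SPECIFICATION.md is authoritative.

```python
# Hypothetical numbers; the adjustment formula is assumed, not quoted
# from SPECIFICATION.md.
verified_composite_score = 80.0  # composite over measured components only
measured_coverage = 0.75         # fraction of scoring weight actually measured

# Assumed adjustment: discount the verified composite by unmeasured weight.
coverage_adjusted_verified_score = verified_composite_score * measured_coverage
print(coverage_adjusted_verified_score)  # 60.0 -> the value used for tiering
```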

---

## Quick Start

```bash
git clone https://github.com/ProjectNuggets/DTaaS-benchmark.git
cd DTaaS-benchmark
python3.10 -m pip install -r harness/requirements.txt

# Run all 10 dimensions
python3.10 -m harness.runner \
  --url http://localhost:8080 \
  --token YOUR_TOKEN \
  --user-id 1 \
  --name "My Runtime"

# Run specific dimensions
python3.10 -m harness.runner \
  --url http://localhost:8080 \
  --token YOUR_TOKEN \
  --user-id 1 \
  --dimensions memory,security,functional

# Full report suite (JSON + Markdown + HTML)
python3.10 -m harness.runner \
  --url http://localhost:8080 \
  --token YOUR_TOKEN \
  --user-id 1 \
  ...  # remaining report flags elided in this excerpt
```

The harness produces a terminal summary table, a machine-readable JSON file, a Markdown report, and a self-contained HTML report with score bars and tier badges.
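
As a quick post-run check, the composites can be read straight from the JSON report. The filename below is hypothetical (it depends on how you invoke the runner); the field names follow the Scoring Integrity Contract above, and coverage is assumed to be stored as a fraction.

```python
# Post-run inspection sketch; point it at your run's JSON output.
import json

with open("results/my-runtime.json") as f:  # hypothetical filename
    run = json.load(f)

print(f"verified:  {run['verified_composite_score']}")
print(f"projected: {run['projected_composite_score']}")
print(f"coverage:  {run['measured_coverage']:.1%}")  # assumes a 0-1 fraction
```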

Runtime requirement:
- Python 3.10+ (the harness uses modern type syntax)

---

## How It Works

The harness talks to any runtime via HTTP:

| Endpoint | Method | Purpose | Required |
|----------|--------|---------|----------|
| `/internal/diagnostics` | GET | Runtime introspection | Yes |
| `/metrics` | GET | Prometheus metrics | Optional |

Each dimension script sends real requests to the runtime, parses responses, and scores them against the [SPECIFICATION](SPECIFICATION.md). Measured and projected components are surfaced separately.
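
For a rough picture of what that looks like on the wire, the sketch below probes the diagnostics endpoint from the table above using only the standard library. The bearer-token header is an assumption inferred from the runner's `--token` flag; your runtime's auth scheme may differ.

```python
# Minimal probe of the /internal/diagnostics endpoint; stdlib only.
import json
import urllib.request

BASE_URL = "http://localhost:8080"
TOKEN = "YOUR_TOKEN"  # the same value you would pass to --token

req = urllib.request.Request(
    f"{BASE_URL}/internal/diagnostics",
    headers={"Authorization": f"Bearer {TOKEN}"},  # assumed auth scheme
)
with urllib.request.urlopen(req) as resp:
    diagnostics = json.load(resp)

# A real dimension script parses and scores this response against
# SPECIFICATION.md; here we just confirm the endpoint answers.
print(json.dumps(diagnostics, indent=2)[:400])
```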

---

```
DTaaS-benchmark/
...
│   ├── attack_payloads.json       Path traversal + SSRF inputs
│   └── schedules.json             Task scheduling definitions
└── results/
    ├── nullalis-v0.1.json         Legacy v0.1 reference artifact
    ├── nullalis-v0.1-report.md    Legacy v0.1 report
    ├── nullalis-v0.2.json         v0.2 live gateway artifact
    └── nullalis-v0.2-report.md    v0.2 live gateway report
```

---

<p align="center">
  <strong>Published by <a href="https://novanuggets.com">Nova Nuggets</a></strong><br>
  Handcrafted Intelligence. Own Your AI.<br><br>
  <strong>Nullalis</strong> (ZAKI BOT) is the current reference implementation for DTaaS-Bench.
</p>