
Commit 196d37a

benchmark: add dynamic timeout policy and publish live v0.2 artifact

1 parent fcda57c

22 files changed: 1185 additions & 223 deletions

.github/workflows/harness-ci.yml

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+name: Harness CI
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  harness-ci:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Setup Python 3.10
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -r harness/requirements.txt
+
+      - name: Compile harness sources
+        run: python -m py_compile harness/*.py
+
+      - name: Runner CLI smoke
+        run: python -m harness.runner --help
+
+      - name: Validate v0.2 result artifact schema
+        run: |
+          python - <<'PY'
+          import json
+          from pathlib import Path
+
+          p = Path("results/nullalis-v0.2.json")
+          if not p.exists():
+              raise SystemExit("results/nullalis-v0.2.json not found")
+
+          data = json.loads(p.read_text())
+          required = [
+              "benchmark_version",
+              "verified_composite_score",
+              "projected_composite_score",
+              "measured_coverage",
+              "coverage_adjusted_verified_score",
+              "dimension_verified_scores",
+              "dimension_projected_scores",
+              "dimension_measured_coverage",
+          ]
+          missing = [k for k in required if k not in data]
+          if missing:
+              raise SystemExit(f"missing keys: {missing}")
+          if data["benchmark_version"] != "0.2":
+              raise SystemExit("benchmark_version must be 0.2")
+          print("artifact schema ok")
+          PY
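The schema gate embedded in the workflow above can also be run outside CI against a local artifact. A minimal standalone sketch of the same check (the helper name is illustrative; the required keys are taken verbatim from the workflow):

```python
# Keys required by the v0.2 artifact schema, mirroring the CI step above.
REQUIRED = [
    "benchmark_version",
    "verified_composite_score",
    "projected_composite_score",
    "measured_coverage",
    "coverage_adjusted_verified_score",
    "dimension_verified_scores",
    "dimension_projected_scores",
    "dimension_measured_coverage",
]

def validate_artifact(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the artifact passes."""
    problems = [f"missing key: {k}" for k in REQUIRED if k not in data]
    if data.get("benchmark_version") != "0.2":
        problems.append("benchmark_version must be 0.2")
    return problems
```

Feed it the parsed JSON of your own results file (e.g. `json.loads(Path("results/your-runtime.json").read_text())`) before opening a submission.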

CONTRIBUTING.md

Lines changed: 7 additions & 1 deletion
@@ -6,12 +6,18 @@ We welcome contributions from the community. Here's how you can help.
 
 The most impactful contribution is running DTaaS-Bench against your runtime and submitting verified results.
 
-1. Run the harness: `python -m harness.runner --url YOUR_URL --token YOUR_TOKEN --user-id 1 --name "Your Runtime" --output results/your-runtime.json`
+1. Run the harness: `python3.10 -m harness.runner --url YOUR_URL --token YOUR_TOKEN --user-id 1 --name "Your Runtime" --output results/your-runtime.json`
 2. Open an issue using the **Submit Results** template
 3. Attach your `results/your-runtime.json`
 
 We publish all verified results in the leaderboard.
 
+Required v0.2 fields for leaderboard eligibility:
+- `verified_composite_score`
+- `projected_composite_score`
+- `measured_coverage`
+- `coverage_adjusted_verified_score`
+
 ## Propose New Dimensions
 
 If you believe an important DTaaS capability is not covered by the current 10 dimensions, open an issue with:
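Before opening a Submit Results issue, the four eligibility fields listed in the CONTRIBUTING diff above can be sanity-checked in a few lines. A sketch (the helper name is illustrative; only the field names come from the diff):

```python
# The four v0.2 leaderboard-eligibility fields from CONTRIBUTING.md.
ELIGIBILITY_FIELDS = (
    "verified_composite_score",
    "projected_composite_score",
    "measured_coverage",
    "coverage_adjusted_verified_score",
)

def missing_eligibility_fields(artifact: dict) -> list[str]:
    """Return the eligibility fields absent from a parsed results artifact."""
    return [f for f in ELIGIBILITY_FIELDS if f not in artifact]
```

An empty return value means the artifact carries everything the leaderboard requires.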

README.md

Lines changed: 49 additions & 36 deletions
@@ -1,11 +1,11 @@
 <p align="center">
 <h1 align="center">DTaaS-Bench</h1>
 <p align="center">
-<strong>The first industry benchmark for persistent autonomous AI agent runtimes.</strong>
+<strong>Open benchmark for persistent autonomous AI agent runtimes.</strong>
 </p>
 <p align="center">
 <a href="SPECIFICATION.md">Specification</a> &middot;
-<a href="results/nullalis-v0.1-report.md">Results</a> &middot;
+<a href="results/nullalis-v0.2-report.md">Results</a> &middot;
 <a href="CONTRIBUTING.md">Contribute</a> &middot;
 <a href="https://novanuggets.com">Nova Nuggets</a>
 </p>
@@ -17,39 +17,38 @@
 
 **None of them measure what a Digital Twin runtime must do: remember durably, act autonomously, operate across multiple channels, enforce safety on background turns, and scale to thousands of users — all at once.**
 
-DTaaS-Bench does. 10 dimensions. One composite score. Open harness. Run it against your runtime today.
+DTaaS-Bench does. 10 dimensions. Two explicit composites (`verified` + `projected`). Open harness. Run it against your runtime today.
 
 ---
 
-## Leaderboard (March 2026)
+## Verified Leaderboard (March 2026)
 
-| Runtime | Type | Score | Rating | Best Dimension |
-|---------|------|-------|--------|----------------|
-| **[Nullalis v0.1](results/nullalis-v0.1-report.md)** | Full runtime | **87** | **Category Leader** | Integration Breadth (97) |
-| [OpenClaw](https://github.com/openclaw) (est.) | Full runtime | 62 | Competitive | Channels (75) |
-| [NanoBot](https://github.com/hkuds/nanobot) (est.) | Lightweight runtime | 52 | Emerging | Latency (70) |
-| [Letta](https://github.com/letta-ai/letta) (est.) | Agent framework | 46 | Emerging | Memory (85) |
-| [Mem0](https://mem0.ai) (est.) | Memory platform | 41 | Specialized | Memory (88) |
+Only runs produced by this harness with raw artifacts are ranked here.
 
-> Estimated scores based on public documentation (March 2026). **Submit verified results to join the leaderboard.**
+| Runtime | Type | Verified (coverage-adjusted) | Coverage | Projected | Rating |
+|---------|------|------------------------------|----------|-----------|--------|
+| **[Nullalis v0.2 live artifact](results/nullalis-v0.2-report.md)** | Full runtime | **64.4** | **78.9%** | **79.8** | **Competitive** |
 
-<details>
-<summary><strong>Full dimension breakdown</strong></summary>
+## External Comparison (Unverified, not ranked)
 
-| Dimension (weight) | Nullalis | OpenClaw | NanoBot | Letta | Mem0 |
-|--------------------|----------|----------|---------|-------|------|
-| Autonomy Control (0.15) | 96 | 45 | 30 | 30 | 15 |
-| Memory Persistence (0.15) | 91 | 70 | 55 | 85 | 88 |
-| Functional Capability (0.15) | 85 | 75 | 65 | 55 | 50 |
-| Autonomous Execution (0.12) | 85 | 55 | 50 | 25 | 10 |
-| Cross-Channel (0.12) | 85 | 75 | 50 | 15 | 10 |
-| Integration Breadth (0.08) | 97 | 70 | 35 | 35 | 25 |
-| Security & Privacy (0.08) | 89 | 35 | 40 | 40 | 30 |
-| Scale & Cost (0.05) | 74 | 55 | 65 | 55 | 65 |
-| Resilience (0.05) | 78 | 50 | 45 | 55 | 50 |
-| Latency (0.05) | 72 | 70 | 70 | 65 | 70 |
+External runtime rows are reference estimates from public documentation and are intentionally excluded from the verified leaderboard.
 
-</details>
+| Runtime | Type | Status |
+|---------|------|--------|
+| OpenClaw | Full runtime | Unverified external estimate |
+| NanoBot | Lightweight runtime | Unverified external estimate |
+| Letta | Agent framework | Unverified external estimate |
+| Mem0 | Memory platform | Unverified external estimate |
+
+Submit a harness artifact to replace any estimate with a verified score.
+
+## v0.2 Integrity Highlights
+
+- Coverage-aware leaderboard scoring is now default.
+- Every run publishes both `verified` and `projected` composites.
+- External estimates are explicitly non-ranked.
+- CI now validates harness integrity and v0.2 artifact schema on every PR.
+- Latest live gateway run completed in 4392.9s (no per-chat timeout mode) and reports full measured timing metadata in the artifact.
 
 ---
 
@@ -95,31 +94,41 @@ A Digital Twin as a Service runtime is fundamentally different:
 | 25-39 | **Specialized** | Strong in specific dimensions, not full-stack |
 | <25 | **Early Stage** | Research or proof-of-concept |
 
+## Scoring Integrity Contract
+
+Every run reports:
+- `verified_composite_score` (measured components only)
+- `projected_composite_score` (includes projection assumptions)
+- `measured_coverage` (how much of scoring weight was actually measured)
+- `coverage_adjusted_verified_score` (used for tiering in v0.2)
+
+The benchmark never promotes an unverified external estimate to leaderboard status.
+
 ---
 
 ## Quick Start
 
 ```bash
 git clone https://github.com/ProjectNuggets/DTaaS-benchmark.git
 cd DTaaS-benchmark
-pip install -r harness/requirements.txt
+python3.10 -m pip install -r harness/requirements.txt
 
 # Run all 10 dimensions
-python -m harness.runner \
+python3.10 -m harness.runner \
   --url http://localhost:8080 \
   --token YOUR_TOKEN \
   --user-id 1 \
   --name "My Runtime"
 
 # Run specific dimensions
-python -m harness.runner \
+python3.10 -m harness.runner \
   --url http://localhost:8080 \
   --token YOUR_TOKEN \
   --user-id 1 \
   --dimensions memory,security,functional
 
 # Full report suite (JSON + Markdown + HTML)
-python -m harness.runner \
+python3.10 -m harness.runner \
   --url http://localhost:8080 \
   --token YOUR_TOKEN \
   --user-id 1 \
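The relationship between the Scoring Integrity Contract fields above can be sketched under one plausible reading: the coverage adjustment as a simple multiplication of the verified composite by the measured-coverage fraction. This formula is an assumption for illustration only; SPECIFICATION.md governs the actual computation.

```python
def coverage_adjusted(verified_composite: float, measured_coverage: float) -> float:
    # Assumption: the adjustment scales the verified composite by the fraction
    # of scoring weight actually measured. Not confirmed by this diff.
    return round(verified_composite * measured_coverage, 1)
```

Under this reading, the leaderboard row (64.4 coverage-adjusted at 78.9% coverage) would imply a raw verified composite near 81.6, but that figure is an inference, not a published value.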
@@ -131,6 +140,9 @@ python -m harness.runner \
 
 The harness produces a terminal summary table, a machine-readable JSON file, a Markdown report, and a self-contained HTML report with score bars and tier badges.
 
+Runtime requirement:
+- Python 3.10+ (the harness uses modern type syntax)
+
 ---
 
 ## How It Works
@@ -144,7 +156,7 @@ The harness talks to any runtime via HTTP:
 | `/internal/diagnostics` | GET | Runtime introspection | Yes |
 | `/metrics` | GET | Prometheus metrics | Optional |
 
-Each dimension script sends real requests to the runtime, parses responses, and scores based on the [SPECIFICATION](SPECIFICATION.md). No mocks. No synthetic benchmarks. Real agent behavior.
+Each dimension script sends real requests to the runtime, parses responses, and scores based on the [SPECIFICATION](SPECIFICATION.md). Measured and projected components are surfaced separately.
 
 ---
 
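The endpoint table above can be exercised with a tiny client. A sketch that only builds the request object, assuming bearer-token auth (the header format is not shown in this diff):

```python
from urllib.request import Request

def diagnostics_request(base_url: str, token: str) -> Request:
    # The endpoint path comes from the table above; the Authorization header
    # format is an assumption, not confirmed by the diff.
    return Request(
        f"{base_url}/internal/diagnostics",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```

Pass the returned object to `urllib.request.urlopen` when a runtime is actually listening on `base_url`.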
@@ -180,8 +192,10 @@ DTaaS-benchmark/
 │   ├── attack_payloads.json     Path traversal + SSRF inputs
 │   └── schedules.json           Task scheduling definitions
 └── results/
-    ├── nullalis-v0.1.json       Machine-readable reference results
-    └── nullalis-v0.1-report.md  Human-readable reference report
+    ├── nullalis-v0.1.json       Legacy v0.1 reference artifact
+    ├── nullalis-v0.1-report.md  Legacy v0.1 report
+    ├── nullalis-v0.2.json       v0.2 live gateway artifact
+    └── nullalis-v0.2-report.md  v0.2 live gateway report
 ```
 
 ---
@@ -201,6 +215,5 @@ Apache-2.0 — See [LICENSE](LICENSE).
 <p align="center">
 <strong>Published by <a href="https://novanuggets.com">Nova Nuggets</a></strong><br>
 Handcrafted Intelligence. Own Your AI.<br><br>
-<strong>Nullalis</strong> (ZAKI BOT) is the reference implementation for DTaaS-Bench<br>
-and the first runtime to achieve <strong>Category Leader</strong> (87/100).
+<strong>Nullalis</strong> (ZAKI BOT) is the current reference implementation for DTaaS-Bench.
 </p>
