Commit 77034d4

Harden benchmark evidence and Nullalis local runner flow
1 parent 07440ed, commit 77034d4

44 files changed

Lines changed: 3332 additions & 282 deletions

.github/ISSUE_TEMPLATE/submit-results.md

Lines changed: 34 additions & 14 deletions
@@ -11,39 +11,59 @@ labels: results
 - **Version**:
 - **URL/Repo**:
 - **LLM Provider Used**:
+- **Runtime Commit SHA**:
+- **Harness Commit SHA**:
 
 ## Benchmark Run
 
-- **DTaaS-Bench Version**: 0.1
+- **DTaaS-Bench Version**: 0.2
 - **Date**:
 - **Platform**:
 - **Harness Command Used**:
 
 ```bash
-python harness/runner.py --url <URL> --token <TOKEN> --user-id <ID> --name "<Name>"
+python -m harness.runner --url <URL> --token <TOKEN> --user-id <ID> --name "<Name>" --output results/<runtime>.json --markdown results/<runtime>.md --html results/<runtime>.html
 ```
 
 ## Results
 
-**Composite Score**: /100
+**Verified Composite Score**: /100
+**Projected Composite Score**: /100
+**Measured Coverage**:
+**Coverage-Adjusted Verified Score**: /100
 **Rating**:
 
 ### Dimension Scores
 
-| Dimension | Score |
-|-----------|-------|
-| Autonomy Control | |
-| Memory Persistence | |
-| Autonomous Execution | |
-| Cross-Channel Consistency | |
-| Integration Breadth | |
-| Security & Privacy | |
-| Scale & Cost Efficiency | |
-| Operational Resilience | |
-| Latency Profile | |
+| Dimension | Verified | Projected | Coverage |
+|-----------|----------|-----------|----------|
+| Autonomy Control | | | |
+| Memory Persistence | | | |
+| Functional Capability | | | |
+| Autonomous Execution | | | |
+| Cross-Channel Consistency | | | |
+| Integration Breadth | | | |
+| Security & Privacy | | | |
+| Scale & Cost Efficiency | | | |
+| Operational Resilience | | | |
+| Latency Profile | | | |
+
+## Evidence Pack
+
+- **Diagnostics attached**: yes/no
+- **Metrics attached**: yes/no
+- **Run manifest attached**: yes/no
+- **Any runtime/provider incident during run**:
+- **Incident attribution**: runtime / upstream dependency / network / unknown
+- **Notes on projected components**:
 
 ## Attachments
 
 Please attach:
 - [ ] `results/<runtime>.json` output from the harness
+- [ ] `results/<runtime>.md` or `results/<runtime>.html`
 - [ ] Any notes on configuration or setup
+- [ ] `/internal/diagnostics` snapshot if available
+- [ ] `/metrics` snapshot if available
+- [ ] Any incident notes or representative error messages if the run degraded
+- [ ] Optional run manifest based on `docs/run-manifest-v0.2.example.json`
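
Reviewer note: the attachment checklist above can be pre-checked locally before opening an issue. The following is a minimal sketch, not part of the commit. The function name `missing_attachments` is hypothetical, and it assumes the harness wrote its outputs to the paths given in the template command.

```python
from pathlib import Path


def missing_attachments(results_dir: Path, runtime: str) -> list[str]:
    """Return the requested result files that are not present on disk."""
    missing = []
    # The JSON artifact is always requested by the template.
    if not (results_dir / f"{runtime}.json").is_file():
        missing.append(f"{runtime}.json")
    # Either the Markdown or the HTML report satisfies the second checkbox.
    reports = [results_dir / f"{runtime}.md", results_dir / f"{runtime}.html"]
    if not any(path.is_file() for path in reports):
        missing.append(f"{runtime}.md or {runtime}.html")
    return missing
```

An empty list means the two file-based checkboxes can be ticked. The diagnostics, metrics, and incident items still need a manual check.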

CHANGELOG.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+# Changelog
+
+All notable changes to DTaaS-Bench should be documented in this file.
+
+The format is intentionally simple and release-oriented.
+
+## [0.2.1] - 2026-03-24
+
+### Added
+
+- Trust model documentation in `docs/TRUST_MODEL.md`
+- Nullalis reference-runtime explainer in `docs/NULLALIS_REFERENCE_RUNTIME.md`
+- Example run manifest in `docs/run-manifest-v0.2.example.json`
+- Evidence parsing helpers in `harness/evidence.py`
+- Open-source release checklist in `docs/OPEN_SOURCE_RELEASE_CHECKLIST.md`
+
+### Changed
+
+- Updated result submission template to require v0.2-style verified/projected reporting
+- Updated README and CONTRIBUTING flow around trusted evidence packs
+- Hardened chat transport for runtimes that require explicit `session_key`
+- Isolated harness chat sessions per dimension
+- Improved autonomy scoring to use diagnostics and metrics more directly
+- Improved execution scoring with diagnostics-backed scheduler confirmation
+- Improved breadth scoring with runtime_info-backed structured counts
+- Reduced cross-channel reliance on generic assistant self-report
+
+### Notes
+
+- This release is focused on benchmark trust, open-source readiness, and compatibility with the current Nullalis gateway contract.

CONTRIBUTING.md

Lines changed: 12 additions & 1 deletion
@@ -6,7 +6,7 @@ We welcome contributions from the community. Here's how you can help.
 
 The most impactful contribution is running DTaaS-Bench against your runtime and submitting verified results.
 
-1. Run the harness: `python3.10 -m harness.runner --url YOUR_URL --token YOUR_TOKEN --user-id 1 --name "Your Runtime" --output results/your-runtime.json`
+1. Run the harness: `python3.10 -m harness.runner --url YOUR_URL --token YOUR_TOKEN --user-id 1 --name "Your Runtime" --output results/your-runtime.json --markdown results/your-runtime.md --html results/your-runtime.html`
 2. Open an issue using the **Submit Results** template
 3. Attach your `results/your-runtime.json`
 
@@ -18,6 +18,17 @@ Required v0.2 fields for leaderboard eligibility:
 - `measured_coverage`
 - `coverage_adjusted_verified_score`
 
+Recommended submission attachments:
+- `results/your-runtime.md` or `results/your-runtime.html`
+- Runtime version and commit SHA
+- Harness commit SHA
+- `/internal/diagnostics` snapshot
+- `/metrics` snapshot when available
+- brief incident note if the runtime, network, or model provider degraded during the run
+- Optional run manifest based on `docs/run-manifest-v0.2.example.json`
+
+Please also read `docs/TRUST_MODEL.md` before submitting. The goal is not just a high score; it is a result other builders will trust.
+
 ## Propose New Dimensions
 
 If you believe an important DTaaS capability is not covered by the current 10 dimensions, open an issue with:
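
Reviewer note: the second hunk shows only the tail of the "Required v0.2 fields" list, so just two field names are visible here. The sketch below checks a result artifact for those two names. It is illustrative only: the helper name is hypothetical, the tuple is deliberately incomplete, and it assumes the fields sit at the top level of the JSON object, which this diff does not confirm.

```python
import json

# Only the two field names visible in this hunk; the full list is cut off above it.
VISIBLE_REQUIRED_FIELDS = ("measured_coverage", "coverage_adjusted_verified_score")


def missing_v02_fields(artifact_text: str) -> list[str]:
    """Return the visible required field names absent from a result artifact."""
    artifact = json.loads(artifact_text)
    # Assumption: required fields are top-level keys of the artifact object.
    return [name for name in VISIBLE_REQUIRED_FIELDS if name not in artifact]
```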

README.md

Lines changed: 51 additions & 2 deletions
@@ -5,6 +5,8 @@
 </p>
 <p align="center">
 <a href="SPECIFICATION.md">Specification</a> &middot;
+<a href="docs/TRUST_MODEL.md">Trust Model</a> &middot;
+<a href="CHANGELOG.md">Changelog</a> &middot;
 <a href="results/nullalis-v0.2-report.md">Results</a> &middot;
 <a href="CONTRIBUTING.md">Contribute</a> &middot;
 <a href="https://novanuggets.com">Nova Nuggets</a>
@@ -19,6 +21,8 @@
 
 DTaaS-Bench does. 10 dimensions. Two explicit composites (`verified` + `projected`). Open harness. Run it against your runtime today.
 
+This repo is also an argument: there is a real category between "chatbot" and "agent framework" that deserves its own evaluation standard. We call that category `Digital Twin as a Service` (`DTaaS`).
+
 ---
 
 ## Verified Leaderboard (March 2026)
@@ -29,6 +33,8 @@ Only runs produced by this harness with raw artifacts are ranked here.
 |---------|------|------------------------------|----------|-----------|--------|
 | **[Nullalis v0.2 live artifact](results/nullalis-v0.2-report.md)** | Full runtime | **64.4** | **78.9%** | **79.8** | **Competitive** |
 
+Nullalis is the first reference runtime in this repository because it demonstrates the full stack this benchmark is naming: persistent memory, autonomous execution, multi-channel operation, and background-turn controls. Read [why that matters](docs/NULLALIS_REFERENCE_RUNTIME.md).
+
 ## External Comparison (Unverified, not ranked)
 
 External runtime rows are reference estimates from public documentation and are intentionally excluded from the verified leaderboard.
@@ -50,6 +56,16 @@ Submit a harness artifact to replace any estimate with a verified score.
 - CI now validates harness integrity and v0.2 artifact schema on every PR.
 - Latest live gateway run completed in 4392.9s (no per-chat timeout mode) and reports full measured timing metadata in the artifact.
 
+## Why This Repo Exists
+
+DTaaS-Bench is trying to do three jobs at once:
+
+- define the `DTaaS` runtime category clearly enough that people can argue about it in public
+- give builders a benchmark they can actually run against local and SaaS deployments
+- create a trusted submission standard so the leaderboard is earned rather than narrated
+
+Nullalis is important here because it is a strong reference runtime, not because it gets special treatment. The benchmark should become more valuable as more runtimes submit verified artifacts and try to beat it.
+
 ---
 
 ## Why DTaaS Needs Its Own Benchmark
@@ -104,6 +120,9 @@ Every run reports:
 
 The benchmark never promotes an unverified external estimate to leaderboard status.
 
+For the full evidence model, read [docs/TRUST_MODEL.md](docs/TRUST_MODEL.md).
+For nearby runtime compatibility status, read [docs/COMPETITOR_RUNNABILITY.md](docs/COMPETITOR_RUNNABILITY.md).
+
 ---
 
 ## Quick Start
@@ -136,10 +155,21 @@ python3.10 -m harness.runner \
 --output results/run.json \
 --markdown results/run.md \
 --html results/run.html
+
+# Local Nullalis: discover the active token from ~/.nullalis/config.json
+python3.10 -m harness.runner \
+--url http://127.0.0.1:3000 \
+--token-from-nullalis-config \
+--user-id 1 \
+--name "Nullalis Local"
 ```
 
 The harness produces a terminal summary table, a machine-readable JSON file, a Markdown report, and a self-contained HTML report with score bars and tier badges.
 
+Useful local tuning:
+- `--memory-sample-size 5` for faster validation reruns
+- `--timeout 20 --timeout-dynamic --timeout-floor 20 --timeout-ceiling 120 --timeout-grace 10` for bounded local gateway runs
+
 Runtime requirement:
 - Python 3.10+ (the harness uses modern type syntax)
 
@@ -164,21 +194,38 @@ Each dimension script sends real requests to the runtime, parses responses, and
 
 Run the benchmark. Submit your score. Join the leaderboard.
 
-1. `python -m harness.runner --url YOUR_URL --token TOKEN --user-id 1 --name "Your Runtime" --output results/your-runtime.json`
+1. `python -m harness.runner --url YOUR_URL --token TOKEN --user-id 1 --name "Your Runtime" --output results/your-runtime.json --markdown results/your-runtime.md --html results/your-runtime.html`
 2. Open an issue using the [Submit Results template](.github/ISSUE_TEMPLATE/submit-results.md)
-3. Attach your JSON output
+3. Attach your JSON output and recommended evidence pack
 
 All submitted results are published with full transparency. No editorial gatekeeping.
 
+Recommended evidence pack:
+
+- result JSON
+- Markdown or HTML report
+- runtime version or commit SHA
+- harness commit SHA
+- diagnostics snapshot
+- metrics snapshot
+- incident notes when the runtime or upstream model path degraded during the run
+- optional run manifest using [docs/run-manifest-v0.2.example.json](docs/run-manifest-v0.2.example.json)
+
 ---
 
 ## Repository Structure
 
 ```
 DTaaS-benchmark/
 ├── README.md You are here
+├── CHANGELOG.md Release history
 ├── SPECIFICATION.md Full benchmark spec (10 dimensions, scoring, rules)
 ├── CONTRIBUTING.md How to contribute
+├── docs/
+│ ├── TRUST_MODEL.md Verification rules and evidence tiers
+│ ├── NULLALIS_REFERENCE_RUNTIME.md
+│ ├── OPEN_SOURCE_RELEASE_CHECKLIST.md
+│ └── run-manifest-v0.2.example.json
 ├── LICENSE Apache-2.0
 ├── harness/
 │ ├── runner.py CLI entry point
@@ -204,6 +251,8 @@ DTaaS-benchmark/
 
 [SPECIFICATION.md](SPECIFICATION.md) contains everything: dimension definitions, test protocols, scoring formulas, rules for fair comparison, and the complete methodology.
 
+If you want the short version of what counts as a trustworthy result, start with [docs/TRUST_MODEL.md](docs/TRUST_MODEL.md).
+
 ---
 
 ## License
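
Reviewer note: the README's "Useful local tuning" line names the flags `--timeout-dynamic`, `--timeout-floor`, `--timeout-ceiling`, and `--timeout-grace`, but this diff does not show how the harness combines them. The sketch below is one plausible reading, offered only to make the intended bounds concrete: take the slowest observed response, add the grace period, and clamp the result between floor and ceiling. The function name and the use of observed latencies are assumptions, not the harness's implementation.

```python
def effective_timeout(
    observed_s: list[float],
    base: float = 20.0,
    floor: float = 20.0,
    ceiling: float = 120.0,
    grace: float = 10.0,
) -> float:
    """Hypothetical per-request timeout: slowest observation plus grace, clamped."""
    # With no observations yet, fall back to the static --timeout value.
    candidate = base if not observed_s else max(observed_s) + grace
    # Never go below the floor or above the ceiling.
    return min(max(candidate, floor), ceiling)
```

Under this reading, a 50 s slowest response yields a 60 s timeout, and anything slower than 110 s is capped at the 120 s ceiling.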
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+{"error":"unauthorized"}
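
Reviewer note: the file added above contains only an error body, which suggests the snapshot request was rejected and not that diagnostics were captured. A minimal sketch of a guard that flags such files before they are attached as evidence follows. The helper name is hypothetical, and treating "a JSON object whose only key is `error`" as a failed capture is an assumption.

```python
import json


def is_error_payload(text: str) -> bool:
    """Return True if the text is a JSON object whose only key is "error"."""
    try:
        body = json.loads(text)
    except json.JSONDecodeError:
        # Not JSON at all; a different check should handle that case.
        return False
    return isinstance(body, dict) and set(body) == {"error"}
```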

artifacts/nullalis-diagnostics-post-rerun.json

Whitespace-only changes.

artifacts/nullalis-diagnostics-post.json

Whitespace-only changes.
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+{"gateway":{"requests_total":461,"in_flight_requests":1,"drain_rejected_total":0,"overload_rejected_total":0,"draining":false},"runtime_mode":"threaded","internal_auth_required":false,"internal_token_configured":true,"internal_token_policy_ok":true,"internal_token_policy_reason":null,"instance_id":"pid-65213-1774391416","owned_users_count":0,"tenant_lock_backend":"postgres_lease","tenant_lock_lease_secs":90,"tenant_lock_wait_ms":500,"tenant_lock_retry_min_ms":25,"tenant_lock_retry_max_ms":80,"tenant_lock_conflict_retries_total":0,"tenant_runtime_policy_attached":true,"effective_config_source":"postgres_user_config","effective_config_hash":"cdc96b6f7af70003","memory_search_enabled":true,"memory_summarizer_enabled":true,"provider_retries":2,"fallback_provider_count":1,"memory_vector_sync_mode":"best_effort","memory_outbox_enabled":false,"configured_config_source":"postgres_user_config","configured_memory_search_enabled":true,"configured_memory_summarizer_enabled":true,"configured_provider_retries":2,"configured_fallback_provider_count":1,"configured_memory_vector_sync_mode_requested":"best_effort","configured_memory_outbox_requested":false,"control_plane":{"configured_config_source":"postgres_user_config","effective_config_source":"postgres_user_config","configured_config_hash":"cdc96b6f7af70003","effective_config_hash":"cdc96b6f7af70003","configured_assistant_mode":"balanced","effective_assistant_mode":"balanced","configured_ignored_tenant_override_count":4,"effective_ignored_tenant_override_count":4,"controls":{"agent.parallel_tools":{"configured":true,"effective":true,"owner":"operator","source":"postgres_user_config","drift":false},"agent.parallel_tools_rollout_percent":{"configured":100,"effective":100,"owner":"operator","source":"postgres_user_config","drift":false},"agent.tool_dispatcher":{"configured":"parallel","effective":"parallel","owner":"operator","source":"postgres_user_config","drift":false},"memory.reliability.rollout_mode":{"configured":"on","effective":"on","owner":"operator","source":"postgres_user_config","drift":false},"memory.reliability.shadow_hybrid_percent":{"configured":0,"effective":0,"owner":"operator","source":"postgres_user_config","drift":false},"memory.reliability.canary_hybrid_percent":{"configured":0,"effective":0,"owner":"operator","source":"postgres_user_config","drift":false},"memory.reliability.fallback_policy":{"configured":"degrade","effective":"degrade","owner":"operator","source":"postgres_user_config","drift":false},"memory.search.enabled":{"configured":true,"effective":true,"owner":"operator","source":"postgres_user_config","drift":false},"memory.search.sync.mode":{"configured":"best_effort","effective":"best_effort","owner":"operator","source":"postgres_user_config","drift":false},"memory.summarizer.enabled":{"configured":true,"effective":true,"owner":"operator","source":"postgres_user_config","drift":false},"memory.summarizer.window_size_tokens":{"configured":4000,"effective":4000,"owner":"operator","source":"postgres_user_config","drift":false},"memory.summarizer.summary_max_tokens":{"configured":500,"effective":500,"owner":"operator","source":"postgres_user_config","drift":false},"memory.summarizer.auto_extract_semantic":{"configured":true,"effective":true,"owner":"operator","source":"postgres_user_config","drift":false},"tenant.identity_mapping_enforcement":{"configured":"staged_strict","effective":"staged_strict","owner":"operator","source":"postgres_user_config","drift":false},"tenant.identity_mapping_strict_channels":{"configured":["telegram"],"effective":["telegram"],"owner":"operator","source":"postgres_user_config","drift":false},"gateway.require_explicit_chat_stream_session_key":{"configured":true,"effective":true,"owner":"operator","source":"postgres_user_config","drift":false},"session.cross_channel_shared_main":{"configured":false,"effective":false,"owner":"operator","source":"postgres_user_config","drift":false}}},"sandbox_workspace_validation_failed_total":0,"sandbox_fallback_none_total":0,"sandbox_workspace_validation_last_reason":"none","chat_stream_require_explicit_session_key":true,"chat_stream_lane_counts":{"main":1,"thread":164,"task":0,"cron":0},"chat_stream_session_key_rejections":{"missing":0,"invalid":0,"wrong_user":0,"invalid_lane":0},"background_main_reroutes_total":0,"background_main_reroutes_last_job_id":null,"tenant_lock_conflicts_by_route":{"chat_stream_sse":0,"chat_stream_http":0,"webhook":0,"daemon":0,"api":0},"tenant_lease_probe":{"user_id":"1","data_source":"postgres_lease","owner_id":null,"lease_until_s":null,"updated_at_s":null},"bus":{"inbound_len":0,"outbound_len":0,"capacity":1024},"pool":{"hits":0,"misses":0,"idle":0},"startup_self_check":{"config_path":"/Users/nova/.nullalis/config.json","state_backend_configured":"postgres","state_backend_effective":"postgres","heartbeat_enabled":true,"heartbeat_interval_minutes":60,"tenant_enabled":true,"degraded":false,"degraded_reason":"","postgres_host":"127.0.0.1","postgres_port":5432,"postgres_schema":"zaki_bot","scheduler_backend":"postgres","webhook_mode":"none","chat_provider_effective":"together","chat_fallback_chain":"openrouter","embedding_provider_effective":"together","provider_data_source":"config"},"heartbeat_wake":{"pending":0,"dropped_total":0,"coalesced_total":0},"last_trigger":null,"heartbeat_runtime":{"user_id":"1","available":true,"last_run_s":1774394928,"last_status":"idle","last_reason":"no_actionable_output"},"integrations":{"telegram":{"configured":true,"connected":true,"account_id":"default","chat_id":1110331014,"data_source":"postgres"}},"identity_mapping":{"mapped":0,"unmapped":0,"strict_rejected":0,"degraded_compat":0,"cache_hit":0,"cache_miss":0,"cache_stale":0,"db_lookup_count":0,"db_lookup_ms_total":0},"stt":{"transcriber_configured":0,"transcription_attempted":0,"transcription_succeeded":0,"transcription_failed":0,"transcription_skipped_no_transcriber":0,"failure_get_file":0,"failure_download":0,"failure_transcriber":0,"failure_empty_transcript":0},"multimodal":{"image_markers_detected":0,"messages_with_image_markers":0,"image_parts_prepared":0,"image_parts_failed":0,"image_markers_ignored":0},"ops":{"proactive_sent_total":0,"proactive_blocked_rate_total":0,"proactive_blocked_dedupe_total":0,"proactive_send_errors_total":0,"scheduler_executed_total":0,"scheduler_blocked_burst_total":0,"scheduler_blocked_cooldown_total":0,"proactive_policy":{"dedupe_window_secs":120,"rate_window_secs":300,"rate_limit_per_window":12},"last_event":null,"recent_events":[]}}
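
Reviewer note: a snapshot like the one above is hard to read as a single line. The following sketch pulls out a few fields that seem relevant to benchmark trust, using only key names that appear in this artifact. The function name and the choice of fields are illustrative, and this is not the logic in `harness/evidence.py`, which this part of the diff does not show.

```python
import json


def summarize_diagnostics(text: str) -> dict:
    """Extract a small trust-oriented summary from a diagnostics snapshot."""
    diag = json.loads(text)
    controls = diag.get("control_plane", {}).get("controls", {})
    # A control has drifted when its configured and effective values disagree.
    drifted = sorted(name for name, ctl in controls.items() if ctl.get("drift"))
    rejections = diag.get("chat_stream_session_key_rejections", {})
    return {
        "degraded": diag.get("startup_self_check", {}).get("degraded", False),
        "drifted_controls": drifted,
        "session_key_rejections": sum(rejections.values()),
        "lane_counts": diag.get("chat_stream_lane_counts", {}),
    }
```

For the snapshot above, every control reports `"drift":false`, all four rejection counters are 0, `degraded` is false, and the lane counts are 1 `main`, 164 `thread`, 0 `task`, and 0 `cron`.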
