Asha 2.0: public benchmark results

Reproducible artifacts for the headline benchmark claims about DNAi Systems' medical AI agent Asha (live at askasha.org). Each published number is recomputed from per-turn or per-item records by a self-contained script in this repo. The upstream dataset is SHA-256 locked.

Preliminary results; full benchmarking and publications pending industrial product testing.

Headline

Benchmark	Result	Config	Source artifact
Psychosis-bench HES (harm enablement; lower safer)	0.000 (POST) vs 0.064 (PRE)	16 scenarios / 192 turns, judge `gpt-4o-mini`, PRE vs POST same bench	`psychosis-bench-2026-06-12/results/aggregate_POST.json`, `aggregate_PRE.json`
Psychosis-bench SIS (safe-intervention rate; higher safer)	1.000 (96/96, POST) vs 0.979 (92/94, PRE)	same as above	same
Psychosis-bench DCS (delusion confirmation; lower safer)	0.069 (POST) vs 0.085 (PRE)	same as above	same
HealthBench (physician-graded clinical quality)	66.4% +/- 0.4%	live engine, 4,996 of the 5,000-question set, 57,185 rubrics	`healthbench-2026-06-09/results/healthbench_summary.json`
KaBLE theory-of-mind	97.7% (215/220)	live engine, 6 ToM tasks, 40/task (Recursive 20)	`kable-tom-2026-06-11/results/kable_per_task.json`

Every value above is reproduced from the published records by the scripts that `reproduce.sh` runs.
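As a quick sanity check on the table's arithmetic, the fractions quoted in the Headline rows can be recomputed by hand. This is a minimal sketch, not one of the repo's scripts: the counts are copied from the table above, and the names `counts`, `rate`, and `kable_items` are hypothetical. It does not read any per-turn or per-item records.

```python
# Counts copied from the Headline table; nothing here reads the repo's records.
counts = {
    "KaBLE accuracy": (215, 220),
    "SIS POST": (96, 96),
    "SIS PRE": (92, 94),
}


def rate(numerator, denominator, places=3):
    """Return numerator/denominator rounded to the table's precision."""
    return round(numerator / denominator, places)


# KaBLE item count implied by the config cell:
# five tasks at 40 items each, plus the Recursive task at 20.
kable_items = 5 * 40 + 20

rates = {name: rate(n, d) for name, (n, d) in counts.items()}

for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

The rounded values agree with the table (0.977, 1.000, 0.979), and the KaBLE denominator matches the stated task sizes.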

Same backbone LM, different behavior

The psychosis-bench PRE and POST arms run the same backbone language model. PRE is the baseline engine; POST is Asha 2.0 with an added containment layer. On the identical 192-turn bench under the same judge, harm enablement falls from 0.064 to 0.000 and the safe-intervention rate rises from 0.979 to 1.000. The behavior delta is attributed to the cognition stack, which this repo treats as a black box. The containment layer's internals are out of scope for this repo; the published claim is the measured effect.

What is and is not claimed

The psychosis PRE/POST comparison is a single-judge, single-run measurement on identical scenarios, reported as an engineering result. The dual-judge reliability caveat from the earlier evaluation is carried forward honestly in PREREGISTRATION/README.md: the literal kappa gate failed (verdict INCONCLUSIVE), both judges agreed on direction, and the within-experiment comparison uses the protocol-specified judge.
HealthBench and KaBLE are each a single live-engine configuration and run, labeled with exact config and honest n. KaBLE here is the live-engine forced-choice accuracy on 6 ToM tasks; it is not a zero-LLM symbolic-only measurement and not a 13-task full-set run.
No cross-paper baseline comparison is made.
The psychiatric-safety effect is described as conversational psychiatric-safety / delusion-confirmation reduction. "Antipsychotic" is not used.
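For readers unfamiliar with the kappa gate mentioned in the dual-judge caveat, the statistic involved is an inter-rater agreement measure. The sketch below computes Cohen's kappa for two judges' labels on the same turns. It is an illustration only: this README does not state the gate's threshold, the label schema, or which kappa variant the protocol uses, so the function name `cohen_kappa` and the toy labels are assumptions, not the repo's data or method.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined (0/0), and this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels, made up for illustration (1 = flagged, 0 = not flagged).
judge_a = [1, 1, 0, 0]
judge_b = [1, 1, 0, 1]
print(f"kappa = {cohen_kappa(judge_a, judge_b):.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why two judges can agree on the direction of an effect while still falling short of a fixed kappa threshold.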

Layout

README.md                          this file
LICENSE                            Apache 2.0 (analysis scripts + computed reports)
reproduce.sh                       recompute every published number
PREREGISTRATION/                   pre-registration carry-forward and disclosure discipline
psychosis-bench-2026-06-12/        PRE vs POST safety bench
healthbench-2026-06-09/            physician-graded clinical quality
kable-tom-2026-06-11/              theory-of-mind
figures/                           reserved

Reproduce

bash reproduce.sh

Or per benchmark:

# Psychosis-bench
python3 psychosis-bench-2026-06-12/scripts/verify_scenarios_sha.py
python3 psychosis-bench-2026-06-12/scripts/compute_stats.py

# KaBLE
python3 kable-tom-2026-06-11/scripts/compute_kable.py

# HealthBench
python3 healthbench-2026-06-09/scripts/verify_summary.py

SHA locks

Dataset	SHA-256	Bytes
Au Yeung 2025 psychosis-bench scenarios	`d9b7820c0bebb6ec845e5825378535e8e35b79b0244a72be64ec5e49d8da439f`	30,544
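The lock can be checked independently of the repo's `verify_scenarios_sha.py`. The following is a minimal sketch under stated assumptions: the digest and byte count are copied from the table above, while the helper names `sha256_of` and `matches_lock` are hypothetical, and the path to the scenarios file is left to the caller because this README does not give it.

```python
import hashlib

# Values copied from the SHA locks table.
EXPECTED_SHA256 = "d9b7820c0bebb6ec845e5825378535e8e35b79b0244a72be64ec5e49d8da439f"
EXPECTED_BYTES = 30544


def sha256_of(path, chunk_size=65536):
    """Return (hex digest, byte count) for the file at path."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def matches_lock(path, expected_sha=EXPECTED_SHA256, expected_bytes=EXPECTED_BYTES):
    """True only if both the digest and the byte count match the lock."""
    actual_sha, actual_bytes = sha256_of(path)
    return actual_sha == expected_sha and actual_bytes == expected_bytes
```

Calling `matches_lock(path_to_scenarios_file)` returns `True` only when both values match. Checking the byte count alongside the digest is redundant for integrity, but it makes a truncated download easy to spot.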

Reproducibility contract

Every number reproduces from input SHA + per-turn / per-item records + an analysis script in this repo. No hidden post-processing.
Honest n. HealthBench is reported on the 4,996 questions that completed grading out of a 5,000-question target.
Honest caveats. The psychosis-bench kappa-gate failure is stated, not buried.
Preprint discipline. Upstream Au Yeung 2025 is named as a preprint.

Patent status

Asha's architecture is the subject of filed and pending intellectual property. Public-safe references only:

Cognitive inferencing architecture for LLMs (allowed US non-provisional, 19/290,471).
Continuation broadening the allowed cognitive-inferencing claims (1766-001USC1, filed Feb 10, 2026).
Lawful-axiom framework for knowledge governance (provisional, pending).
Cryptographically auditable knowledge-unit architecture (provisional, pending).
Deterministic post-emission correction for regulated structured outputs (provisional, pending).

Citation

Au Yeung, J., Dalmasso, J., Foschini, L., Dobson, R. J. B., & Kraljevic, Z. (2025). The Psychogenic Machine: Simulating AI Psychosis, Delusion Reinforcement and Harm Enablement in Large Language Models. arXiv:2509.10970v2 [preprint, not peer-reviewed].

Contact

founders@dnai.systems
