Reproducible artifacts for the headline benchmark claims about DNAi Systems' medical AI agent Asha (live at askasha.org). Each published number is recomputed from per-turn or per-item records by a self-contained script in this repo. The upstream dataset is SHA-256 locked.
Preliminary results; full benchmarking and publications pending industrial product testing.
| Benchmark | Result | Config | Source artifact |
|---|---|---|---|
| Psychosis-bench HES (harm enablement; lower safer) | 0.000 (POST) vs 0.064 (PRE) | 16 scenarios / 192 turns, judge gpt-4o-mini, PRE vs POST same bench | psychosis-bench-2026-06-12/results/aggregate_POST.json, aggregate_PRE.json |
| Psychosis-bench SIS (safe-intervention rate; higher safer) | 1.000 (96/96, POST) vs 0.979 (92/94, PRE) | same as above | same |
| Psychosis-bench DCS (delusion confirmation; lower safer) | 0.069 (POST) vs 0.085 (PRE) | same as above | same |
| HealthBench (physician-graded clinical quality) | 66.4% +/- 0.4% | live engine, 4,996 of 5,000-q set, 57,185 rubrics | healthbench-2026-06-09/results/healthbench_summary.json |
| KaBLE theory-of-mind | 97.7% (215/220) | live engine, 6 ToM tasks, 40/task (Recursive 20) | kable-tom-2026-06-11/results/kable_per_task.json |
Every value above is recomputed from the published records by the scripts that reproduce.sh runs.
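As a quick arithmetic cross-check, the ratio-style entries in the table can be recomputed from the counts printed next to them. The sketch below is illustrative only: it is not one of the repo's scripts, the dictionary keys and function names are made up for this example, and it uses only the counts stated in the table rather than the per-turn or per-item records.

```python
# Sanity check of the ratio-style table entries using the counts shown in the table.
# Names (REPORTED, rate, check_all) are hypothetical, not taken from the repo's scripts.

# name -> (numerator, denominator, value as printed in the table, as a 3-decimal string)
REPORTED = {
    "psychosis_sis_post": (96, 96, "1.000"),
    "psychosis_sis_pre": (92, 94, "0.979"),
    "kable_tom": (215, 220, "0.977"),  # published as 97.7%
}

# KaBLE item count: 6 tasks at 40 items each, except the Recursive task at 20.
KABLE_TOTAL = 5 * 40 + 20


def rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, rejecting an empty denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


def check_all(reported: dict) -> dict:
    """Map each entry name to True when the recomputed rate matches the printed value."""
    return {
        name: f"{rate(num, den):.3f}" == printed
        for name, (num, den, printed) in reported.items()
    }


if __name__ == "__main__":
    for name, ok in check_all(REPORTED).items():
        print(name, "OK" if ok else "MISMATCH")
    print("KaBLE total items:", KABLE_TOTAL)
```

The HES, DCS, and HealthBench figures are not simple count ratios in the table, so they are left to the repo's own scripts.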
The psychosis-bench PRE and POST arms run the same backbone language model. PRE is the baseline engine; POST is Asha 2.0 with an added containment layer. On the identical 192-turn bench under the same judge, harm enablement falls from 0.064 to 0.000 and the safe-intervention rate rises from 0.979 to 1.000. Because the backbone, scenarios, and judge are held fixed across arms, the behavior delta is attributed to the cognition stack, treated as a black box. The containment layer's internals are out of scope for this repo; the published claim is the measured effect.
- The psychosis PRE/POST comparison is a single-judge, single-run measurement on identical scenarios, reported as an engineering result. The dual-judge reliability caveat from the earlier evaluation is carried forward honestly in PREREGISTRATION/README.md: the literal kappa gate failed (verdict INCONCLUSIVE), both judges agreed on direction, and the within-experiment comparison uses the protocol-specified judge.
- HealthBench and KaBLE are each a single live-engine configuration and run, labeled with exact config and honest n. KaBLE here is the live-engine forced-choice accuracy on 6 ToM tasks; it is not a zero-LLM symbolic-only measurement and not a 13-task full-set run.
- No cross-paper baseline comparison is made.
- The psychiatric-safety effect is described as conversational psychiatric-safety / delusion-confirmation reduction. "Antipsychotic" is not used.
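
For readers unfamiliar with a kappa gate, the sketch below shows how inter-judge agreement is commonly scored. It assumes the gate is based on Cohen's kappa between two judges; the source does not state which kappa variant, label set, or threshold the pre-registration uses. The toy labels and the 0.7 threshold are invented for illustration and are not data or settings from this repo.

```python
# Illustrative Cohen's kappa between two judges (assumed variant; not the repo's code).
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both judges gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each judge's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def gate_verdict(kappa: float, threshold: float) -> str:
    """Return PASS when kappa meets the threshold, otherwise INCONCLUSIVE."""
    return "PASS" if kappa >= threshold else "INCONCLUSIVE"


# Toy labels (hypothetical): 1 = judge flagged the turn, 0 = judge did not.
JUDGE_A = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
JUDGE_B = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
HYPOTHETICAL_THRESHOLD = 0.7  # assumption for this example only

if __name__ == "__main__":
    kappa = cohens_kappa(JUDGE_A, JUDGE_B)
    print(f"kappa = {kappa:.3f}")  # prints "kappa = 0.600"
    print(gate_verdict(kappa, HYPOTHETICAL_THRESHOLD))  # prints "INCONCLUSIVE"
```

In the toy data the judges agree on 8 of 10 items, yet kappa is only 0.6 because half of that agreement is expected by chance. This is how two judges can agree on direction while a strict gate still fails.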

| Path | Contents |
|---|---|
| README.md | this file |
| LICENSE | Apache 2.0 (analysis scripts + computed reports) |
| reproduce.sh | recompute every published number |
| PREREGISTRATION/ | pre-registration carry-forward and disclosure discipline |
| psychosis-bench-2026-06-12/ | PRE vs POST safety bench |
| healthbench-2026-06-09/ | physician-graded clinical quality |
| kable-tom-2026-06-11/ | theory-of-mind |
| figures/ | reserved |

bash reproduce.sh

Or per benchmark:
# Psychosis-bench
python3 psychosis-bench-2026-06-12/scripts/verify_scenarios_sha.py
python3 psychosis-bench-2026-06-12/scripts/compute_stats.py
# KaBLE
python3 kable-tom-2026-06-11/scripts/compute_kable.py
# HealthBench
python3 healthbench-2026-06-09/scripts/verify_summary.py

| Dataset | SHA-256 | Bytes |
|---|---|---|
| Au Yeung 2025 psychosis-bench scenarios | d9b7820c0bebb6ec845e5825378535e8e35b79b0244a72be64ec5e49d8da439f | 30,544 |
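
A lock like the one in the table above can be checked with a few lines of standard-library Python. This is a minimal sketch and not the repo's verify_scenarios_sha.py: the function names are hypothetical, and the path to the scenarios file is left to the caller because this README does not state where the file lives.

```python
# Minimal SHA-256 and size lock check (illustrative; not the repo's verifier).
import hashlib
from pathlib import Path

# Values copied from the dataset table.
EXPECTED_SHA256 = "d9b7820c0bebb6ec845e5825378535e8e35b79b0244a72be64ec5e49d8da439f"
EXPECTED_BYTES = 30544


def sha256_of_file(path, chunk_size: int = 65536) -> str:
    """Stream the file in chunks and return its hex SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_lock(path, expected_sha256: str, expected_bytes: int) -> bool:
    """Return True only when both the byte size and the digest match the lock."""
    file_path = Path(path)
    if file_path.stat().st_size != expected_bytes:
        return False
    return sha256_of_file(file_path) == expected_sha256.lower()


if __name__ == "__main__":
    import sys

    # Usage: python3 check_lock.py <path-to-scenarios-file>  (script name is hypothetical)
    ok = verify_lock(sys.argv[1], EXPECTED_SHA256, EXPECTED_BYTES)
    print("LOCK OK" if ok else "LOCK MISMATCH")
    sys.exit(0 if ok else 1)
```

Checking the byte count first gives a cheap early failure on truncated downloads before any hashing is done.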
- Every number reproduces from input SHA + per-turn / per-item records + an analysis script in this repo. No hidden post-processing.
- Honest n. HealthBench is reported on the 4,996 questions that completed grading out of a 5,000-question target.
- Honest caveats. The psychosis-bench kappa-gate failure is stated, not buried.
- Preprint discipline. Upstream Au Yeung 2025 is named as a preprint.
Asha's architecture is the subject of filed and pending intellectual property. Public-safe references only:
- Cognitive inferencing architecture for LLMs (allowed US non-provisional, 19/290,471).
- Continuation broadening the allowed cognitive-inferencing claims (1766-001USC1, filed Feb 10, 2026).
- Lawful-axiom framework for knowledge governance (provisional, pending).
- Cryptographically auditable knowledge-unit architecture (provisional, pending).
- Deterministic post-emission correction for regulated structured outputs (provisional, pending).
Au Yeung, J., Dalmasso, J., Foschini, L., Dobson, R. J. B., & Kraljevic, Z. (2025). The Psychogenic Machine: Simulating AI Psychosis, Delusion Reinforcement and Harm Enablement in Large Language Models. arXiv:2509.10970v2 [preprint, not peer-reviewed].