Auditing the clinical-evidence claims in OpenAI's HealthBench (its own gold answers and rubrics) for hallucinated, overgeneralized, overlooked, and misweighted evidence. By NoBSmed.
benchmark openai clinical-trials ai-safety hallucination rag evidence-based-medicine medical-ai evals llm-evaluation citation-verification healthbench clinical-evidence medical-ai-evaluation applicability-benchmark
Updated Jun 9, 2026 - Python