applicability-benchmark

Here is 1 public repository matching this topic...

borisdev / nobsmed-healthbench-audit

Auditing the clinical-evidence claims in OpenAI's HealthBench — its own gold answers & rubrics — for hallucinated, overgeneralized, overlooked & misweighted evidence. By NoBSmed.

benchmark openai clinical-trials ai-safety hallucination rag evidence-based-medicine medical-ai evals llm-evaluation citation-verification healthbench clinical-evidence medical-ai-evaluation applicability-benchmark

Updated Jun 9, 2026
Python
