Single-file, zero-dependency local LLM benchmark suite + live dashboard. Deterministic scoring (Wilson CIs, paired McNemar A/B), per-question drill-down & live system metrics — for any OpenAI-compatible server (llama.cpp/vLLM/Ollama/LM Studio/ds4). Broad + medical benchmark panel.
This repository contains code and resources related to an in-depth analysis of OpenAI's HealthBench, a benchmark designed for evaluating large language models in the healthcare sector.
Auditing the clinical-evidence claims in OpenAI's HealthBench — its own gold answers & rubrics — for hallucinated, overgeneralized, overlooked & misweighted evidence. By NoBSmed.