MindEval: Benchmarking Language Models on Multi-turn Mental Health Support.
git clone https://github.com/SWORDHealth/mind-eval.git
cd mind-eval
python -m venv venv
source venv/bin/activate
poetry install

Poetry version 2.1.4; Python 3.10.12.
Make sure your gcloud credentials are set up.
IMPORTANT: edit CUSTOM_CLINICIAN_INTERACTION_TEMPLATE at the end of mindeval/prompts.py if you want to use a custom clinician system prompt template. To ensure a fair comparison with other models, only the arguments in MINDEVAL_CLINICIAN_TEMPLATE are allowed in any template.
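One way to check this constraint before running anything is to compare the placeholder names used by the two templates. The sketch below assumes the templates are Python format strings with `{name}` fields, which this README does not confirm. The two template strings and the helper names are hypothetical stand-ins, not the real contents of mindeval/prompts.py.

```python
from string import Formatter


def placeholder_names(template: str) -> set[str]:
    """Return the set of named fields in a str.format-style template."""
    return {
        field_name
        for _, field_name, _, _ in Formatter().parse(template)
        if field_name  # skip literal text (None) and positional fields ("")
    }


def extra_placeholders(custom: str, reference: str) -> set[str]:
    """Fields used by `custom` that do not appear in `reference`."""
    return placeholder_names(custom) - placeholder_names(reference)


# Hypothetical stand-ins; the real templates live in mindeval/prompts.py.
reference_template = "You are a clinician. Member profile: {profile}. Goal: {goal}."
custom_template = "Support the member described here: {profile}. Today is {date}."

# Any name printed here is an argument the reference template does not allow.
print(extra_placeholders(custom_template, reference_template))
```

An empty set means the custom template stays within the reference arguments. If the real templates use a different substitution mechanism, the extraction step would need to change accordingly.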
The following command assumes you want to evaluate a clinician model with a custom system template on the default MindEval setting.
python mindeval/scripts/generate_interactions.py \
--profiles_path data/profiles.jsonl \
--clinician_system_template_version custom \
--clinician_model_api_params <YOUR_MODEL_API_PARAMS> \
--member_system_template_version v0_2 \
--member_model_api_params "{'model':'PROVIDER/claude-haiku-4-5@20251001','max_completion_tokens':16000,'reasoning_effort':'high','api_base':'API_BASE'}" \
--n_turns 10 \
--max_workers 50 \
--output_path interactions.jsonl

Copy the syntax of --member_model_api_params for your clinician model.
If you want to replicate the clinician setup for general-purpose models, set --clinician_system_template_version v0_1.
Adjust --max_workers according to rate limits.
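Because the API params value is written as a Python-style dict literal inside shell double quotes, it can be easier to generate it than to type it by hand. A minimal sketch: the keys mirror the member example above, `my-provider/my-model` and the `api_base` URL are placeholders, and `format_api_params` is a hypothetical helper. The round trip through `ast.literal_eval` is only a sanity check, since this README does not say how the script parses the argument.

```python
import ast


def format_api_params(params: dict) -> str:
    """Render params as a single-quoted dict literal without spaces."""
    items = ",".join(f"{key!r}:{value!r}" for key, value in params.items())
    literal = "{" + items + "}"
    # Sanity check: the literal must evaluate back to the same dict.
    assert ast.literal_eval(literal) == params
    return literal


clinician_params = {
    "model": "my-provider/my-model",  # placeholder model id
    "max_completion_tokens": 16000,
    "api_base": "https://example.invalid/v1",  # placeholder endpoint
}

literal = format_api_params(clinician_params)
# Wrapped in double quotes as in the member example; this assumes no value
# contains a double quote, a dollar sign, or a backtick.
print(f'--clinician_model_api_params "{literal}"')
```

Which keys a given provider accepts (for example `reasoning_effort`) depends on the model, so the set shown here is illustrative only.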
python mindeval/scripts/generate_judgments.py \
--interactions_path interactions.jsonl \
--judge_template_version v0_1 \
--judge_model_api_params "{'model':'PROVIDER/claude-sonnet-4-5@20250929','max_completion_tokens':16000,'reasoning_effort':'high','api_base':'API_BASE'}" \
--max_workers 50 \
--output_path judgments.jsonl

import pandas as pd
from mindeval.utils import load_jsonl
full_judgments = load_jsonl("judgments.jsonl")
float_judgments = [j["parsed_judgment"] for j in full_judgments]
df = pd.DataFrame(float_judgments)
print(df.mean())

bash run_benchmark.sh <YOUR_CLINICIAN_MODEL_API_PARAMS>

Human data for patient realism and judge meta-evaluation is available.
@misc{pombal2025mindevalbenchmarkinglanguagemodels,
title={MindEval: Benchmarking Language Models on Multi-turn Mental Health Support},
author={José Pombal and Maya D'Eon and Nuno M. Guerreiro and Pedro Henrique Martins and António Farinhas and Ricardo Rei},
year={2025},
eprint={2511.18491},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.18491},
}