Evaluation Framework for Mental Health Chatbots

Overview

This repository contains the implementation of the evaluation framework as described in our paper available at arXiv. This framework aims to assess the safety and reliability of mental health chatbots using a combination of expert-driven benchmarks and automated evaluation methods.

Abstract

This study develops and validates an evaluation framework for mental health chatbots, focusing on their safety and reliability due to their increasing popularity and capability for context-aware support. The framework includes 100 benchmark questions with ideal responses, validated by mental health experts, and tested on a GPT-3.5-turbo-based chatbot. Evaluation methods explored include large language model-based scoring, an agentic approach, and embedding models. Our results underline the importance of expert validation and the effectiveness of real-time data access in improving chatbot reliability.

File Descriptions

answer.py - Runs the benchmark questions through GPT-3.5-turbo and stores the results in a CSV file.
evaluate.py - Evaluates GPT-3.5-turbo answers using three LLM-based evaluation methods and stores results in separate CSV files.
evaluate_similarity_vector.py - Computes vector similarity for GPT-3.5-turbo answers.
agent_method_code - (To be added) Implements the agentic approach for real-time data evaluation.
plot.py - Generates plots as presented in the paper.
prepare_results.py - Extracts scores from LLM-based evaluation outputs and stores them in separated CSV files.
check_normal.py - Checks the normality of experts' scores against automated evaluation scores to determine appropriate statistical tests.
ttest.py - Runs paired t-tests between expert scores and other evaluation methods.
Safety_Benchmark_Mental Health -Sheet1new.csv - Contains the benchmark title, situation, question, and ideal response.
Safety_Benchmark_Mental Health-ChatGPT 3.5.csv - Stores GPT-3.5-turbo's answers to the benchmark questions.
Results.xlsx - Compiled final results for all evaluation methods.
requirements.txt - Contains requirements needed to run the codes.

Setup and Running

Requirements

Python 3.x
Libraries: the list of requirements can be found in requirements.txt

To install dependencies please run the following in your commandline:

pip install -r requirements.txt

Citation

If you find this framework useful in your research, please consider citing our paper:

@article{park2024building,
  title={Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools},
  author={Park, Jung In and Abbasian, Mahyar and Azimi, Iman and Bounds, Dawn and Jun, Angela and Han, Jaesu and McCarron, Robert and Borelli, Jessica and Li, Jia and Mahmoudi, Mona and others},
  journal={arXiv preprint arXiv:2408.04650},
  year={2024}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Evaluation Framework for Mental Health Chatbots

Overview

Abstract

File Descriptions

Setup and Running

Requirements

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
README.md		README.md
Result.xlsx		Result.xlsx
Safety_Benchmark_Mental Health -Sheet1new.csv		Safety_Benchmark_Mental Health -Sheet1new.csv
Safety_Benchmark_Mental Health-ChatGPT 3.5.csv		Safety_Benchmark_Mental Health-ChatGPT 3.5.csv
answer.py		answer.py
check_normal.py		check_normal.py
evaluate.py		evaluate.py
evaluate_similarity_vector.py		evaluate_similarity_vector.py
llm.py		llm.py
plot.py		plot.py
prepare_results.py		prepare_results.py
requirements.txt		requirements.txt
ttest.py		ttest.py

Folders and files

Latest commit

History

Repository files navigation

Evaluation Framework for Mental Health Chatbots

Overview

Abstract

File Descriptions

Setup and Running

Requirements

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages