NMR-Challenge for LLMs: Evaluating Chemical Reasoning in Humans and AI

🔍 Overview

LLM-NMR is a framework for evaluating how well large language models (LLMs) solve Nuclear Magnetic Resonance (NMR) spectral analysis tasks using reasoning and domain knowledge.

Key Contributions

This repo includes:

Benchmark Dataset: 112 NMR problems at Easy, Medium, and Hard difficulty levels, sourced from NMR-Challenge.com
Comprehensive LLM Evaluation: An inference script for running benchmark tasks on LLMs with public-facing APIs from OpenAI, Anthropic, and Google
Automated Grading: Tools that convert model answers to SMILES, compare them with the reference by Tanimoto similarity, and compute scores
Systematic Analysis: Experimental configurations for analysing the effects of temperature, prompting strategy, reasoning effort, and molecular formula inclusion

Quick Start

git clone https://github.com/ATOMSLab/LLM-NMR.git
cd LLM-NMR
pip install -r requirements.txt

Required packages: pandas, numpy, rdkit, cirpy, openai, anthropic, google-generativeai, matplotlib, seaborn (json is also used, but it is part of the Python standard library and needs no installation)

Basic Usage

1. Configure `.env`

Add your API keys:

OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
GOOGLE_API_KEY=...
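Before running anything, it can help to confirm that all three keys are actually readable. The following is a minimal sketch using only the standard library; the helper names `load_env` and `missing_keys` are hypothetical, and the repository itself may load `.env` differently (for example through a dotenv package).

```python
import os
from pathlib import Path

# Key names taken from the .env template above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY")


def load_env(path=".env"):
    """Parse KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    env_file = Path(path)
    if not env_file.exists():
        return values
    for line in env_file.read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env = load_env()
    os.environ.update(env)  # make the keys visible to the provider SDKs
    print("Missing keys:", missing_keys(env) or "none")
```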

2. Edit `config.py`

Set:

Path to dataset JSON files
Output directory for inference and grading
Model settings (name, temperature, inference function to call, etc.)
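As an illustration of the three kinds of settings listed above, a configuration module might look roughly like this. Every name and value here (`ModelSettings`, `DATASET_PATH`, `OUTPUT_DIR`, `call_openai`, `output_csv`) is a placeholder chosen for the sketch and is not taken from the repository's `config.py`.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ModelSettings:
    """Hypothetical container for the per-model settings."""
    name: str
    temperature: float
    inference_fn: str  # name of the function that calls the provider API


# Placeholder values, not the repository's defaults.
DATASET_PATH = Path("datasets")
OUTPUT_DIR = Path("outputs")
MODEL = ModelSettings(name="gpt-4o", temperature=0.0, inference_fn="call_openai")


def output_csv(model, output_dir=OUTPUT_DIR):
    """Build a per-run CSV path from the model name and temperature."""
    return output_dir / f"{model.name}_T{model.temperature}.csv"
```

Keeping the output path derived from the model settings is one way to stop runs at different temperatures from overwriting each other.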

🚀 Run Benchmark

Step 1: Run Inference

python3 main.py

Generates output CSV (configurable path), e.g.:

Id,Formula,Prediction
191,C4H8O2,"### Scratchpad ###...### Start answer ###Ethyl acetate### End answer ###"
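Because the Prediction column holds free-form model output that may contain commas and line breaks, it is safer to read the file with a CSV parser than to split lines by hand. A minimal sketch with the standard `csv` module, using the sample row above as inline data (the function name `read_predictions` is an assumption of this example):

```python
import csv
import io

# Inline copy of the sample output shown above.
SAMPLE = '''Id,Formula,Prediction
191,C4H8O2,"### Scratchpad ###...### Start answer ###Ethyl acetate### End answer ###"
'''


def read_predictions(handle):
    """Return one dict per row, keyed by the Id, Formula and Prediction columns."""
    return list(csv.DictReader(handle))


rows = read_predictions(io.StringIO(SAMPLE))
for row in rows:
    print(row["Id"], row["Formula"], len(row["Prediction"]), "characters")
```

For a real run, pass an open file instead of `io.StringIO`, opened with `newline=""` as the `csv` documentation recommends.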

Step 2: Grade Outputs

python3 grade.py

This will:

Extract the final answer from the LLM output
Convert chemical names to SMILES using CIRpy
Calculate Tanimoto similarity against the correct reference using RDKit
Score: 1 for exact match (similarity = 1.0), 0 otherwise
Aggregate the scores into performance results across the analytical dimensions
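The grading steps above can be sketched in plain Python. This is not the repository's `grade.py`: name-to-SMILES conversion (CIRpy) and fingerprinting (RDKit) are left out, and fingerprints are stood in for by sets of on-bit indices so the example runs without dependencies. All function names are assumptions of this sketch.

```python
import re
from collections import defaultdict

# Markers as they appear in the sample output row.
ANSWER_RE = re.compile(r"### Start answer ###(.*?)### End answer ###", re.DOTALL)


def extract_answer(prediction):
    """Return the text between the answer markers, or None if they are missing."""
    match = ANSWER_RE.search(prediction)
    return match.group(1).strip() if match else None


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity |A & B| / |A | B| for fingerprints given as on-bit sets."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def score(similarity):
    """1 for an exact match (similarity = 1.0), 0 otherwise."""
    return 1 if similarity == 1.0 else 0


def accuracy_by(records, key):
    """Mean score per value of `key`, e.g. per difficulty tier or per model."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key]].append(record["score"])
    return {value: sum(scores) / len(scores) for value, scores in grouped.items()}
```

A quick usage example with toy data:

```python
answer = extract_answer("### Start answer ###Ethyl acetate### End answer ###")
records = [
    {"tier": "Easy", "score": score(tanimoto({1, 2, 3}, {1, 2, 3}))},
    {"tier": "Easy", "score": score(tanimoto({1, 2, 3}, {2, 3, 4}))},
]
print(answer, accuracy_by(records, "tier"))
```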

📁 Dataset

Unzip the JSON dataset from the zip folder (datasets.zip)
Place it in the root directory (or update the path in config.py)
answer_keys.zip contains the answer key files for both of our benchmarks, and is password-protected with this password: Llmhack904.
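If a command-line unzip tool is not available, the archives can also be unpacked from Python. The sketch below uses the standard `zipfile` module; the helper name `extract_archive` is hypothetical. Note that `zipfile` only decrypts archives that use the legacy zip encryption scheme, so it is an assumption that it can open `answer_keys.zip`; if it cannot, an external tool is needed.

```python
import zipfile


def extract_archive(zip_path, dest=".", password=None):
    """Extract zip_path into dest and return the names of the archive members."""
    # zipfile expects the password as bytes; it is ignored for unencrypted members.
    pwd = password.encode("utf-8") if password else None
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(path=dest, pwd=pwd)
        return archive.namelist()


# Example calls (paths are relative to the repository root):
# extract_archive("datasets.zip")
# extract_archive("answer_keys.zip", dest="answer_keys", password="<password from above>")
```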

Composition:

Difficulty	Count
Easy	53
Medium	38
Hard	24

🧪 Models Tested

Model	Provider	Type
GPT-4o	OpenAI	Standard
GPT-4o mini	OpenAI	Standard
o1	OpenAI	Reasoning
o1-mini	OpenAI	Reasoning
o3-mini	OpenAI	Reasoning
Claude-3.5 Sonnet	Anthropic	Standard
Gemini-2.0-Flash	Google	Standard

⚗️ Experiment Variables

Prompting Strategies

Strategy	Description
P1	Minimal instruction
P2	Chain-of-Thought (CoT)
P3	CoT + domain logic
P4	CoT + expert NMR tips
P5	CoT + knowledge + logic (full)

Temperature: 0.0, 0.5, 0.8, 1.0
Formula Inclusion: with/without molecular_formula
Difficulty Tier: Easy / Medium / Hard
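To get a sense of the size of the experimental space, the variables above can be enumerated as a grid. This sketch simply takes the Cartesian product of prompting strategy, temperature, and formula inclusion; whether the study ran every combination for every model is not stated here, so treat the full grid as an assumption.

```python
from itertools import product

PROMPTS = ["P1", "P2", "P3", "P4", "P5"]
TEMPERATURES = [0.0, 0.5, 0.8, 1.0]
WITH_FORMULA = [True, False]


def experiment_grid():
    """Return one settings dict per (prompt, temperature, formula) combination."""
    return [
        {"prompt": prompt, "temperature": temperature, "with_formula": with_formula}
        for prompt, temperature, with_formula in product(PROMPTS, TEMPERATURES, WITH_FORMULA)
    ]


grid = experiment_grid()
print(len(grid))  # 5 prompts * 4 temperatures * 2 formula settings = 40
```

Difficulty tier is a property of each problem rather than a run setting, so it is left out of the grid and used only when grouping results.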

📈 Results (With Formula)

Model	Accuracy	Rank
o1	69%	🥇
o3-mini	65%	🥈
Claude-3.5 Sonnet	51%	🥉
Gemini-2.0-Flash	38%	4th
o1-mini	30%	5th
GPT-4o	25%	6th
GPT-4o mini	10%	7th

📚 Citation

If you use this work in your research, please cite:

@article{llm_spectroscopy_2025,
  title={NMR-Challenge for LLMs: Evaluating Chemical Reasoning in Humans and AI},
  author  = {Sharlin, Samiha and Agbere, Fariha and Ishimwe, Kevin and Osifov{\'a}, Zuzana and Socha, Ond{\v r}ej and Dra{\v c}{\'\i}nsk{\'y}, Martin and Josephson, Tyler},
  journal = {ChemRxiv},
  year    = {2025},
  doi     = {10.26434/chemrxiv-2025-x8h36-v2},
}

🔗 Links

🔬 NMR-Challenge.com – source of benchmark problems
📂 Dataset Zip – JSON dataset (Google Zip folder)
🧪 RDKit Documentation – for SMILES handling and similarity metrics
