LLM-NMR is a framework for evaluating the potential of large language models (LLMs) for solving Nuclear Magnetic Resonance (NMR) spectral analysis tasks through reasoning and domain knowledge.
This repo includes:
- Benchmark Dataset: 115 NMR problems across Easy, Medium, and Hard difficulty levels, sourced from NMR-Challenge.com
- Comprehensive LLM Evaluation: An inference script for running benchmark tasks on LLMs with public-facing APIs from OpenAI, Anthropic, and Google
- Automated Grading: Tools for grading model outputs via SMILES conversion, Tanimoto similarity comparison, and performance scoring
- Systematic Analysis: Experimental configuration for analysing the effects of temperature, prompting strategies, reasoning effort, and molecular formula inclusion
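
Clone the repository and install the dependencies:

```
git clone https://github.com/ATOMSLab/LLM-NMR.git
cd LLM-NMR
pip install -r requirements.txt
```

Required packages:
pandas, numpy, rdkit, cirpy, openai, anthropic, google-generativeai, matplotlib, seaborn (json is also used, but it ships with the Python standard library)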

Add your API keys:

```
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
GOOGLE_API_KEY=...
```

Set:
- Path to dataset JSON files
- Output directory for inference and grading
- Model settings (name, temperature, inference function to call, etc.)
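
The settings and keys above could be gathered in one place. The following is a minimal sketch, not the repository's actual `config.py`: the `RunConfig` class, its field names, and its default values are hypothetical, and only the three environment variable names come from this README.

```python
import os
from dataclasses import dataclass

# Environment variable names listed in this README.
API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY")


@dataclass
class RunConfig:
    """Hypothetical container for the settings listed above."""
    dataset_path: str = "dataset.json"  # assumed file name
    output_dir: str = "outputs"         # assumed directory
    model_name: str = "gpt-4o"          # assumed model identifier
    temperature: float = 0.0


def load_api_keys(env=None):
    """Return the API keys that are set, skipping missing or empty ones."""
    env = os.environ if env is None else env
    return {name: env[name] for name in API_KEY_VARS if env.get(name)}


def missing_api_keys(env=None):
    """List the expected variables that are not set."""
    found = load_api_keys(env)
    return [name for name in API_KEY_VARS if name not in found]
```

Calling `missing_api_keys()` before a run is one way to fail early with a clear message instead of hitting an authentication error partway through a benchmark.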

Run inference:

```
python3 main.py
```

Generates output CSV (configurable path), e.g.:

```
Id,Formula,Prediction
191,C4H8O2,"### Scratchpad ###...### Start answer ###Ethyl acetate### End answer ###"
```

Grade the outputs:

```
python3 grade.py
```

This will:
- Extract the final answer from the LLM output
- Convert chemical names to SMILES using CIRpy
- Calculate Tanimoto similarity against the correct reference using RDKit
- Score: 1 for exact match (similarity = 1.0), 0 otherwise
- Aggregate results into performance scores across analytical dimensions
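
The grading steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `grade.py`: the function names are hypothetical, the Tanimoto coefficient is computed on plain Python sets standing in for fingerprint bits (the real fingerprints come from RDKit), and the CIRpy name-to-SMILES step is left out because it needs network access.

```python
import csv
import io

# Answer markers as they appear in the sample output CSV.
START, END = "### Start answer ###", "### End answer ###"


def extract_answer(prediction):
    """Return the text between the answer markers, or None if either is absent."""
    start = prediction.find(START)
    end = prediction.find(END, start + len(START))
    if start == -1 or end == -1:
        return None
    return prediction[start + len(START):end].strip()


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient |A & B| / |A | B| for two sets of 'on' bits.

    Returns 0.0 for two empty sets, a choice made for this sketch only.
    """
    a, b = set(bits_a), set(bits_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score(similarity):
    """1 for an exact match (similarity = 1.0), 0 otherwise."""
    return 1 if similarity == 1.0 else 0


def read_predictions(csv_text):
    """Parse the inference CSV into (id, formula, extracted answer) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["Id"], row["Formula"], extract_answer(row["Prediction"]))
            for row in reader]


sample = (
    "Id,Formula,Prediction\n"
    '191,C4H8O2,"### Scratchpad ###...### Start answer ###'
    'Ethyl acetate### End answer ###"\n'
)
print(read_predictions(sample))
```

Because the score is binary, a prediction that is structurally close to the reference (say, similarity 0.9) still counts as 0; the similarity value is only used to detect an exact match.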

- Unzip the JSON dataset from the zip folder.
- Place it in the root directory (or update the path in `config.py`).
- `answer_keys.zip` contains the answer key files for both of our benchmarks and is password-protected with this password: Llmhack904.

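
Extraction could also be scripted. The following is a minimal sketch, assuming the archives use the legacy ZIP encryption that Python's standard `zipfile` module can read (AES-encrypted archives need a third-party tool); the `extract_zip` helper and its arguments are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir=".", password=None):
    """Extract every member of zip_path into dest_dir and return their names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # zipfile expects the password as bytes; it is ignored for unencrypted members.
    pwd = password.encode("utf-8") if password else None
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest, pwd=pwd)
        return archive.namelist()
```

For the answer keys, the call would take the form `extract_zip("answer_keys.zip", "answer_keys", password=...)`, with the password given above.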
| Difficulty | Count |
|---|---|
| Easy | 53 |
| Medium | 38 |
| Hard | 24 |
| Model | Provider | Type |
|---|---|---|
| GPT-4o | OpenAI | Standard |
| GPT-4o mini | OpenAI | Standard |
| o1 | OpenAI | Reasoning |
| o1-mini | OpenAI | Reasoning |
| o3-mini | OpenAI | Reasoning |
| Claude-3.5 Sonnet | Anthropic | Standard |
| Gemini-2.0-Flash | Google | Standard |

- Prompting strategies:

| Strategy | Description |
|---|---|
| P1 | Minimal instruction |
| P2 | Chain-of-Thought (CoT) |
| P3 | CoT + domain logic |
| P4 | CoT + expert NMR tips |
| P5 | CoT + knowledge + logic (full) |

- Temperature: `0.0`, `0.5`, `0.8`, `1.0`
- Formula Inclusion: with/without `molecular_formula`
- Difficulty Tier: Easy / Medium / Hard
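
The dimensions above define a grid of run settings per model. A minimal sketch of enumerating it, assuming a full factorial design (this README does not say whether every combination was actually run); the dictionary keys are hypothetical. Difficulty tier is left out because it is a property of each problem, not a run setting.

```python
from itertools import product

PROMPTS = ["P1", "P2", "P3", "P4", "P5"]
TEMPERATURES = [0.0, 0.5, 0.8, 1.0]
INCLUDE_FORMULA = [True, False]


def experiment_grid():
    """Return one settings dict per (prompt, temperature, formula) combination."""
    return [
        {"prompt": p, "temperature": t, "include_formula": f}
        for p, t, f in product(PROMPTS, TEMPERATURES, INCLUDE_FORMULA)
    ]


grid = experiment_grid()
print(len(grid))  # 5 * 4 * 2 = 40
```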
| Model | Accuracy | Rank |
|---|---|---|
| o1 | 69% | 🥇 |
| o3-mini | 65% | 🥈 |
| Claude-3.5 Sonnet | 51% | 🥉 |
| Gemini-2.0-Flash | 38% | 4th |
| o1-mini | 30% | 5th |
| GPT-4o | 25% | 6th |
| GPT-4o mini | 10% | 7th |
If you use this work in your research, please cite:

```bibtex
@article{llm_spectroscopy_2025,
  title = {NMR-Challenge for LLMs: Evaluating Chemical Reasoning in Humans and AI},
  author = {Sharlin, Samiha and Agbere, Fariha and Ishimwe, Kevin and Osifov{\'a}, Zuzana and Socha, Ond{\v r}ej and Dra{\v c}{\'\i}nsk{\'y}, Martin and Josephson, Tyler},
  journal = {ChemRxiv},
  year = {2025},
  doi = {10.26434/chemrxiv-2025-x8h36-v2},
}
```

- 🔬 NMR-Challenge.com – source of benchmark problems
- 📂 Dataset Zip – (JSON dataset on Google Zip folder)
- 🧪 RDKit Documentation – for SMILES handling and similarity metrics
