JMLE2026-Bench (IgakuQA120)

LLM benchmark on the 120th Japanese Medical Licensing Examination (Feb 7-8, 2026).

400 questions (302 text-only + 98 with clinical images) structured as JSON with ground-truth answers.

Previous year (119th): IgakuQA119 (Feb 8-9, 2025)

Interactive Viewer: https://jmle2026-bench.streamlit.app

Leaderboard

All (400 Questions)

Model	Vision	Score	Accuracy
Claude Opus 4.6	✓	493/500 (98.6%)	393/400 (98.2%)
Gemini 3.1 Pro Preview	✓	493/500 (98.6%)	393/400 (98.2%)
Claude Sonnet 4.6	✓	489/500 (97.8%)	391/400 (97.8%)
GPT-5.2	✓	486/500 (97.2%)	386/400 (96.5%)
Qwen3.5-35B-A3B	✓	480/500 (96.0%)	380/400 (95.0%)
Qwen3.5-397B-A17B	✓	480/500 (96.0%)	382/400 (95.5%)
Qwen3.5-122B-A10B	✓	479/500 (95.8%)	381/400 (95.2%)
Qwen3.5-27B	✓	475/500 (95.0%)	381/400 (95.2%)
GPT-OSS-Swallow-120B-RL-v0.1	-	473/500 (94.6%)	379/400 (94.8%)
gpt-oss-120b (high)	-	468/500 (93.6%)	374/400 (93.5%)
Qwen3.5-27B (no-think)	✓	463/500 (92.6%)	365/400 (91.2%)
Qwen3-Swallow-32B-RL-v0.2	-	459/500 (91.8%)	365/400 (91.2%)
Qwen3.5-122B-A10B (no-think)	✓	458/500 (91.6%)	366/400 (91.5%)
Qwen3.5-35B-A3B (no-think)	✓	455/500 (91.0%)	365/400 (91.2%)
gpt-oss-120b (medium)	-	453/500 (90.6%)	363/400 (90.8%)
Qwen3-32B	-	449/500 (89.8%)	353/400 (88.2%)
Preferred-MedRECT-32B	-	449/500 (89.8%)	357/400 (89.2%)
Qwen3.5-9B	✓	436/500 (87.2%)	346/400 (86.5%)
gpt-oss-120b (low)	-	434/500 (86.8%)	344/400 (86.0%)
gpt-oss-20b (medium)	-	429/500 (85.8%)	343/400 (85.8%)
gpt-oss-20b (high)	-	425/500 (85.0%)	337/400 (84.2%)
Qwen3-32B (no-think)	-	417/500 (83.4%)	327/400 (81.8%)
Qwen3.5-9B (no-think)	✓	415/500 (83.0%)	337/400 (84.2%)
Qwen3.5-4B	✓	409/500 (81.8%)	323/400 (80.8%)
gpt-oss-20b (low)	-	386/500 (77.2%)	306/400 (76.5%)
Qwen3.5-4B (no-think)	✓	370/500 (74.0%)	292/400 (73.0%)
Qwen3.5-2B (no-think)	✓	243/500 (48.6%)	191/400 (47.8%)
Qwen3.5-2B	✓	195/500 (39.0%)	147/400 (36.8%)
Qwen3.5-0.8B (no-think)	✓	160/500 (32.0%)	126/400 (31.5%)
Qwen3.5-0.8B	✓	156/500 (31.2%)	112/400 (28.0%)

Bold = passes both the required (160/200) and general (224/300) thresholds under the official criteria.

Text-only (302 Questions)

Model	Score	Accuracy
Claude Opus 4.6	380/382 (99.5%)	300/302 (99.3%)
Claude Sonnet 4.6	378/382 (99.0%)	298/302 (98.7%)
Gemini 3.1 Pro Preview	378/382 (99.0%)	298/302 (98.7%)
GPT-5.2	376/382 (98.4%)	296/302 (98.0%)
Qwen3.5-35B-A3B	370/382 (96.9%)	290/302 (96.0%)
Qwen3.5-397B-A17B	370/382 (96.9%)	292/302 (96.7%)
Qwen3.5-122B-A10B	367/382 (96.1%)	289/302 (95.7%)
Qwen3.5-27B	365/382 (95.5%)	291/302 (96.4%)
GPT-OSS-Swallow-120B-RL-v0.1	365/382 (95.5%)	289/302 (95.7%)
gpt-oss-120b (high)	362/382 (94.8%)	286/302 (94.7%)
Qwen3.5-27B (no-think)	359/382 (94.0%)	281/302 (93.0%)
Qwen3.5-122B-A10B (no-think)	355/382 (92.9%)	281/302 (93.0%)
Qwen3-Swallow-32B-RL-v0.2	355/382 (92.9%)	281/302 (93.0%)
Qwen3.5-35B-A3B (no-think)	354/382 (92.7%)	280/302 (92.7%)
Preferred-MedRECT-32B	351/382 (91.9%)	277/302 (91.7%)
Qwen3-32B	349/382 (91.4%)	273/302 (90.4%)
gpt-oss-120b (medium)	347/382 (90.8%)	275/302 (91.1%)
Qwen3.5-9B	340/382 (89.0%)	270/302 (89.4%)
gpt-oss-120b (low)	333/382 (87.2%)	263/302 (87.1%)
gpt-oss-20b (high)	332/382 (86.9%)	258/302 (85.4%)
gpt-oss-20b (medium)	330/382 (86.4%)	262/302 (86.8%)
Qwen3.5-9B (no-think)	325/382 (85.1%)	261/302 (86.4%)
Qwen3.5-4B	323/382 (84.5%)	253/302 (83.8%)
Qwen3-32B (no-think)	321/382 (84.0%)	251/302 (83.1%)
gpt-oss-20b (low)	299/382 (78.3%)	235/302 (77.8%)
Qwen3.5-4B (no-think)	290/382 (75.9%)	228/302 (75.5%)
Qwen3.5-2B (no-think)	192/382 (50.3%)	148/302 (49.0%)
Qwen3.5-2B	161/382 (42.1%)	121/302 (40.1%)
Qwen3.5-0.8B (no-think)	134/382 (35.1%)	104/302 (34.4%)
Qwen3.5-0.8B	123/382 (32.2%)	89/302 (29.5%)

Quick Start

Requires uv.

uv run benchmark.py --model gpt-5.2 --api-key $OPENAI_API_KEY

See usage.md for all options and reproduction commands.

Scoring

The score follows the official exam scoring system (500 points total):

Category	Blocks	Questions	Points	Max
Required (必修)	B, E	50 each	Q1-25: 1pt, Q26-50: 3pt	200
General (一般+臨床)	A, C, D, F	75 each	1pt each	300
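The point rules in the table can be written out as a short Python sketch. The function and constant names here are illustrative and are not taken from benchmark.py; only the block letters, question ranges, and point values come from the table.

```python
# Block letters and question counts as listed in the scoring table.
REQUIRED_BLOCKS = ("B", "E")            # 50 questions each
GENERAL_BLOCKS = ("A", "C", "D", "F")   # 75 questions each


def question_points(block: str, number: int) -> int:
    """Points earned for a correct answer to question `number` of `block`."""
    if block in REQUIRED_BLOCKS:
        if not 1 <= number <= 50:
            raise ValueError(f"block {block} has questions 1-50, got {number}")
        # Required blocks: Q1-25 are worth 1 point, Q26-50 are worth 3 points.
        return 1 if number <= 25 else 3
    if block in GENERAL_BLOCKS:
        if not 1 <= number <= 75:
            raise ValueError(f"block {block} has questions 1-75, got {number}")
        # General blocks: every question is worth 1 point.
        return 1
    raise ValueError(f"unknown block: {block!r}")


def max_points(blocks, questions_per_block: int) -> int:
    """Sum the points of every question in the given blocks."""
    return sum(
        question_points(block, number)
        for block in blocks
        for number in range(1, questions_per_block + 1)
    )


required_max = max_points(REQUIRED_BLOCKS, 50)
general_max = max_points(GENERAL_BLOCKS, 75)
print(required_max, general_max, required_max + general_max)  # 200 300 500
```

Summing over all six blocks reproduces the 200 + 300 = 500 point total used in the Score column.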

Passing criteria (120th exam, official):

Required (B+E): 160/200 or higher (fixed every year)
General (A+C+D+F): 224/300 or higher (varies each year based on overall performance)
Prohibited choices (禁忌肢): 3 or fewer (fixed every year; which questions contain prohibited choices is not publicly disclosed)
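As a rough illustration, the criteria above can be checked with a small Python function. The names are hypothetical and not part of this repository. Because the questions containing prohibited choices are not disclosed, the sketch treats that count as optional and, when it is omitted, checks only the two point thresholds.

```python
from typing import Optional

# Thresholds for the 120th exam, as listed in the passing criteria.
REQUIRED_PASS = 160   # out of 200, blocks B + E
GENERAL_PASS = 224    # out of 300, blocks A + C + D + F
MAX_PROHIBITED = 3    # at most this many prohibited choices


def passes(
    required_points: int,
    general_points: int,
    prohibited_chosen: Optional[int] = None,
) -> bool:
    """Return True if the scores meet the 120th-exam passing criteria.

    `prohibited_chosen` is optional because the prohibited choices are
    not publicly known; when it is None only the point thresholds apply.
    """
    if required_points < REQUIRED_PASS or general_points < GENERAL_PASS:
        return False
    if prohibited_chosen is None:
        return True
    return prohibited_chosen <= MAX_PROHIBITED


print(passes(160, 224))      # True
print(passes(159, 300))      # False
print(passes(200, 300, 4))   # False
```

Both point thresholds must be met independently, so a high general score cannot compensate for a required score below 160.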

Dataset

jmle2026_dataset.json: 400 questions (302 text-only, 98 with clinical images)
images/: 110 clinical images referenced by the clinical_images field
Answers are based on the official answer key
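A minimal sketch of splitting the dataset into text-only and image questions is shown below. It assumes, without confirmation from the repository, that jmle2026_dataset.json holds a top-level list of question objects and that clinical_images is a list that is empty or absent for text-only questions. Only the file name and the clinical_images field name come from this README; the helper names are hypothetical.

```python
import json
from pathlib import Path


def load_questions(path):
    """Load the dataset, assumed here to be a JSON list of question objects."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def split_by_images(questions):
    """Split questions into (text_only, with_images).

    Assumes `clinical_images` is a list of image references that is
    empty or missing when a question has no clinical image.
    """
    text_only = [q for q in questions if not q.get("clinical_images")]
    with_images = [q for q in questions if q.get("clinical_images")]
    return text_only, with_images
```

If those assumptions hold, `split_by_images(load_questions("jmle2026_dataset.json"))` would be expected to return lists of 302 and 98 questions, matching the counts above.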

License

Code: MIT
Dataset: CC BY 4.0
- Original exam data is published by the Ministry of Health, Labour and Welfare under PDL 1.0 (CC BY 4.0 compatible).
Results (results/): Each model's output is subject to the terms of service or license of its model provider. Use the outputs under the most permissive conditions that provider allows.
