Hi,
I noticed something in the README that seems surprising and wanted to check whether it's intentional or a typo.
In the DROP benchmark results, the accuracy scores listed are:
gpt-4.1: 79.4
gpt-4.1-mini: 81.0
gpt-4.1-nano: 82.2
Given that DROP evaluates reasoning over paragraphs, it seems odd that gpt-4.1-nano outperforms both gpt-4.1-mini and gpt-4.1. Could you confirm whether these results are accurate, or whether the scores were swapped?
Appreciate your time and the effort behind this repo!
Best,
Iker