Hi,
I noticed something in the README that seems surprising and wanted to check whether it's intentional or a typo.
In the DROP benchmark results, the accuracy scores listed are:
gpt-4.1: 79.4
gpt-4.1-mini: 81.0
gpt-4.1-nano: 82.2
Given that DROP evaluates reasoning over paragraphs, it seems odd that gpt-4.1-nano outperforms both gpt-4.1-mini and gpt-4.1. Could you confirm whether these results are accurate, or whether the scores were swapped?
Appreciate your time and the effort behind this repo!
Best,
Iker