When reproducing the reported results on GPT-4o, I noticed that the Responses API and the Chat Completions API produce numbers that differ beyond expected error bars, especially on gpt-4o-2024-08-06.
My results on GPT-4o:
- Chat Completions API: 53.0% (matches the reported result)
- Responses API: 49.2% (does not match)
I only changed one line of code, swapping `ChatCompletionSampler` for `ResponsesSampler`:

```diff
-    "gpt-4o-2024-08-06": ChatCompletionSampler(
+    "gpt-4o-2024-08-06": ResponsesSampler(
```
Command: `python -m simple-evals.simple_evals --model gpt-4o-2024-08-06 --eval gpqa --n-repeats 10`
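For context, the two samplers ultimately call different OpenAI endpoints with differently shaped requests, which is a plausible place for a discrepancy to creep in. Below is a minimal sketch of the two request shapes (the helper function names and the prompt are illustrative, not the actual simple-evals internals): the Chat Completions API takes a `messages` list, while the Responses API takes `input`.

```python
# Hypothetical helpers illustrating the request-shape difference between the
# two endpoints; these are not functions from simple-evals.

def chat_completions_payload(model: str, prompt: str) -> dict:
    # Chat Completions API: prompt goes in a `messages` list.
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def responses_payload(model: str, prompt: str) -> dict:
    # Responses API: prompt goes in `input` (a string or a list of messages).
    return {"model": model, "input": [{"role": "user", "content": prompt}]}
```

Even with identical prompt text, defaults applied server-side per endpoint (e.g. sampling settings not explicitly pinned by the sampler) could account for part of the gap, so it would be worth confirming both samplers send equivalent parameters.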