I found your benchmark interesting and ran the evaluation on the DT (Diagram and Table) subset with your code.
However, my results are significantly lower than the officially reported numbers. Here is my setup:
Model: Qwen2.5-VL-7B, resolution: 4k*4k, max_new_tokens=512.
The prompt follows your official code.
The final results are:
Perception task: 53.03% (ours) vs. 77.97% (official leaderboard)
Reasoning task: 46.40% (ours) vs. 59.80% (official leaderboard)
Could the authors share the exact evaluation details? This matters a lot for reproducing the leaderboard numbers.
Additionally, if I test with CoT, e.g. by adding 'Respond with the reason why you selected the choice' to the prompt, the results improve marginally.
Did you use CoT in the official evaluation? Could you clarify this?
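To make the comparison concrete, here is a minimal sketch of the two prompt variants in question. Only the CoT sentence is quoted from above; the helper name `build_prompt`, the option lettering, and the line layout are assumptions for illustration and are not taken from the official code.

```python
# CoT instruction quoted verbatim from the text above.
COT_SUFFIX = "Respond with the reason why you selected the choice"


def build_prompt(question, options, use_cot=False):
    """Hypothetical helper: format a multiple-choice prompt.

    The layout (one option per line, lettered A, B, C, ...) is an
    assumption, not the benchmark's official prompt format.
    """
    lines = [question]
    for letter, text in zip("ABCDEFGH", options):
        lines.append(f"({letter}) {text}")
    if use_cot:
        # The only difference between the two variants is this extra line.
        lines.append(COT_SUFFIX)
    return "\n".join(lines)


if __name__ == "__main__":
    q = "Which bar is the tallest?"
    opts = ["Red", "Blue", "Green"]
    print(build_prompt(q, opts))
    print("---")
    print(build_prompt(q, opts, use_cot=True))
```

Knowing which of these two variants (or a different one) produced the leaderboard numbers would help narrow down the gap.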
I hope to hear from you soon.