Evaluation results

I found your benchmark interesting and ran the evaluation on the DT (Diagram and Table) tasks with your code.
However, my results are significantly lower than the officially reported numbers. Here is my setup:
Model: Qwen2.5-VL-7B, resolution: 4k*4k, max_new_tokens=512.
The prompt follows your [official code](https://github.com/MME-Benchmarks/MME-RealWorld/blob/main/evaluation/download_and_prepare_prompt.py).
The final results are:
Perception task: 53.03% (ours) vs. 77.97% ([official leaderboard](https://mme-realworld.github.io/home_page.html#leaderboard))
Reasoning task: 46.40% (ours) vs. 59.80% ([official leaderboard](https://mme-realworld.github.io/home_page.html#leaderboard))
**Could the authors share the evaluation details? This matters a lot.**

Additionally, if I test with CoT prompting, e.g. by adding 'Respond with the reason why you selected the choice' to the prompt, the results improve only marginally.
**Did you use CoT in the official evaluation? Could you specify this?**
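
In case it helps with the comparison, here is a minimal sketch of how an option letter could be extracted from a free-form response before scoring. The function names, the A to E option range, and the regular expressions are my own assumptions and are not taken from the official code; a stricter or looser extractor might account for part of the gap, especially with CoT-style outputs.

```python
import re
from typing import List, Optional


def extract_choice(response: str, letters: str = "ABCDE") -> Optional[str]:
    """Return the predicted option letter, or None if none is found.

    Hypothetical helper: the option range and patterns are assumptions.
    """
    text = response.strip()

    # Case 1: the response starts with the option, e.g. "B", "(B) ...", "B. ..."
    lead = re.match(
        rf"(?:\(([{letters}])\)|([{letters}])\s*(?:[.:,]|\n|$))", text
    )
    if lead:
        return lead.group(1) or lead.group(2)

    # Case 2: an explicit statement such as "the answer is (C)".
    # With CoT output the final statement is the one that counts,
    # so the last match is used.
    hits = re.findall(
        rf"[Aa]nswer\s*(?:is|:)?\s*\(?([{letters}])\)?(?![A-Za-z])", text
    )
    if hits:
        return hits[-1]

    return None


def accuracy(responses: List[str], gold: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)
```

Responses for which no letter can be extracted are counted as wrong here; if the official script handles them differently, that would be good to know as well.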

I hope to hear from you soon.



