Evaluation results #12

@yingift

I found your benchmark interesting and ran the evaluation on DT (Diagram and Table) with your code.
However, my results are significantly lower than the officially reported numbers. Here is my setup:
Model: Qwen2.5-VL-7B, resolution: 4k*4k, max_new_tokens=512.
The prompt follows your official code.
The final results are:
Perception task: 53.03% (ours) vs. 77.97% (official leaderboard)
Reasoning task: 46.40% (ours) vs. 59.80% (official leaderboard)
Could the authors share the evaluation details? This matters a lot.

Additionally, if I test with CoT, e.g. by adding 'Respond with the reason why you selected the choice' to the prompt, the results improve marginally.
Did you use CoT in the official evaluation? Could you specify this?
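
To make the comparison concrete, here is a minimal sketch of the two prompt variants and of how a choice letter could be parsed from the response. It assumes multiple-choice questions with lettered options. `build_prompt`, `extract_choice`, `accuracy`, and the "Answer with the option's letter" instruction are hypothetical and not taken from the official code; only the CoT sentence is the one quoted above.

```python
import re

# The CoT sentence quoted above; everything else here is illustrative.
COT_SUFFIX = "Respond with the reason why you selected the choice"
LETTERS = "ABCDEFGH"


def build_prompt(question, options, use_cot=False):
    """Format a multiple-choice question, optionally appending the CoT sentence."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    # Assumed answer instruction; the official prompt may differ.
    lines.append("Answer with the option's letter from the given choices.")
    if use_cot:
        lines.append(COT_SUFFIX)
    return "\n".join(lines)


def extract_choice(response, num_options):
    """Return the first standalone option letter in the response, or None."""
    valid = LETTERS[:num_options]
    match = re.search(rf"\b([{valid}])\b", response.strip())
    return match.group(1) if match else None


def accuracy(predictions, answers):
    """Percentage of predictions that exactly match the reference letters."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)
```

The parsing step is one place where scores can diverge: with CoT the response contains free text, so a naive first-letter match may pick up a stray "A" used as an article. Knowing how the official script extracts the answer would help rule this out.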

Hope you reply soon.
