Hi, thanks for the great work on Video-MME v2!
I've been using the evaluation script and noticed some potentially unexpected behavior in the cal_logic function when dealing with parallel group structures. I'm not 100% sure if this is intentional, so I'd like to discuss it here.
The core loop in cal_logic iterates through scores sequentially and breaks on the first 0:
    last_correct_idx = -1
    for idx, val in enumerate(scores):
        if val:
            last_correct_idx = idx
        else:
            break
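To make the walkthroughs below easier to reproduce, here is a minimal standalone sketch of that loop. The wrapper name last_correct_index is hypothetical (it is not a function in the repo), and the parallel-group patches that follow the loop in cal_logic are deliberately left out:

```python
def last_correct_index(scores):
    """Return the index of the last answer in the leading run of correct answers.

    Hypothetical wrapper around the loop quoted above; returns -1 when the
    very first score is 0. The parallel-group patches are not included.
    """
    last_correct_idx = -1
    for idx, val in enumerate(scores):
        if val:
            last_correct_idx = idx
        else:
            # The scan stops at the first wrong answer, even if that answer
            # belongs to a parallel group that another member already passed.
            break
    return last_correct_idx


if __name__ == "__main__":
    for scores in ([1, 0, 1, 1], [0, 1, 1, 1], [1, 1, 0, 1]):
        print(scores, "->", last_correct_index(scores))
```

These are the three score vectors used in the examples below; the returned indices are the lci values quoted in each walkthrough before any patch is applied.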
There are special-case patches for parallel groups after this loop. However, it seems like they might not cover all the cases. Let me walk through a few examples:
Example 1: [[1, 2], 3, 4] with scores = [1, 0, 1, 1]
Q1 and Q2 are parallel — either one being correct should mean the parallel group passes. Here Q1 is correct, so the group should pass.
- Loop: idx=0 (val=1) → lci=0; idx=1 (val=0) → break (lci is short for last_correct_idx)
- The patch checks: if last_correct_idx == -1 and scores[1]. Here lci is 0, not -1, so the patch does not fire.
- Result: score_map[1] = 10.0
Since Q1 already satisfies the parallel group, and Q3/Q4 are both correct, should this score higher? Based on our understanding, Q1 earns 1/10, Q3 earns 3/10, Q4 earns 5/10, giving 90.0 instead of 10.0. It seems like the patch only covers the case where Q1 is wrong and Q2 is right, but not the reverse (Q1 right, Q2 wrong).
Example 2: [[1, 2], 3, 4] with scores = [0, 1, 1, 1]
- Loop: idx=0 (val=0) → break, lci=-1
- Patch triggers: lci becomes 0
- Result: score_map[1] = 10.0
The patch correctly identifies that Q2 passes the parallel group, but last_correct_idx only advances to 0 and the check does not continue to Q3 and Q4. With Q3 and Q4 both correct and their prerequisites met, we'd expect Q2 to earn 1/10, Q3 to earn 3/10, and Q4 to earn 5/10, giving 90.0 instead of 10.0. Should the chain continue after the parallel group patch?
Example 3: [1, [2, 3], 4] with scores = [1, 1, 0, 1]
Q2 and Q3 are parallel. Q2 is correct, so the parallel group passes.
- Loop: idx=0 (val=1) → lci=0; idx=1 (val=1) → lci=1; idx=2 (val=0) → break
- The patch checks: if last_correct_idx == 0 and scores[2]. Here lci is 1, not 0, so the patch does not fire.
- Result: score_map[2] = 33.33
Q2 already proves the parallel group passes, and Q4 is also correct, so we'd expect Q1 to earn 1/12, Q2 to earn 3/12, and Q4 to earn 5/12, giving 75.0 instead of 33.33. Should this be scored higher?
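To make the expected numbers concrete, here is a rough sketch of the group-aware scoring we had in mind. It is not the repo's implementation: the function name group_aware_score is hypothetical, and the weights (1, 3, 5 per chain level, with every question in a parallel group carrying its level's weight in the denominator) are inferred from the fractions quoted above, not taken from the code. It also assumes that a level passes if any member is correct and that every correct member of a passed level earns its weight.

```python
def group_aware_score(structure, scores):
    """Score a logic chain where a nested list marks a parallel group.

    structure: list of 1-based question numbers, e.g. [[1, 2], 3, 4].
    scores:    0/1 per question, in question order.
    Weights are an assumption inferred from the issue text: level k
    (0-based) weighs 2*k + 1, so levels weigh 1, 3, 5, ...
    """
    total = 0
    earned = 0
    chain_alive = True
    for level, node in enumerate(structure):
        members = node if isinstance(node, list) else [node]
        weight = 2 * level + 1
        total += weight * len(members)
        correct = [q for q in members if scores[q - 1]]
        if chain_alive and correct:
            # Assumption: one correct member is enough to pass the level,
            # and each correct member earns the level's weight.
            earned += weight * len(correct)
        else:
            # A level with no correct member breaks the chain for good.
            chain_alive = False
    return round(100.0 * earned / total, 2)


if __name__ == "__main__":
    print(group_aware_score([[1, 2], 3, 4], [1, 0, 1, 1]))  # Example 1
    print(group_aware_score([[1, 2], 3, 4], [0, 1, 1, 1]))  # Example 2
    print(group_aware_score([1, [2, 3], 4], [1, 1, 0, 1]))  # Example 3
```

Under these assumptions the three examples come out at 90.0, 90.0, and 75.0, which are the values we expected above. If the intended design is different (for example, if a parallel group should require all members to be correct), the sketch would need to change accordingly.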
Would appreciate any clarification on whether this is the intended behavior or if we're misunderstanding the scoring design. Thanks!