Question about cal_logic scoring for parallel group structures #2

Description

@yj772881654

Hi, thanks for the great work on Video-MME v2!

I've been using the evaluation script and noticed some potentially unexpected behavior in the cal_logic function when dealing with parallel group structures. I'm not 100% sure if this is intentional, so I'd like to discuss it here.

The core loop in cal_logic iterates through scores sequentially and breaks on the first 0:

last_correct_idx = -1
for idx, val in enumerate(scores):
    if val:
        last_correct_idx = idx
    else:
        break

There are special-case patches for parallel groups after this loop. However, it seems like they might not cover all the cases. Let me walk through a few examples:
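
To make the walkthroughs reproducible, here is a minimal sketch of the behavior as I understand it. It is not the repository's code: the function name `last_correct_index` and the `patch` parameter are hypothetical, and the patch is modeled only on the two conditions quoted in the examples (`last_correct_idx == -1 and scores[1]`, `last_correct_idx == 0 and scores[2]`). The value the patch assigns is assumed to be one past the expected index, which matches Example 2.

```python
def last_correct_index(scores, patch=None):
    """Sequential scan from the snippet above, plus one optional patch.

    `patch` is a hypothetical (expected_lci, score_idx) pair standing in
    for the special cases described in this issue, e.g. (-1, 1) for
    [[1, 2], 3, 4] and (0, 2) for [1, [2, 3], 4].
    """
    last_correct_idx = -1
    for idx, val in enumerate(scores):
        if val:
            last_correct_idx = idx
        else:
            break  # the scan stops at the first wrong answer

    if patch is not None:
        expected_lci, score_idx = patch
        # Assumption: the patch fires only when the scan stopped exactly
        # at expected_lci and the second group member is correct.
        if last_correct_idx == expected_lci and scores[score_idx]:
            last_correct_idx = expected_lci + 1
    return last_correct_idx


print(last_correct_index([1, 0, 1, 1], patch=(-1, 1)))  # Example 1 -> 0
print(last_correct_index([0, 1, 1, 1], patch=(-1, 1)))  # Example 2 -> 0
print(last_correct_index([1, 1, 0, 1], patch=(0, 2)))   # Example 3 -> 1
```

In all three cases the index stops inside or right after the parallel group, even though later questions are correct.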

Example 1: [[1, 2], 3, 4] with scores = [1, 0, 1, 1]

Q1 and Q2 are parallel — either one being correct should mean the parallel group passes. Here Q1 is correct, so the group should pass.

  • Loop (lci is short for last_correct_idx): idx=0 (val=1) → lci=0; idx=1 (val=0) → break
  • The patch checks if last_correct_idx == -1 and scores[1], but lci is 0, not -1
  • Result: score_map[1] = 10.0

Since Q1 already satisfies the parallel group and Q3 and Q4 are both correct, should this score higher? Based on our understanding, Q1 earns 1/10, Q3 earns 3/10, and Q4 earns 5/10, giving 90.0 instead of 10.0. The patch seems to cover only the case where Q1 is wrong and Q2 is right, not the reverse (Q1 right, Q2 wrong).

Example 2: [[1, 2], 3, 4] with scores = [0, 1, 1, 1]

  • Loop: idx=0 (val=0) → break, lci=-1
  • Patch triggers: lci becomes 0
  • Result: score_map[1] = 10.0

The patch correctly identifies that Q2 passes the parallel group, but last_correct_idx only advances to 0 and the scan does not resume at Q3 and Q4. With Q3 and Q4 both correct and their prerequisites met, we'd expect Q2 to earn 1/10, Q3 to earn 3/10, and Q4 to earn 5/10, giving 90.0 instead of 10.0. Should the chain continue after the parallel group patch?

Example 3: [1, [2, 3], 4] with scores = [1, 1, 0, 1]

Q2 and Q3 are parallel. Q2 is correct, so the parallel group passes.

  • Loop: idx=0 (val=1) → lci=0; idx=1 (val=1) → lci=1; idx=2 (val=0) → break
  • Patch checks if last_correct_idx == 0 and scores[2], but lci is 1, not 0
  • Result: score_map[2] = 33.33

Q2 already proves the parallel group passes, and Q4 is also correct, so we'd expect Q1 to earn 1/12, Q2 to earn 3/12, and Q4 to earn 5/12, giving 75.0 instead of 33.33. Should this be scored higher?
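
For reference, here is a rough sketch of the group-aware scoring we had in mind when computing the expected values above. Everything in it is an assumption on our side: the function name `group_aware_score` is hypothetical, the level weights 1, 3, 5 are inferred from the fractions in the three examples, and the denominators 10 and 12 are reproduced by counting the level weight once per question in that level. It is not taken from the evaluation script.

```python
def group_aware_score(structure, scores):
    """Score a logic chain where a nested list marks a parallel group.

    structure: e.g. [[1, 2], 3, 4], using 1-based question numbers.
    scores:    per-question 0/1 results, in question order.
    Assumed rule: a level passes if ANY of its questions is correct, and
    the chain continues only while every earlier level has passed.
    """
    earned = 0
    total = 0
    chain_alive = True
    for level, item in enumerate(structure):
        weight = 2 * level + 1  # inferred weights: 1, 3, 5, ...
        members = item if isinstance(item, list) else [item]
        total += weight * len(members)

        correct = sum(1 for q in members if scores[q - 1])
        if chain_alive and correct:
            earned += weight * correct  # credit each correct question
        else:
            chain_alive = False  # a failed level blocks all later levels
    return round(100 * earned / total, 2)


print(group_aware_score([[1, 2], 3, 4], [1, 0, 1, 1]))  # Example 1 -> 90.0
print(group_aware_score([[1, 2], 3, 4], [0, 1, 1, 1]))  # Example 2 -> 90.0
print(group_aware_score([1, [2, 3], 4], [1, 1, 0, 1]))  # Example 3 -> 75.0
```

If the intended design differs, for example if a parallel group should only pass when a specific member is correct, the expected values above would of course change.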

Would appreciate any clarification on whether this is the intended behavior or if we're misunderstanding the scoring design. Thanks!
