Question about cal_logic scoring for parallel group structures #2

Hi, thanks for the great work on Video-MME v2!


I've been using the evaluation script and noticed some potentially unexpected behavior in the `cal_logic` function when dealing with parallel group structures. I'm not 100% sure if this is intentional, so I'd like to discuss it here.

The core loop in `cal_logic` iterates through `scores` in order and breaks at the first incorrect answer (the first 0):

```python
last_correct_idx = -1
for idx, val in enumerate(scores):
    if val:
        last_correct_idx = idx
    else:
        break
```
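To make the early break concrete, here is a minimal sketch that wraps the same loop in a helper and runs it on the three score vectors used in the examples further down. The helper name `prefix_last_correct_idx` is my own, not something from the evaluation script, and it deliberately leaves out the parallel-group patches.

```python
def prefix_last_correct_idx(scores):
    """Return the index of the last answer in the leading run of correct answers.

    Hypothetical helper: mirrors only the core loop quoted above, without the
    parallel-group patches that follow it in the real script.
    """
    last_correct_idx = -1
    for idx, val in enumerate(scores):
        if val:
            last_correct_idx = idx
        else:
            # The first incorrect answer ends the scan, even if later ones are correct.
            break
    return last_correct_idx


for scores in ([1, 0, 1, 1], [0, 1, 1, 1], [1, 1, 0, 1]):
    print(scores, "->", prefix_last_correct_idx(scores))
```

Without any patch, the three vectors stop at index 0, -1, and 1 respectively, so everything after the first 0 is ignored.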

Special-case patches for parallel groups follow this loop, but they don't seem to cover every case. Let me walk through a few examples (`lci` below is short for `last_correct_idx`):

### Example 1: `[[1, 2], 3, 4]` with `scores = [1, 0, 1, 1]`

Q1 and Q2 are parallel, so either one being correct should be enough for the parallel group to pass. Here Q1 is correct, so the group should pass.

- Loop: idx=0 (val=1) → lci=0; idx=1 (val=0) → **break**
- The patch checks `if last_correct_idx == -1 and scores[1]`, but lci is 0, not -1, so it does not trigger
- Result: `score_map[1] = 10.0`

Since Q1 already satisfies the parallel group and Q3/Q4 are both correct, should this score higher? By our understanding, Q1 earns 1/10, Q3 earns 3/10, and Q4 earns 5/10, giving **90.0** instead of 10.0. The patch seems to cover only the case where Q1 is wrong and Q2 is right, not the reverse (Q1 right, Q2 wrong).

### Example 2: `[[1, 2], 3, 4]` with `scores = [0, 1, 1, 1]`

- Loop: idx=0 (val=0) → **break**, lci=-1
- Patch triggers: lci becomes 0
- Result: `score_map[1] = 10.0`

The patch correctly identifies that Q2 passes the parallel group, but `last_correct_idx` only advances to 0 and the scan does not go on to check Q3 and Q4. With Q3 and Q4 both correct and their prerequisites met, we'd expect Q2 to earn 1/10, Q3 3/10, and Q4 5/10, giving **90.0** instead of 10.0. Should the chain continue after the parallel group patch?
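For reference, a minimal sketch of how Examples 1 and 2 end up at the same place. It assumes the `[[1, 2], 3, 4]` patch does nothing more than what is described in this issue (the condition `last_correct_idx == -1 and scores[1]`, after which lci becomes 0). The function name is hypothetical and the rest of `cal_logic` is not reproduced, so the real script may differ.

```python
def lci_with_first_group_patch(scores):
    """Core loop plus the [[1, 2], 3, 4] patch as described in this issue.

    Hypothetical reconstruction: only the patch condition and its effect
    (lci becomes 0) are taken from the issue text.
    """
    last_correct_idx = -1
    for idx, val in enumerate(scores):
        if val:
            last_correct_idx = idx
        else:
            break
    # Patch: Q1 wrong but Q2 right counts as passing the parallel group.
    # Nothing after this point looks at Q3 or Q4 again.
    if last_correct_idx == -1 and scores[1]:
        last_correct_idx = 0
    return last_correct_idx


print(lci_with_first_group_patch([1, 0, 1, 1]))  # Example 1
print(lci_with_first_group_patch([0, 1, 1, 1]))  # Example 2
```

Under that assumption both calls return 0, which matches the `score_map[1] = 10.0` result in both examples.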

### Example 3: `[1, [2, 3], 4]` with `scores = [1, 1, 0, 1]`

Q2 and Q3 are parallel. Q2 is correct, so the parallel group passes.

- Loop: idx=0 (val=1) → lci=0; idx=1 (val=1) → lci=1; idx=2 (val=0) → **break**
- Patch checks `if last_correct_idx == 0 and scores[2]`, but lci is 1, not 0
- Result: `score_map[2] = 33.33`

Q2 already shows that the parallel group passes, and Q4 is also correct, so we'd expect Q1 to earn 1/12, Q2 3/12, and Q4 5/12, giving **75.0** instead of 33.33. Should this be scored higher?
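To show where the expected numbers above come from, here is a rough sketch of the scoring as we understood it. This is not the script's logic, and `expected_score` is a hypothetical name. The weights are assumptions inferred from the fractions in the examples: level k (0-based) carries weight 2k + 1, every question in a parallel group carries its level's weight, and the denominator is the sum of all question weights.

```python
def expected_score(structure, scores):
    """Score a chain in which a parallel group passes if any member is correct.

    Hypothetical sketch of the behavior we expected, with assumed weights:
      - level k (0-based) has weight 2*k + 1, i.e. 1, 3, 5, ...
      - each question in a parallel group carries its level's weight
      - the denominator is the sum of all question weights
    """
    earned = 0
    total = 0
    chain_alive = True
    for level, item in enumerate(structure):
        # A bare int is a single question; a list is a parallel group.
        questions = item if isinstance(item, list) else [item]
        weight = 2 * level + 1
        total += weight * len(questions)
        # Question numbers are 1-based; scores is a 0-based list.
        correct = [q for q in questions if scores[q - 1]]
        if chain_alive and correct:
            earned += weight * len(correct)
        else:
            # A level with no correct answer ends the chain for all later levels.
            chain_alive = False
    return round(100 * earned / total, 2)


print(expected_score([[1, 2], 3, 4], [1, 0, 1, 1]))  # Example 1
print(expected_score([[1, 2], 3, 4], [0, 1, 1, 1]))  # Example 2
print(expected_score([1, [2, 3], 4], [1, 1, 0, 1]))  # Example 3
```

With these assumed weights the three calls give 90.0, 90.0, and 75.0, the values we expected in the examples.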



Would appreciate any clarification on whether this is the intended behavior or if we're misunderstanding the scoring design. Thanks!


