Thanks for your great work!
While comparing the subtitle interleaving logic between the two implementations, I noticed they behave differently:
- The two evaluation scripts implement subtitle interleaving differently.
| | vlmeval/dataset/videommev2.py (VLMEvalKit) | test_video_mme_v2.py (standalone) |
|---|---|---|
| Matching granularity | Word-level (subtitle_between_timestamps) | Sentence-level (group_subtitle_segments → segments_between_timestamps) |
| Timestamp label | Frame window timestamps | Segment's own timestamps |
| Output per frame | One merged text chunk | Multiple independent segment entries |
- The Duplication Problem in test_video_mme_v2.py
segments_between_timestamps uses overlap matching:
if seg['end_time'] >= start_time and seg['start_time'] < end_time:
A sentence-level segment spanning multiple frame windows gets matched — and fully repeated — under every overlapping frame. For example, a 2.5s subtitle at fps=2 appears 5 times:
[Subtitle 179.74s - 182.22s]: doing actual practical time traveling.
Frame-309: <image>
[Subtitle 179.74s - 182.22s]: doing actual practical time traveling.
Frame-310: <image>
[Subtitle 179.74s - 182.22s]: doing actual practical time traveling.
Frame-311: <image>
[Subtitle 179.74s - 182.22s]: doing actual practical time traveling.
Frame-312: <image>
[Subtitle 179.74s - 182.22s]: doing actual practical time traveling.
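To make the effect easy to reproduce outside the repo, here is a minimal sketch of the overlap test quoted above, run on toy data. The helper name `segments_overlapping`, the segment values, and the window layout (consecutive 0.5 s windows) are my own assumptions for illustration, not the script's actual sampling logic:

```python
def segments_overlapping(segments, start_time, end_time):
    """Same overlap condition as quoted above: a segment matches every window it touches."""
    return [
        seg for seg in segments
        if seg["end_time"] >= start_time and seg["start_time"] < end_time
    ]

# Hypothetical data: one 2.0 s segment, frame windows 0.5 s wide (fps=2).
segments = [{"start_time": 10.1, "end_time": 12.1, "text": "hello world"}]
windows = [(i * 0.5, (i + 1) * 0.5) for i in range(18, 28)]  # covers 9.0 s .. 14.0 s

# Every window that overlaps the segment receives the full segment text.
hits = [w for w in windows if segments_overlapping(segments, *w)]
print(len(hits), hits[0], hits[-1])
```

With this layout the single segment is matched by five consecutive windows, from (10.0, 10.5) through (12.0, 12.5), so its full text would be emitted five times.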
Questions
- Which script's interleaving behavior is the intended one — word-level (videomme_v2.py) or sentence-level (test_video_mme_v2.py)?
- Is the sentence-level duplication in test_video_mme_v2.py expected behavior, or should it be deduplicated (e.g., only emit a subtitle in the frame where its start_time falls)?
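If deduplication is the intended behavior, one possible shape for the "emit only where start_time falls" option is sketched below. This is not a patch against the actual script: `interleave_dedup`, the frame timestamps, the 0-based frame labels, and the choice to place the subtitle before its frame are all assumptions on my side. Window i is taken as [frame_times[i], frame_times[i+1]), with the first and last windows left open-ended so no segment is dropped:

```python
import math

def interleave_dedup(segments, frame_times):
    """Emit each segment exactly once, in the frame window containing its start_time."""
    lines = []
    for i, t in enumerate(frame_times):
        # Assumed window: [t_i, t_{i+1}); first/last windows extend to -inf/+inf.
        lo = -math.inf if i == 0 else t
        hi = frame_times[i + 1] if i + 1 < len(frame_times) else math.inf
        for seg in segments:
            if lo <= seg["start_time"] < hi:
                lines.append(
                    f"[Subtitle {seg['start_time']:.2f}s - {seg['end_time']:.2f}s]: {seg['text']}"
                )
        lines.append(f"Frame-{i}: <image>")
    return lines

# Segment taken from the example above; frame timestamps are hypothetical.
segments = [{"start_time": 179.74, "end_time": 182.22,
             "text": "doing actual practical time traveling."}]
frame_times = [179.5, 180.0, 180.5, 181.0, 181.5, 182.0]

out = interleave_dedup(segments, frame_times)
print("\n".join(out))
```

The trade-off is that frames later in the segment's span no longer carry the subtitle text, which may or may not matter depending on how the model is expected to align speech with frames.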