Fix table-to-text rescue not wired into text matching (#238) by Zeng-Weijun · Pull Request #241 · opendatalab/OmniDocBench

Zeng-Weijun · 2026-06-12T18:10:50Z

Summary

Fixes #238.

When a page's ground truth contains no table but the model outputs that region as a table, the end-to-end matcher extracts the table (html_table / md2html_table / latex_table) and then drops it. The corresponding GT text rows are left unmatched (Edit_dist = 1.0, empty pred_category_type), so identical content is scored very differently depending only on output format.

The technical report states that such mis-recognized tables should be linearized back to plain text and matched in the same pipeline. The helper split_pred_table_to_text_items (and table_to_text_lines) already implements the linearization, but it was never wired into process_get_matched_elements: the unmatch_table_pred it returns was received but never used, and the GT-no-table case never reached it at all.
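
For readers unfamiliar with the idea, linearization can be sketched as below. This is an illustration only, not the repository's table_to_text_lines: the names `linearize_html_table` and `_TableRowCollector` are hypothetical, and the choice of one text line per table row with cells joined by a single space is an assumption.

```python
from html.parser import HTMLParser


class _TableRowCollector(HTMLParser):
    """Collect the cell text of each <tr> in an HTML table (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, or None
        self._cell = None  # text fragments of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so cell text compares cleanly.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def linearize_html_table(html):
    """Return one plain-text line per non-empty table row (assumed format)."""
    parser = _TableRowCollector()
    parser.feed(html)
    parser.close()
    lines = []
    for row in parser.rows:
        cells = [cell for cell in row if cell]
        if cells:
            lines.append(" ".join(cells))
    return lines
```

Each resulting line can then be treated like any other predicted text item by the text matcher.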

Fix

In process_get_matched_elements (src/dataset/end2end_dataset.py):

- GT-no-table pages: linearize predicted tables via split_pred_table_to_text_items and add them to the text-matching pool (pred_dataset_mix).
- GT-has-table pages: also feed the leftover unmatched predicted tables (the unmatch_table_pred that was previously dropped) into the text pool.
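
The two cases above can be sketched roughly as follows. This is not the actual diff: `build_text_pool`, its parameters, and the item dictionaries are hypothetical stand-ins, and `linearize` stands for whatever callable turns one predicted table into text lines. Only the table_to_text category label is taken from this PR.

```python
def build_text_pool(pred_text_items, pred_tables, unmatched_pred_tables,
                    gt_has_table, linearize):
    """Return the predicted items that take part in text matching.

    Hypothetical sketch: all names and the item layout are assumptions.
    """
    pool = list(pred_text_items)  # copy so the caller's list is not mutated

    # GT has tables: only tables left over after table matching are rescued.
    # GT has no table: every predicted table is rescued.
    tables_to_rescue = unmatched_pred_tables if gt_has_table else pred_tables

    for table in tables_to_rescue:
        for line in linearize(table):
            pool.append({"category_type": "table_to_text", "content": line})
    return pool
```

Keeping the rescued rows under a distinct category makes it possible to tell afterwards which matches came from linearized tables.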

9 added lines, 1 file, no deletions.

Verification

Minimal repro from the issue (identical content, text vs markdown-table prediction, GT is plain text):

| Prediction format | text_block Edit_dist (before) | text_block Edit_dist (after) |
| --- | --- | --- |
| plain text | 0.000 | 0.000 |
| markdown table | 1.000 | 0.000 |
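
The 1.000 and 0.000 values are what a normalized edit distance gives when a GT row is compared against nothing versus against identical text. A minimal sketch, assuming normalization by the longer string's length (the benchmark's exact normalization may differ, and both function names are illustrative):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def normalized_edit_distance(gt, pred):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(gt), len(pred))
    return levenshtein(gt, pred) / longest if longest else 0.0


gt_row = "Alice 30"
dropped = normalized_edit_distance(gt_row, "")          # table dropped, nothing to match
rescued = normalized_edit_distance(gt_row, "Alice 30")  # table row linearized to text
```

Here `dropped` corresponds to the unmatched GT row before the fix and `rescued` to the same row once the linearized table text is available for matching.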

After the fix, the rescued rows match with pred_category_type = table_to_text, confirming the previously-dead path is exercised.

Table TEDS and formula (CDM) metrics are unaffected — verified unchanged on GT-has-table pages; only text_block / reading_order matching gains the previously-dropped content.

…talab#238)

When the ground truth contains no table but a model outputs that region as a table, the end-to-end matcher extracts it as html_table/md2html_table and then drops it, leaving the corresponding GT text rows unmatched (Edit_dist = 1.0, empty pred_category_type). The technical report states such tables should be linearized back to plain text and matched in the same pipeline; the helper split_pred_table_to_text_items already existed but was never wired into process_get_matched_elements.

This connects the rescue path:

- GT-no-table pages: linearize predicted tables to text and add them to the text-matching pool (pred_dataset_mix).
- GT-has-table pages: also feed the leftover unmatched predicted tables (previously returned as unmatch_table_pred but unused) into the text pool.

Table TEDS and formula metrics are unaffected; only text_block / reading_order matching gains the previously-dropped content.

Zeng-Weijun wants to merge 1 commit into opendatalab:main from Zeng-Weijun:fix/issue-238-table-to-text-rescue
