Fix table-to-text rescue not wired into text matching (#238) #241
Open
Zeng-Weijun wants to merge 1 commit into
…talab#238)

When the ground truth contains no table but a model outputs that region as a table, the end-to-end matcher extracts it as html_table/md2html_table and then drops it, leaving the corresponding GT text rows unmatched (Edit_dist = 1.0, empty pred_category_type). The technical report states such tables should be linearized back to plain text and matched in the same pipeline; the helper split_pred_table_to_text_items already existed but was never wired into process_get_matched_elements.

This connects the rescue path:
- GT-no-table pages: linearize predicted tables to text and add them to the text-matching pool (pred_dataset_mix).
- GT-has-table pages: also feed the leftover unmatched predicted tables (previously returned as unmatch_table_pred but unused) into the text pool.

Table TEDS and formula metrics are unaffected; only text_block / reading_order matching gains the previously-dropped content.
Summary
Fixes #238.
When a page's ground truth contains no table but the model outputs that region as a table, the end-to-end matcher extracts the table (html_table/md2html_table/latex_table) and then drops it. The corresponding GT text rows are left unmatched (Edit_dist = 1.0, empty pred_category_type), so identical content is scored very differently depending only on output format.

The technical report states that such mis-recognized tables should be linearized back to plain text and matched in the same pipeline. The helper split_pred_table_to_text_items (and table_to_text_lines) already implements the linearization, but it was never wired into process_get_matched_elements: the unmatch_table_pred it returns was received and then unused, and the GT-no-table case never reached it at all.

Fix
In process_get_matched_elements (src/dataset/end2end_dataset.py):
- GT-no-table pages: linearize predicted tables via split_pred_table_to_text_items and add them to the text-matching pool (pred_dataset_mix).
- GT-has-table pages: also feed the leftover unmatched predicted tables (the unmatch_table_pred that was previously dropped) into the text pool.

9 added lines, 1 file, no deletions.
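The rescue idea described above can be illustrated with a small standalone sketch. It is not the repository's code: linearize_html_table and rescue_tables_into_text_pool are hypothetical names, the dict layout of the pool items is assumed, and only the table_to_text category label is taken from this PR. Each table row becomes one plain-text line, and those lines are appended to the pool that the text matcher sees.

```python
from html.parser import HTMLParser


class _RowCollector(HTMLParser):
    """Collects the cell text of every <tr> in an HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def _flush_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _flush_row(self):
        self._flush_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._flush_cell()  # tolerate a missing </td> or </th>
            if self._row is not None:
                self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_row()


def linearize_html_table(html):
    """Hypothetical helper: one text line per non-empty table row."""
    parser = _RowCollector()
    parser.feed(html)
    parser.close()
    return [" ".join(c for c in row if c) for row in parser.rows if any(row)]


def rescue_tables_into_text_pool(text_pool, leftover_tables):
    """Return a new pool with each leftover table appended as text items.

    The item schema (category_type / content keys) is an assumption made
    for this sketch, not the benchmark's actual data structure.
    """
    rescued = list(text_pool)
    for table in leftover_tables:
        for line in linearize_html_table(table["content"]):
            rescued.append({"category_type": "table_to_text", "content": line})
    return rescued


pool = rescue_tables_into_text_pool(
    [{"category_type": "text_block", "content": "Intro paragraph"}],
    [{"category_type": "html_table",
      "content": "<table><tr><th>Name</th><th>Qty</th></tr>"
                 "<tr><td>Apple</td><td>3</td></tr></table>"}],
)
contents = [item["content"] for item in pool]
```

Returning a new list instead of mutating the input keeps the table pool and the text pool independent, which fits the claim that table-level metrics are left untouched.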
Verification
Minimal repro from the issue (identical content, text vs markdown-table prediction, GT is plain text):
After the fix, the rescued rows match with pred_category_type = table_to_text, confirming the previously-dead path is exercised.

Table TEDS and formula (CDM) metrics are unaffected (verified unchanged on GT-has-table pages); only text_block/reading_order matching gains the previously-dropped content.
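To see why a dropped prediction shows up as Edit_dist = 1.0, here is a minimal sketch of a normalized Levenshtein distance. It assumes the distance is normalized by the longer string's length, which may differ from the benchmark's exact formula; the strings are made up for illustration and are not results from this PR.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_edit_distance(gt, pred):
    """Assumed normalization: distance divided by the longer length."""
    longest = max(len(gt), len(pred))
    return 0.0 if longest == 0 else levenshtein(gt, pred) / longest


gt_row = "Apple 3"
dropped = normalized_edit_distance(gt_row, "")         # prediction was discarded
rescued = normalized_edit_distance(gt_row, "Apple 3")  # linearized row reaches the matcher
```

Under this assumption, a GT row paired with an empty prediction scores 1.0, and the same row scores 0.0 once the linearized table text is available to match.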