Fix table-to-text rescue not wired into text matching (#238) #241

Open
Zeng-Weijun wants to merge 1 commit into opendatalab:main from Zeng-Weijun:fix/issue-238-table-to-text-rescue

Conversation

@Zeng-Weijun

Contributor

Summary

Fixes #238.

When a page's ground truth contains no table but the model outputs that region as a table, the end-to-end matcher extracts the table (html_table / md2html_table / latex_table) and then drops it. The corresponding GT text rows are left unmatched (Edit_dist = 1.0, empty pred_category_type), so identical content is scored very differently depending only on output format.

The technical report states that such mis-recognized tables should be linearized back to plain text and matched in the same pipeline. The helper split_pred_table_to_text_items (and table_to_text_lines) already implements the linearization, but it was never wired into process_get_matched_elements — the unmatch_table_pred it returns was received and then unused, and the GT-no-table case never reached it at all.

Fix

In process_get_matched_elements (src/dataset/end2end_dataset.py):

  • GT-no-table pages: linearize predicted tables via split_pred_table_to_text_items and add them to the text-matching pool (pred_dataset_mix).
  • GT-has-table pages: also feed the leftover unmatched predicted tables (the unmatch_table_pred that was previously dropped) into the text pool.

9 added lines, 1 file, no deletions.
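
The wiring described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `linearize_markdown_table` is a hypothetical stand-in for split_pred_table_to_text_items (whose real signature is not shown here), `build_text_pool` is a hypothetical name, and the dict layout of pool items is an assumption made for illustration only.

```python
import re


def linearize_markdown_table(md_table: str) -> list[str]:
    """Hypothetical stand-in for split_pred_table_to_text_items.

    Turns a markdown table into one plain-text line per row.
    """
    lines = []
    for row in md_table.strip().splitlines():
        cells = [c.strip() for c in row.strip().strip("|").split("|")]
        # Skip the header separator row, e.g. |---|:---:|
        if all(re.fullmatch(r":?-+:?", c) for c in cells):
            continue
        text = " ".join(c for c in cells if c)
        if text:
            lines.append(text)
    return lines


def build_text_pool(pred_text_items, pred_tables, unmatched_table_preds, gt_has_table):
    """Assumed shape of the text-matching pool (pred_dataset_mix in the PR).

    - GT has no table: every predicted table is linearized and added.
    - GT has tables: only the leftover unmatched predicted tables are added.
    """
    pool = list(pred_text_items)
    rescued = unmatched_table_preds if gt_has_table else pred_tables
    for table in rescued:
        for line in linearize_markdown_table(table):
            pool.append({"category_type": "table_to_text", "content": line})
    return pool


if __name__ == "__main__":
    table = "| Name | Value |\n|---|---|\n| a | 1 |"
    for item in build_text_pool([], [table], [], gt_has_table=False):
        print(item)
```

Tagging rescued rows with a distinct category (table_to_text here, matching the value reported under Verification) keeps them distinguishable from native text predictions in the match results.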

Verification

Minimal repro from the issue (identical content, text vs markdown-table prediction, GT is plain text):

| Prediction format | text_block Edit_dist (before) | text_block Edit_dist (after) |
|---|---|---|
| plain text | 0.000 | 0.000 |
| markdown table | 1.000 | 0.000 |

After the fix, the rescued rows match with pred_category_type = table_to_text, confirming the previously-dead path is exercised.
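
To see why a dropped table yields exactly 1.000 and a rescued one can reach 0.000, here is a toy scoring sketch. It assumes a Levenshtein distance normalized by the longer string and assumes that a GT line with no candidate is scored as 1.0; the benchmark's actual normalization and matching strategy may differ, and `best_score` is a hypothetical helper.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 if equal)
            ))
        prev = cur
    return prev[-1]


def normalized_edit_dist(gt: str, pred: str) -> float:
    """Assumed normalization: distance divided by the longer length."""
    if not gt and not pred:
        return 0.0
    return edit_distance(gt, pred) / max(len(gt), len(pred))


def best_score(gt_line: str, pool: list[str]) -> float:
    """Hypothetical helper: best match in the pool, or 1.0 if the pool is empty."""
    return min((normalized_edit_dist(gt_line, p) for p in pool), default=1.0)


if __name__ == "__main__":
    gt_line = "a 1"
    print(best_score(gt_line, []))       # table dropped: no candidate
    print(best_score(gt_line, ["a 1"]))  # table linearized into the pool
```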

Table TEDS and formula (CDM) metrics are unaffected — verified unchanged on GT-has-table pages; only text_block / reading_order matching gains the previously-dropped content.

…talab#238)

When the ground truth contains no table but a model outputs that region as
a table, the end-to-end matcher extracts it as html_table/md2html_table and
then drops it, leaving the corresponding GT text rows unmatched
(Edit_dist = 1.0, empty pred_category_type). The technical report states such
tables should be linearized back to plain text and matched in the same
pipeline; the helper split_pred_table_to_text_items already existed but was
never wired into process_get_matched_elements.

This connects the rescue path:
- GT-no-table pages: linearize predicted tables to text and add them to the
  text-matching pool (pred_dataset_mix).
- GT-has-table pages: also feed the leftover unmatched predicted tables
  (previously returned as unmatch_table_pred but unused) into the text pool.

Table TEDS and formula metrics are unaffected; only text_block / reading_order
matching gains the previously-dropped content.

Development

Successfully merging this pull request may close these issues.

About the matching-bias fix mentioned in the v1.6 report
