I'm exploring the use of a Vision-Language Model (VLM) as a layout detector to handle complex and uncommon document structures. In particular, we want to improve performance on documents like Bills of Lading that often include irregular layouts.
Any recommended approaches, prompt patterns, or references would be greatly appreciated. Thank you!
Example:
example
only the part of customer order is table, and other blocks are text