Add structured invoice metadata extraction example #47
Open
vhung13 wants to merge 3 commits into
Updates the custom PDF processor example to match the current processor factory API, which now passes an optional database parameter into mapping rules. Also ignores local virtual environments and generated storage artifacts to keep development-specific files out of version control.
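For reviewers unfamiliar with the change, a mapping rule that tolerates the new optional argument might look roughly like the sketch below. This is only an illustration: `SketchPDFProcessor`, the `name` attribute on the document, and the convention of returning a processor class (or `None`) are assumptions made for the sketch, not a statement of DocEX's actual API. Only the `pdf_rule(document, db=None)` signature comes from this PR.

```python
from typing import Any, Optional, Type


class SketchPDFProcessor:
    """Hypothetical stand-in for a custom PDF processor class."""


def pdf_rule(document: Any, db: Optional[Any] = None) -> Optional[Type[SketchPDFProcessor]]:
    """Select the PDF processor for documents whose name ends in .pdf.

    `db` is accepted so a factory can pass it positionally or by keyword;
    this particular rule does not need it and ignores it.
    """
    # Assumption: the document exposes its file name as `name`.
    name = getattr(document, "name", "") or ""
    if name.lower().endswith(".pdf"):
        return SketchPDFProcessor
    # Returning None signals "this rule does not apply" in this sketch.
    return None
```

Giving `db` a default of `None` keeps the rule callable from older code paths that still pass only the document.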
Adds a new example demonstrating how to compose DocEX's generic PDF processing pipeline with lightweight business-specific parsing. Unlike pdf_invoice_to_purchase_order.py, which demonstrates document processing and metadata lookup, this example focuses on transforming extracted PDF text into structured invoice metadata including invoice number, PO number, invoice date, and total amount. The PDF processor remains generic while invoice-specific parsing is implemented as a separate downstream step.
Simplifies the invoice metadata example by removing the unused PO lookup workflow inherited from the original example so it focuses exclusively on structured invoice metadata extraction. Also removes duplicate imports, improves documentation, and updates the custom processor README to reflect the current API and correct execution command.
9b16a5a to 7074a1d
Summary
This PR adds a new example demonstrating how to extend DocEX's generic PDF processing pipeline with lightweight business-specific parsing that derives structured invoice metadata from extracted document text.
The example processes the sample invoice PDF and extracts fields including invoice number, PO number, invoice date, and total amount.
Rather than simply extracting raw text, the example shows one approach for transforming unstructured document text into typed business data that downstream workflows such as semantic search or agent orchestration can consume.
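To make the "raw text to typed data" step concrete, here is a minimal sketch of the kind of downstream parsing described above. It is not the code in `invoice_metadata_extraction.py`: the `InvoiceMetadata` dataclass, the `parse_invoice_metadata` function, the label patterns, and the ISO date format are all assumptions chosen for illustration, and a real invoice layout would likely need different patterns.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal
from typing import Optional, Pattern


@dataclass
class InvoiceMetadata:
    """Typed view of the four fields; a field is None when not found."""
    invoice_number: Optional[str] = None
    po_number: Optional[str] = None
    invoice_date: Optional[date] = None
    total_amount: Optional[Decimal] = None


# Hypothetical label patterns; adjust to the actual invoice wording.
_INVOICE_NO = re.compile(r"\bInvoice\s*(?:Number|No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.I)
_PO_NO = re.compile(r"\b(?:PO|Purchase\s+Order)\s*(?:Number|No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.I)
_DATE = re.compile(r"\bInvoice\s+Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.I)
# \b keeps "Subtotal" from being mistaken for "Total".
_TOTAL = re.compile(r"\bTotal(?:\s+Amount)?\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.I)


def _first(pattern: Pattern[str], text: str) -> Optional[str]:
    """Return the first captured group of the first match, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


def parse_invoice_metadata(text: str) -> InvoiceMetadata:
    """Parse plain text (already extracted from a PDF) into typed fields."""
    raw_date = _first(_DATE, text)
    raw_total = _first(_TOTAL, text)
    return InvoiceMetadata(
        invoice_number=_first(_INVOICE_NO, text),
        po_number=_first(_PO_NO, text),
        # Assumes ISO dates (YYYY-MM-DD); strptime raises on invalid dates.
        invoice_date=datetime.strptime(raw_date, "%Y-%m-%d").date() if raw_date else None,
        # Decimal avoids float rounding on currency; strip thousands separators.
        total_amount=Decimal(raw_total.replace(",", "")) if raw_total else None,
    )
```

Because the parser takes a plain string, it stays independent of whichever processor produced the text, which is the separation between generic extraction and invoice-specific parsing that this PR aims for.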
Why a new example?
I chose to add a separate example instead of modifying pdf_invoice_to_purchase_order.py to keep the existing example focused on its original purpose while also demonstrating a different workflow.
The existing examples already illustrate document ingestion and PDF text extraction. This example builds on that by showing how we can interpret the extracted text and produce structured business metadata, while keeping the underlying PDF processor generic and reusable.
Additional improvements
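Updated the custom PDF processor example to match the current processor factory API, which passes an optional database parameter into mapping rules (pdf_rule(document, db=None)).
Validation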
Verified both the new and existing examples run successfully:
python examples/invoice_metadata_extraction.py
python examples/custom_processors/run_custom_pdf_processor.py
Verified the new example passes syntax and indentation checks:
python -m py_compile examples/invoice_metadata_extraction.py
python -m tabnanny examples/invoice_metadata_extraction.py