Add structured invoice metadata extraction example by vhung13 · Pull Request #47 · tommyGPT2S/DocEX

vhung13 · 2026-06-24T21:36:27Z

Summary

This PR adds a new example demonstrating how to compose DocEX's generic PDF processing pipeline with lightweight business-specific parsing to derive structured invoice metadata from extracted document text.

The example processes the sample invoice PDF and extracts fields including:

Invoice number
Purchase order number
Invoice date
Total amount

Rather than stopping at raw text extraction, the example shows one approach for transforming unstructured document text into typed business data that downstream workflows, such as semantic search or agent orchestration, can consume.
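
For readers skimming the PR, the downstream parsing step could look roughly like the sketch below. It is not the code in examples/invoice_metadata_extraction.py: the `InvoiceMetadata` dataclass, the `parse_invoice_metadata` function, the label patterns, and the ISO date format are all assumptions made for illustration, and real invoices will need patterns tuned to their layout.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal
from typing import Optional


@dataclass
class InvoiceMetadata:
    """Typed container for the four fields listed above (hypothetical name)."""
    invoice_number: Optional[str] = None
    po_number: Optional[str] = None
    invoice_date: Optional[date] = None
    total_amount: Optional[Decimal] = None


# Assumed label wording; these patterns are illustrative, not DocEX's.
_PATTERNS = {
    "invoice_number": re.compile(
        r"\bInvoice\s*(?:No\.?|Number|#)\s*[:#]?\s*([A-Z0-9-]+)", re.I),
    "po_number": re.compile(
        r"\b(?:PO|Purchase\s+Order)\s*(?:No\.?|Number|#)?\s*[:#]?\s*([A-Z0-9-]+)", re.I),
    "invoice_date": re.compile(
        r"\bInvoice\s+Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.I),
    "total_amount": re.compile(
        r"\bTotal(?:\s+Amount)?(?:\s+Due)?\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.I),
}


def parse_invoice_metadata(text: str) -> InvoiceMetadata:
    """Turn extracted PDF text into typed fields; missing fields stay None."""
    found = {}
    for field, pattern in _PATTERNS.items():
        match = pattern.search(text)
        found[field] = match.group(1) if match else None

    raw_date = found["invoice_date"]
    raw_total = found["total_amount"]
    return InvoiceMetadata(
        invoice_number=found["invoice_number"],
        po_number=found["po_number"],
        # Assumes ISO dates (YYYY-MM-DD) in the extracted text.
        invoice_date=datetime.strptime(raw_date, "%Y-%m-%d").date() if raw_date else None,
        # Strip thousands separators before converting to Decimal.
        total_amount=Decimal(raw_total.replace(",", "")) if raw_total else None,
    )


if __name__ == "__main__":
    sample = (
        "ACME Corp\n"
        "Invoice Number: INV-1001\n"
        "PO Number: PO-7788\n"
        "Invoice Date: 2026-01-15\n"
        "Subtotal: $1,000.00\n"
        "Total Amount: $1,234.50\n"
    )
    print(parse_invoice_metadata(sample))
```

Keeping the parser as a plain function over text means it can run after any extractor, which matches the PR's goal of leaving the PDF processor generic.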

Why a new example?

I chose to add a separate example instead of modifying pdf_invoice_to_purchase_order.py to keep the existing example focused on its original purpose while also demonstrating a different workflow.

The existing examples already illustrate document ingestion and PDF text extraction. This example builds on that by showing how we can interpret the extracted text and produce structured business metadata, while keeping the underlying PDF processor generic and reusable.

Additional improvements

Updated the custom processor example to match the current processor factory rule signature (pdf_rule(document, db=None)).
Updated the custom processor README to use the correct function signature and execution command.
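
As a rough illustration of the updated rule shape, the sketch below uses the `pdf_rule(document, db=None)` signature named above. Everything else is an assumption: `SketchPDFProcessor` is a stand-in class defined here, the `name` attribute on the document is hypothetical, and what the factory expects a rule to return is not confirmed by this PR description.

```python
from types import SimpleNamespace
from typing import Any, Optional, Type


class SketchPDFProcessor:
    """Stand-in for a custom PDF processor class; not part of DocEX."""


def pdf_rule(document: Any, db: Optional[Any] = None) -> Optional[Type[SketchPDFProcessor]]:
    """Mapping rule that accepts the optional database parameter.

    Assumption: the document exposes a `name` attribute with the file name,
    and the rule signals a match by returning a processor class.
    """
    name = getattr(document, "name", "") or ""
    if name.lower().endswith(".pdf"):
        return SketchPDFProcessor
    # No match: let other rules (if any) handle the document.
    return None


if __name__ == "__main__":
    # The rule works whether or not the caller supplies `db`.
    pdf_doc = SimpleNamespace(name="invoice.pdf")
    print(pdf_rule(pdf_doc) is SketchPDFProcessor)
    print(pdf_rule(pdf_doc, db=object()) is SketchPDFProcessor)
    print(pdf_rule(SimpleNamespace(name="notes.txt")))
```

Defaulting `db` to `None` keeps the rule callable with a single argument, so older call sites that only pass the document continue to work.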

Validation

Verified both the new and existing examples run successfully:

python examples/invoice_metadata_extraction.py
python examples/custom_processors/run_custom_pdf_processor.py

Verified the new example passes syntax and formatting checks:

python -m py_compile examples/invoice_metadata_extraction.py
python -m tabnanny examples/invoice_metadata_extraction.py

Updates the custom PDF processor example to match the current processor factory API, which now passes an optional database parameter into mapping rules. Also ignores local virtual environments and generated storage artifacts to keep development-specific files out of version control.

Adds a new example demonstrating how to compose DocEX's generic PDF processing pipeline with lightweight business-specific parsing. Unlike pdf_invoice_to_purchase_order.py, which demonstrates document processing and metadata lookup, this example focuses on transforming extracted PDF text into structured invoice metadata including invoice number, PO number, invoice date, and total amount. The PDF processor remains generic while invoice-specific parsing is implemented as a separate downstream step.

Simplifies the invoice metadata example by removing the unused PO lookup workflow inherited from the original example so it focuses exclusively on structured invoice metadata extraction. Also removes duplicate imports, improves documentation, and updates the custom processor README to reflect the current API and correct execution command.

vhung13 added 3 commits June 24, 2026 16:48

vhung13 force-pushed the feature/invoice-metadata-extraction branch from 9b16a5a to 7074a1d Compare June 24, 2026 21:55

vhung13 wants to merge 3 commits into tommyGPT2S:main from vhung13:feature/invoice-metadata-extraction
