Add structured invoice metadata extraction example #47

Open
vhung13 wants to merge 3 commits into tommyGPT2S:main from vhung13:feature/invoice-metadata-extraction
@vhung13 vhung13 commented Jun 24, 2026

Summary

This PR adds a new example demonstrating how to compose DocEX's generic PDF processing pipeline with lightweight business-specific parsing to derive structured invoice metadata from extracted document text.

The example processes the sample invoice PDF and extracts fields including:

  • Invoice number
  • Purchase order number
  • Invoice date
  • Total amount

Rather than simply extracting raw text, the example shows one approach for transforming unstructured document text into typed business data that downstream workflows, such as semantic search or agent orchestration, can consume.
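
For readers skimming the PR, here is a minimal sketch of the kind of downstream parsing step described above. It is not the code in examples/invoice_metadata_extraction.py: the `InvoiceMetadata` dataclass, the `parse_invoice_metadata` function, the label patterns, and the ISO date format are all assumptions made for illustration, and real invoices will need patterns tuned to their layout.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal
from typing import Optional


@dataclass
class InvoiceMetadata:
    """Typed container for the four fields listed above (hypothetical name)."""
    invoice_number: Optional[str] = None
    po_number: Optional[str] = None
    invoice_date: Optional[date] = None
    total_amount: Optional[Decimal] = None


# Assumed label formats; adjust to match the text the PDF processor emits.
_PATTERNS = {
    "invoice_number": re.compile(
        r"\bInvoice\s*(?:Number|No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.I),
    "po_number": re.compile(
        r"\b(?:Purchase\s+Order|PO)\s*(?:Number|No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.I),
    "invoice_date": re.compile(
        r"\bInvoice\s+Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.I),
    "total_amount": re.compile(
        r"\bTotal(?:\s+Amount)?\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.I),
}


def parse_invoice_metadata(text: str) -> InvoiceMetadata:
    """Parse already-extracted PDF text into typed fields; missing fields stay None."""

    def first(name: str) -> Optional[str]:
        match = _PATTERNS[name].search(text)
        return match.group(1) if match else None

    raw_date = first("invoice_date")
    raw_total = first("total_amount")
    return InvoiceMetadata(
        invoice_number=first("invoice_number"),
        po_number=first("po_number"),
        # Assumes ISO dates (YYYY-MM-DD); other formats need their own parsing.
        invoice_date=datetime.strptime(raw_date, "%Y-%m-%d").date() if raw_date else None,
        # Decimal avoids float rounding on currency values.
        total_amount=Decimal(raw_total.replace(",", "")) if raw_total else None,
    )
```

Keeping the parser a plain function over text, with no dependency on the PDF processor, mirrors the separation this PR aims for: the processor stays generic and the invoice-specific logic runs as a separate step.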

Why a new example?

I chose to add a separate example instead of modifying pdf_invoice_to_purchase_order.py to keep the existing example focused on its original purpose while also demonstrating a different workflow.

The existing examples already illustrate document ingestion and PDF text extraction. This example builds on that by showing how we can interpret the extracted text and produce structured business metadata, while keeping the underlying PDF processor generic and reusable.

Additional improvements

  • Updated the custom processor example to match the current processor factory rule signature (pdf_rule(document, db=None)).
  • Updated the custom processor README to use the correct function signature and execution command.
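
As a rough illustration of the updated signature, a mapping rule that accepts the optional `db` argument might look like the sketch below. Only the `pdf_rule(document, db=None)` signature comes from this PR; the `PDFTextProcessor` and `FakeDocument` stand-ins, the `name` attribute, and the "return a processor class or None" contract are assumptions, so check the processor factory in the repository for the actual behaviour.

```python
from typing import Any, Optional, Type


class PDFTextProcessor:
    """Hypothetical stand-in for a custom PDF processor class."""


class FakeDocument:
    """Hypothetical stand-in for a document object that exposes a file name."""

    def __init__(self, name: str) -> None:
        self.name = name


def pdf_rule(document: Any, db: Optional[Any] = None) -> Optional[Type[PDFTextProcessor]]:
    """Pick the PDF processor for .pdf documents, otherwise defer to other rules.

    `db` is accepted because the factory now passes it to mapping rules;
    this sketch does not use it.
    """
    name = getattr(document, "name", "") or ""
    if name.lower().endswith(".pdf"):
        return PDFTextProcessor
    return None
```

Giving `db` a default of `None` keeps the rule callable both with and without the database argument.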

Validation

Verified both the new and existing examples run successfully:

  • python examples/invoice_metadata_extraction.py
  • python examples/custom_processors/run_custom_pdf_processor.py

Verified the new example passes syntax and indentation checks:

  • python -m py_compile examples/invoice_metadata_extraction.py
  • python -m tabnanny examples/invoice_metadata_extraction.py

vhung13 added 3 commits June 24, 2026 16:48

  • Updates the custom PDF processor example to match the current processor factory API, which now passes an optional database parameter into mapping rules. Also ignores local virtual environments and generated storage artifacts to keep development-specific files out of version control.
  • Adds a new example demonstrating how to compose DocEX's generic PDF processing pipeline with lightweight business-specific parsing. Unlike pdf_invoice_to_purchase_order.py, which demonstrates document processing and metadata lookup, this example focuses on transforming extracted PDF text into structured invoice metadata including invoice number, PO number, invoice date, and total amount. The PDF processor remains generic while invoice-specific parsing is implemented as a separate downstream step.
  • Simplifies the invoice metadata example by removing the unused PO lookup workflow inherited from the original example so it focuses exclusively on structured invoice metadata extraction. Also removes duplicate imports, improves documentation, and updates the custom processor README to reflect the current API and correct execution command.

vhung13 force-pushed the feature/invoice-metadata-extraction branch from 9b16a5a to 7074a1d on June 24, 2026 21:55