Commercial real estate invoice intake and reconciliation #46
Open
maccora wants to merge 1 commit into
Reads CRE invoice PDFs of arbitrary layout, extracts their fields, and reconciles them against recorded ground-truth lease actuals (base rent, CAM, real estate tax and insurance recoveries, totals, square footage, pro-rata share) so an overcharge is caught before it is paid.
Highlights:
- Cost-ordered extraction cascade: a free deterministic heuristic first, with an optional caller-provided LLM invoked only for unresolved or disputed fields. A clean invoice costs zero LLM calls.
- Self-improving learning loop: label phrasings confirmed against ground truth are recorded and promoted into the heuristic, so the cheap tier absorbs vendor formatting over time and the LLM is needed less.
- Tolerance-based reconciliation with per-charge comparison; verdicts are matched, discrepancy, incomplete, or unresolved.
- Ground-truth stores (in-memory and DocEX-backed) matched by identifier, plus optional embedding-similarity retrieval to find the closest lease when an invoice has no clean identifier. Retrieval never overrides reconciliation, so a wrong match surfaces as a discrepancy.
- DocEX BaseProcessor integration and an example Anthropic adapter for the LLM tier; the core pipeline carries no provider or database dependency.
Tested with unit coverage, randomized invoice scenarios, the learning loop end to end, stubbed and live LLM paths, and committed positive and negative realistic invoice PDFs under example_docs for non-technical review.
Summary
Adds a self-contained subsystem under docex/intake/ that reads commercial
real estate invoice PDFs of arbitrary layout, extracts their fields, and
reconciles them against recorded ground-truth lease actuals (base rent, CAM,
real estate tax and insurance recoveries, totals, square footage, pro-rata
share) so an overcharge can be caught before it is paid.
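
To make the reconciliation step concrete, here is a minimal sketch of a tolerance-based comparison. It is not the code in this PR: the function name `reconcile`, the field keys, the tolerance defaults, and the exact meaning given to each verdict are all assumptions chosen for illustration. Only the four verdict names come from the description.

```python
from decimal import Decimal
from typing import Dict, Mapping, Optional, Tuple


def reconcile(
    extracted: Mapping[str, Optional[Decimal]],
    actuals: Optional[Mapping[str, Decimal]],
    rel_tol: Decimal = Decimal("0.005"),  # assumed default: 0.5% of the expected value
    abs_tol: Decimal = Decimal("0.01"),   # assumed default: one cent
) -> Tuple[str, Dict[str, Decimal]]:
    """Compare extracted invoice fields with lease actuals (hypothetical helper).

    Returns a verdict and, for a discrepancy, the signed difference
    (invoice minus lease) of every field outside tolerance.
    """
    # Assumption: "unresolved" means no ground-truth record was found.
    if actuals is None:
        return "unresolved", {}

    missing = []
    deltas: Dict[str, Decimal] = {}
    for name, expected in actuals.items():
        got = extracted.get(name)
        if got is None:
            missing.append(name)
            continue
        allowed = max(abs_tol, abs(expected) * rel_tol)
        diff = got - expected
        if abs(diff) > allowed:
            deltas[name] = diff

    # A field outside tolerance outranks a missing field, so an
    # overcharge is reported even on a partially extracted invoice.
    if deltas:
        return "discrepancy", deltas
    # Assumption: "incomplete" means the invoice lacked a field the lease records.
    if missing:
        return "incomplete", {}
    return "matched", {}


lease = {"base_rent": Decimal("12500.00"), "cam": Decimal("1840.00")}
invoice = {"base_rent": Decimal("12500.00"), "cam": Decimal("2140.00")}
verdict, deltas = reconcile(invoice, lease)
print(verdict, deltas)  # discrepancy {'cam': Decimal('300.00')}
```

Amounts are kept as `Decimal` so that cent-level tolerances are not distorted by binary floating point.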
Design
- Cost-ordered extraction cascade: a free deterministic heuristic runs
first; an optional caller-provided LLM is invoked only for unresolved or
disputed fields. A clean invoice costs zero LLM calls. The LLM is a BYO
callable, so the core takes no hard provider dependency.
- Self-improving learning loop: label phrasings confirmed against ground
truth are recorded and promoted into the heuristic's alias set, so the cheap
tier absorbs vendor formatting over time and the LLM is needed less.
- Tolerance-based reconciliation (…, dates) with per-charge comparison by
category. Verdicts are matched, discrepancy, incomplete, or unresolved.
- Ground-truth stores (in-memory and DocEX-backed) matched by identifier
(invoice number, then PO). An optional embedding-similarity matcher provides
fuzzy retrieval to find the closest lease when an invoice has no clean
identifier; it never overrides reconciliation, so a wrong match shows up as a
discrepancy rather than being silently trusted.
- DocEX integration: InvoiceIntakeProcessor wraps the pipeline as a
BaseProcessor; the core pipeline itself carries no database dependency.

A note on a deliberate decision: an embedding-similarity tier was considered for
field extraction and left out, because the learning loop already removes its
only benefit and it carries a false-positive risk the cascade cannot cheaply
audit. The rationale is documented in docex/intake/extractors/cascade.py and
the package README.
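
The cascade and the learning loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation in docex/intake/extractors/cascade.py: the names `HeuristicExtractor`, `promote`, `extract_cascade`, and the `LLMFn` signature are hypothetical, and the heuristic here is a single regex over label aliases.

```python
import re
from typing import Callable, Dict, List, Optional, Sequence, Set, Tuple

# Assumed shape of the caller-provided LLM: (invoice text, fields to resolve) -> values.
LLMFn = Callable[[str, Sequence[str]], Dict[str, str]]

# A number such as 1,840.00 after an optional ":" or "-" and an optional "$".
_AMOUNT = r"\s*[:\-]?\s*\$?\s*(\d[\d,]*(?:\.\d+)?)"


class HeuristicExtractor:
    """Free tier: finds 'label: amount' pairs using a per-field alias set."""

    def __init__(self, aliases: Dict[str, Set[str]]) -> None:
        self.aliases = {field: set(labels) for field, labels in aliases.items()}

    def extract(self, text: str) -> Dict[str, str]:
        found: Dict[str, str] = {}
        for field, labels in self.aliases.items():
            for label in labels:
                m = re.search(re.escape(label) + _AMOUNT, text, re.IGNORECASE)
                if m:
                    found[field] = m.group(1).replace(",", "")
                    break
        return found

    def promote(self, field: str, label: str) -> None:
        """Learning loop: record a label phrasing once it is confirmed against ground truth."""
        self.aliases.setdefault(field, set()).add(label)


def extract_cascade(
    text: str,
    heuristic: HeuristicExtractor,
    fields: Sequence[str],
    llm: Optional[LLMFn] = None,
) -> Tuple[Dict[str, str], bool]:
    """Run the heuristic first; call the LLM only for fields it left unresolved.

    Returns the extracted values and whether the LLM was called.
    """
    values = heuristic.extract(text)
    unresolved: List[str] = [f for f in fields if f not in values]
    if not unresolved or llm is None:
        return values, False
    # Only the unresolved fields are sent to the paid tier.
    values.update(llm(text, unresolved))
    return values, True
```

In this sketch, an invoice that labels CAM as "Common Area Charges" falls through to the LLM the first time. Once the value has been confirmed against the lease record, the caller would invoke `heuristic.promote("cam", "Common Area Charges")`, and the next invoice from that vendor resolves in the free tier with no LLM call.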
Testing
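- Unit coverage (extraction traps, reconciliation tolerances, matching, learning).
- Randomized invoice scenarios and the learning loop end to end.
- Stubbed and live LLM paths … ANTHROPIC_API_KEY.
- Committed realistic invoice PDFs under example_docs/cre_invoices/ (a
positive one that reconciles, and a negative one overstating CAM that is
flagged), run through the full pipeline including pdfminer.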
165 tests pass; lint is clean. New optional dependencies are confined to the
existing pdf extra (pdfminer) and dev/test tooling (reportlab).

See docex/intake/README.md for the full design, testing notes, and the
assumptions the intake makes about its inputs.