Commercial real estate invoice intake and reconciliation#46

Open
maccora wants to merge 1 commit into tommyGPT2S:main from maccora:maccora/feature/pdf-intake

maccora commented Jun 23, 2026


Summary

Adds a self-contained subsystem under docex/intake/ that reads commercial
real estate invoice PDFs of arbitrary layout, extracts their fields, and
reconciles them against recorded ground-truth lease actuals (base rent, CAM,
real estate tax and insurance recoveries, totals, square footage, pro-rata
share) so an overcharge can be caught before it is paid.

Design

  • Cost-ordered extraction cascade. A free, deterministic heuristic runs
    first; an optional caller-provided LLM is invoked only for unresolved or
    disputed fields. A clean invoice costs zero LLM calls. The LLM is a BYO
    callable, so the core takes no hard provider dependency.
  • Self-improving learning loop. Label phrasings confirmed against ground
    truth are recorded and promoted into the heuristic's alias set, so the cheap
    tier absorbs vendor formatting over time and the LLM is needed less.
  • Reconciliation. Type-aware tolerances (a cent on money, configurable on
    dates) with per-charge comparison by category. Verdicts are matched,
    discrepancy, incomplete, or unresolved.
  • Ground-truth retrieval. Records are matched by stable identifier
    (invoice number, then PO). An optional embedding-similarity matcher provides
    fuzzy retrieval to find the closest lease when an invoice has no clean
    identifier; it never overrides reconciliation, so a wrong match shows up as a
    discrepancy rather than being silently trusted.
  • DocEX integration. An InvoiceIntakeProcessor wraps the pipeline as a
    BaseProcessor; the core pipeline itself carries no database dependency.
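To make the cost ordering concrete, here is a minimal sketch of the cascade idea. It is not the code in docex/intake/; the names `ALIASES`, `heuristic_extract`, `extract`, and the `(text, missing_fields) -> dict` shape of the LLM callable are all assumptions made for illustration. It also covers only unresolved fields, not disputed ones.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical alias table: field name -> label phrasings the heuristic knows.
ALIASES: Dict[str, List[str]] = {
    "invoice_number": ["invoice number", "invoice no", "invoice #"],
    "total": ["total due", "amount due"],
}

# Assumed shape of the bring-your-own LLM callable:
# (invoice text, fields still missing) -> {field: value}
LLMCallable = Callable[[str, Iterable[str]], Dict[str, str]]


def heuristic_extract(text: str) -> Dict[str, str]:
    """Free tier: read 'label: value' lines whose label is a known alias."""
    found: Dict[str, str] = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'label: value' line
        key = label.strip().lower()
        for field, aliases in ALIASES.items():
            if key in aliases and field not in found:
                found[field] = value.strip()
    return found


def extract(
    text: str, fields: List[str], llm: Optional[LLMCallable] = None
) -> Dict[str, str]:
    """Run the heuristic first; call the LLM only for fields it left open."""
    result = heuristic_extract(text)
    missing = [f for f in fields if f not in result]
    if missing and llm is not None:
        answers = llm(text, missing)
        for field in missing:
            if field in answers:
                result[field] = answers[field]
    return result
```

With this shape, an invoice whose labels are all known aliases never reaches the `llm` callable, which is the "zero LLM calls" property described above.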

A note on a deliberate decision: an embedding-similarity tier was considered for
field extraction and left out, because the learning loop already removes its
only benefit and it carries a false-positive risk the cascade cannot cheaply
audit. The rationale is documented in docex/intake/extractors/cascade.py and
the package README.
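
For readers unfamiliar with the learning loop referred to above, a rough sketch of the promotion step follows. The class name `AliasLearner`, the `record` method, and the `promote_after` threshold are hypothetical; the PR does not state how many confirmations are required or how the real alias set is stored.

```python
from collections import Counter
from typing import Dict, Set, Tuple


class AliasLearner:
    """Counts label phrasings confirmed by ground truth and promotes them."""

    def __init__(self, promote_after: int = 2) -> None:
        # Assumed policy: promote a label after this many confirmations.
        self.promote_after = promote_after
        self.confirmations: Counter = Counter()
        self.aliases: Dict[str, Set[str]] = {}

    def record(self, field: str, label: str, extracted: str, truth: str) -> bool:
        """Record one observation; return True if the label is now an alias."""
        if extracted != truth:
            return False  # only values confirmed by ground truth count
        key: Tuple[str, str] = (field, label.strip().lower())
        self.confirmations[key] += 1
        if self.confirmations[key] >= self.promote_after:
            self.aliases.setdefault(field, set()).add(key[1])
            return True
        return False
```

Once a phrasing such as "Balance Owing" is promoted for `total`, the heuristic tier can resolve it directly on later invoices, so those invoices no longer need the LLM for that field.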

Testing

  • Unit coverage of every layer (normalization, charge taxonomy, heuristic
    extraction traps, reconciliation tolerances, matching, learning).
  • Randomized invoice scenarios across varied layouts, labels, and values.
  • The self-improving loop verified end to end.
  • LLM tier covered by deterministic stubs plus a live test gated on
    ANTHROPIC_API_KEY.
  • Two realistic invoice PDFs committed under example_docs/cre_invoices/ (a
    positive one that reconciles, and a negative one overstating CAM that is
    flagged), run through the full pipeline including pdfminer.
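
The stub-plus-gated-live pattern for the LLM tier can be illustrated as below. This is a sketch using the standard library's `unittest`, not the PR's actual test code; `stub_llm` and `LLMTierTests` are hypothetical names, and the live test body is a placeholder because the real adapter call is not shown in this description.

```python
import os
import unittest

# The live test runs only when a key is present in the environment.
HAS_KEY = bool(os.environ.get("ANTHROPIC_API_KEY"))


def stub_llm(text, missing):
    """Deterministic stand-in for the LLM callable (hypothetical shape)."""
    return {field: "STUB" for field in missing}


class LLMTierTests(unittest.TestCase):
    def test_stub_is_deterministic(self):
        # Same input must always give the same output.
        self.assertEqual(stub_llm("x", ["total"]), stub_llm("x", ["total"]))

    @unittest.skipUnless(HAS_KEY, "ANTHROPIC_API_KEY not set; skipping live test")
    def test_live_llm(self):
        # Placeholder: a real test would call the provider adapter here.
        self.assertTrue(HAS_KEY)
```

Without the environment variable the live test is reported as skipped rather than failed, so the suite stays green on machines that have no API key.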

165 tests pass; lint is clean. New optional dependencies are confined to the
existing pdf extra (pdfminer) and dev/test tooling (reportlab).

See docex/intake/README.md for the full design, testing notes, and the
assumptions the intake makes about its inputs.

Reads CRE invoice PDFs of arbitrary layout, extracts their fields, and
reconciles them against recorded ground-truth lease actuals (base rent,
CAM, real estate tax and insurance recoveries, totals, square footage,
pro-rata share) so an overcharge is caught before it is paid.

Highlights:
- Cost-ordered extraction cascade: a free deterministic heuristic first,
  with an optional caller-provided LLM invoked only for unresolved or
  disputed fields. A clean invoice costs zero LLM calls.
- Self-improving learning loop: label phrasings confirmed against ground
  truth are recorded and promoted into the heuristic, so the cheap tier
  absorbs vendor formatting over time and the LLM is needed less.
- Tolerance-based reconciliation with per-charge comparison; verdicts are
  matched, discrepancy, incomplete, or unresolved.
- Ground-truth stores (in-memory and DocEX-backed) matched by identifier,
  plus optional embedding-similarity retrieval to find the closest lease
  when an invoice has no clean identifier. Retrieval never overrides
  reconciliation, so a wrong match surfaces as a discrepancy.
- DocEX BaseProcessor integration and an example Anthropic adapter for the
  LLM tier; the core pipeline carries no provider or database dependency.
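
A small sketch of tolerance-based, per-charge reconciliation follows. The function names and the exact meaning given to each verdict are assumptions for illustration: here `unresolved` means no ground-truth record was found, and `incomplete` means a category present in the ground truth was not extracted. The package README is the authority on the real semantics.

```python
from datetime import date, timedelta
from decimal import Decimal
from typing import Dict, Optional

MONEY_TOLERANCE = Decimal("0.01")  # "a cent on money"


def money_matches(
    extracted: Decimal, truth: Decimal, tol: Decimal = MONEY_TOLERANCE
) -> bool:
    """Money agrees when the absolute difference is within the tolerance."""
    return abs(extracted - truth) <= tol


def date_matches(extracted: date, truth: date, tol_days: int = 0) -> bool:
    """Dates agree when they are at most tol_days apart (configurable)."""
    return abs(extracted - truth) <= timedelta(days=tol_days)


def reconcile_charges(
    extracted: Dict[str, Decimal], truth: Optional[Dict[str, Decimal]]
) -> str:
    """Compare charges category by category and return a verdict string."""
    if truth is None:
        return "unresolved"  # assumed: no ground-truth record to compare with
    missing = False
    for category, expected in truth.items():
        value = extracted.get(category)
        if value is None:
            missing = True  # assumed: an unextracted category is 'incomplete'
        elif not money_matches(value, expected):
            return "discrepancy"  # any out-of-tolerance charge wins
    return "incomplete" if missing else "matched"
```

For example, an invoice billing CAM at 1400.00 against a recorded 1250.00 returns `discrepancy`, while a one-cent rounding difference still returns `matched`.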

Tested with unit coverage, randomized invoice scenarios, the learning
loop end to end, stubbed and live LLM paths, and committed positive and
negative realistic invoice PDFs under example_docs for non-technical
review.