LedgerCopilot is an AI operations platform for financial document workflows. It combines structured extraction, deterministic validation, policy enforcement, selective human review, and full audit trails to automate high-volume finance operations safely.
Portuguese version: README.pt-BR.md.
Finance teams drown in documents. Invoices (NF-e), boletos, and payment receipts arrive by email, ERP exports, spreadsheets, and shared folders. Someone opens each one, retypes the fields into a system, checks the CNPJ, compares the total against a purchase order, decides whether to pay, and files the result. The work is repetitive, error-prone, and invisible: when a wrong payment goes out, nobody can reconstruct who approved it or why.
Plain OCR does not solve this. It reads characters, but it does not decide. The hard part is not extraction, it is making a defensible operational decision on every document and keeping a record of it.
LedgerCopilot turns each incoming document into a traceable case that moves through a fixed pipeline:
classify → extract → validate → reconcile → apply policy → decide
Every case ends in one of three outcomes: auto_approve, human_review, or reject. Each outcome carries the evidence behind it: which fields were extracted, at what confidence, which validations passed or failed, which policy rules fired, and which model and prompt version produced the extraction.
What you get:
- Multi-channel intake: manual upload, email webhook, CSV/XLSX batch (one case per row), ERP/automation JSON, and a bucket-scan cron that picks up files dropped into storage.
- Per-field extraction with confidence: supplier, CNPJ, total, currency, issue and due dates, document number, line items, cost center, and category.
- Deterministic validation: amount sign, CNPJ presence/format/check-digits, date order, currency, sum-of-items vs total, cost-center membership against the org registry.
- Versioned policy: auto-approve limits, low-confidence routing, unknown-supplier checks, amount-vs-PO deltas, category justification, and dual-approval for urgent payments.
- Reconciliation: document against PO, against payment/ledger, against history (business-key dedup), plus a supplier blocklist hard-reject.
- Selective human review: only cases that genuinely need a human reach a queue, with five actions (approve, reject, edit, request more context, resend to an earlier stage).
- Full audit trail: every state transition writes an immutable
audit_event, and approvers can export a complete audit package per case.
A demo that pipes OCR text into a language model and asks it to "approve or reject" looks impressive and fails in production. LedgerCopilot is built around the opposite conviction: the language model is the least trustworthy part of the system, so it gets the least authority.
| "OCR + chatbot" | LedgerCopilot |
|---|---|
| The model decides | Deterministic code decides; the model only classifies and extracts |
| Numbers come from a probabilistic generation | CNPJ check-digits, item sums, and policy limits are pure functions |
| Document text becomes a prompt | Document content is untrusted input, sanitized before injection |
| Missing field gets a plausible guess | Missing field is null with confidence 0.0, which forces review |
| "It worked in the demo" | Every change is scored on a dataset and gated before promotion |
| No record of why | Every transition is an immutable audit event |
Concretely, five principles are non-negotiable:
- Audit is the backbone, not a feature. No case state changes without an
audit_eventin the same transaction. - Determinism before the model. Validation, dedup, CNPJ checks, totals, and policy are code, never a prompt.
- Human review prefers to escalate over guessing. When in doubt, route to a human. Auto-approving something that should have been reviewed is the worst failure mode and is tracked as such.
- A document is untrusted data. "Approve this" written inside an invoice is a prompt-injection signal, not a command.
- No invented values. An illegible field stays empty. The system never fills in a plausible CNPJ or total.
A pnpm + uv monorepo. The decision logic is pure and testable without a database or network; all I/O lives at the edges.
apps/web/ Next.js (App Router): inbox, case detail, dashboard, monitoring, prompts
apps/api/ FastAPI: auth, cases, uploads, intake channels, prompts/policies
workers/ arq jobs: the document processing pipeline + bucket-scan cron
packages/domain/ Pydantic entities + state machine + decision logic (pure)
packages/validation/ deterministic validation engine (CNPJ, date order, sums, ...)
packages/policy/ policy engine + versioning
packages/reconciliation/ reconciliation engine
packages/agents/ extraction agent with Pydantic-validated output
packages/ai_gateway/ model abstraction, prompt registry, tracing, fallback, sanitization
eval/ dataset with slices, metrics, scorecards, promotion gate
migrations/ Alembic
infra/ IaC + docker-compose for local dev
| Layer | Choice |
|---|---|
| Frontend | Next.js (App Router) + TypeScript + Tailwind + shadcn/ui + Recharts |
| API | FastAPI + Pydantic v2 + SQLAlchemy 2.0 (async) |
| Workers | arq + Redis, with a migration path to Temporal for long-running human pauses |
| Database | Postgres for transactional state |
| Storage | GCS or S3 for files, Redis for queues |
| AI layer | Internal AI Gateway (abstracts model providers) + prompt registry + tracing + model fallback |
| Auth | JWT with role-based access control (analyst, approver, admin) |
packages/{domain,validation,policy,reconciliation} contain no SQLAlchemy and no model calls. That boundary is what makes the decision logic unit-testable and the audit trail trustworthy.
The worker runs every case through the same stages, emitting an audit event at each transition:
1. Document arrives (upload | email | bucket | csv/xlsx | API).
2. Case is created; hash, source, timestamps, and pipeline version are recorded.
3. OCR / parsing.
4. Classification + extraction (self-consistency k=3 on critical fields).
5. Validation engine (deterministic).
6. Policy engine (+ risk classification).
7. Reconciliation (vs PO / payment / ledger / history).
8. Decision: auto_approve | human_review | reject.
9. An audit_event is persisted on every transition.
10. Cost, latency, and token metrics are recorded per stage.
The state machine is explicit. An edited case re-enters at extracted and revalidates, so a human correction never bypasses the deterministic checks:
received → classified → extracted → validated → reconciled → policy_evaluated → decided
decided ─┬─ auto_approved → closed
├─ rejected → closed
└─ in_human_review ─┬─ approved → closed
├─ rejected → closed
└─ edited → (back to extracted, revalidate)
The review queue is the center of the product, and it is deliberately small. Healthy cases stay quiet; only cases that need a decision surface.
When a case lands in review, the analyst sees the reasoning, not raw model output:
- Why the case was escalated, in plain language ("value mismatch (+55%) and failed validation").
- Blocking issues listed explicitly (for example, "CNPJ check digits invalid" and "cost_center not in active registry").
- Extracted fields with per-field confidence bars.
- Validation results marked
blockorwarn. - The full audit trail, showing which actor (system, agent, human) drove each transition and under which model and prompt version.
Five actions are available, scoped by role:
| Action | Effect |
|---|---|
| Approve | Close the case as approved (approver/admin) |
| Reject | Close the case as rejected (approver/admin) |
| Edit | Correct fields and re-enqueue from validated (analyst+) |
| Request more context | Add an annotation without changing status |
| Resend to stage | Re-enter the resumable pipeline at extracted or validated |
Roles are enforced on every endpoint and queries are organization-scoped. Approvers and admins can export a per-case audit package.
The AI layer is treated like production software: versioned, traced, measured, and gated.
- Prompt registry. Prompts are not inlined in code. Each version has
dev/staging/productionaliases and is referenced byprompt_version_id. Promoting toproductionwires the new system text directly into the pipeline worker. - Tracing. Every model call passes through the AI Gateway and records tokens, latency, cost, stage, model, and a redacted prompt/completion. Sensitive PII never reaches traces in the clear.
- Evaluation dataset. Slices cover the failure modes that matter:
clean_invoice,low_quality_scan,handwritten_receipt,duplicate_invoice,adversarial_formatting,supplier_unknown,value_mismatch,language_variation. - Promotion gate.
eval.gateblocks a promotion (exit code 1) when any rule is violated:false_auto_approve_rate > baseline + 1pp,critical_field_accuracy < 85%,cost/doc > baseline × 1.20, ordecision_accuracydropping more than 5pp. The same rules guardPOST /api/v1/prompts/{id}/promote.
A draft prompt that fails the gate cannot reach production, and the UI shows exactly which metric blocked it (see the screenshots below).
Sign in with seeded demo roles (analyst, approver, admin).
Inbox: every document becomes a case with status, type, and SLA.
Case review: the "why", blocking issues, per-field confidence, and the decision panel.
Deterministic validation and the immutable audit trail on the same case.
Executive dashboard: throughput, average confidence, cost per document, decision mix.
Monitoring: per-stage model runs, latency, tokens, and cost.
Prompt versions with their scorecard, gated against the production baseline.
Promotion blocked: critical field accuracy below threshold fails the gate.
Authoring a new prompt version with generation config and self-consistency k.
The dataset is organized by slice, where each slice targets a specific failure mode rather than a happy path. The metrics that drive promotion decisions are:
- exact and normalized field accuracy
missing_critical_fields_ratefalse_auto_approve_rate(the metric the whole design optimizes against)- exception routing precision
- human correction rate
- cost per document and p95 latency
Critical fields (total_amount, tax_id_cnpj, document_number) go through self-consistency k=3 and weigh into overall_confidence. supplier_name_accuracy is reported but intentionally not gated, since supplier name is not a critical field.
Run an evaluation and gate a candidate against production:
# Score a prompt version across all slices and write a scorecard
uv run python -m eval.run --prompt-version dev --out eval/scorecards/candidate.json
# Compare against the production baseline (exit 1 = blocked)
uv run python -m eval.gate \
--candidate eval/scorecards/candidate.json \
--baseline eval/scorecards/production.jsonA blocked promotion looks like this:
=== eval.gate: extraction-v2-experimental vs extraction-v1 ===
PROMOTION BLOCKED for extraction-v2-experimental:
✗ false_auto_approve_rate: 0.025 > 0.000+0.01
✗ critical_field_accuracy: 0.000 < 0.85
✗ decision_accuracy: 0.625 < 0.700 (baseline-0.05)
Exit code: 1
The dataset currently ships one fixture per slice, which is enough to demonstrate the gate end to end. Expanding it is the obvious next step for statistical significance.
Every design choice here has a cost, and naming it is part of the point.
- Determinism over flexibility. Putting CNPJ checks, sums, and policy in code makes the system predictable and auditable, but it means new rules require a code change and a migration, not a prompt edit. That is intentional: rules that move money should not live in a prompt.
- Escalation over automation rate. Preferring
human_reviewwhen in doubt keepsfalse_auto_approve_ratelow at the cost of a lower auto-approval rate. For finance operations, a missed review is far more expensive than an extra one. - arq + Redis over Temporal, for now. arq covers the current pipeline cleanly. Long-running workflows with multi-day human pauses will need Temporal; the architecture leaves room for that migration without rewriting the domain logic.
- Synchronous-feeling pipeline over event sourcing everywhere. The audit log gives lineage without the full operational weight of an event-sourced system. If reconciliation grows more complex, that boundary may need to move.
- One fixture per slice over a large benchmark. This proves the LLMOps loop works, not that the model is production-accurate. The dataset is the first thing to grow before a real deployment.
The product was built in four phases, each gated by a concrete success criterion.
| Phase | Scope | Status |
|---|---|---|
| 1. Core MVP | Upload, classification, extraction, validation, case detail, review queue, audit events | Complete |
| 2. Workflow intelligence | Policy engine, per-field confidence, approve/reject/edit, reconciliation | Complete |
| 3. LLMOps layer | Eval framework, gate CLI, scorecards, prompt registry wired to runtime, tracing | Complete |
| 4. Enterprise polish | JWT + RBAC, org-scoped queries, dashboards, audit export, all five intake channels | Complete |
Next, in priority order:
- Expand the evaluation dataset beyond one fixture per slice.
- Migrate long-running, human-paused workflows to Temporal.
- Deepen reconciliation against real ERP reference data.
- Add the conversational review assistant (ask "why was this escalated?" and get a traceable answer).
Requirements: Python 3.12+, Node 20+, uv, pnpm, Docker.
corepack enable pnpm # if not already enabled
uv sync # Python deps
pnpm install # frontend deps
docker compose -f infra/docker-compose.dev.yml up -d # postgres + redis + storage
uv run alembic upgrade head # migrationsRun the three processes:
uv run uvicorn apps.api.main:app --reload # API at :8000
uv run arq workers.pipeline.WorkerSettings # pipeline worker
pnpm --filter web dev # frontend at :3000Quality gates (run before considering a task done):
make check # ruff + mypy + pytest + web lint/typecheckThree users are seeded at API startup (password demo123):
| Role | Permissions | |
|---|---|---|
analyst@demo.com |
analyst | Read cases, submit edit reviews |
approver@demo.com |
approver | Analyst access + approve/reject + audit export |
admin@demo.com |
admin | Full access + create/promote prompts + dashboard |
# Login and capture a JWT
TOKEN=$(curl -s -X POST http://localhost:8000/api/v1/auth/login \
-H "Content-Type: application/json" \
-d '{"email":"admin@demo.com","password":"demo123"}' | jq -r .access_token)
# Upload a document
curl -s -X POST http://localhost:8000/api/v1/documents \
-H "Authorization: Bearer $TOKEN" \
-F "file=@/path/to/invoice.pdf"
# List cases (organization-scoped)
curl -s -H "Authorization: Bearer $TOKEN" http://localhost:8000/api/v1/cases







