Agentic financial leakage auditing platform for transaction review, anomaly detection, approval-gated cleanup, and GST-oriented finance workflows.
Built with FastAPI + LangGraph + SQLModel, with a practical focus on Indian SMB finance operations.
Sentinel-Fi is designed as a portfolio-grade systems project, not just a notebook or single-model demo. It combines:
- workflow orchestration for audit and cleanup decisions
- structured backend services with persistence and migrations
- multiple classification strategies and fallbacks
- report generation and operational controls
- a large automated test suite for confidence in behavior
- Audit: ingest statements and identify likely leakage, anomalies, and tax issues
- Cleanup: prepare approval-gated follow-up actions for finance operations
- Strategy: provide a foundation for broader finance automation workflows
- Ingestion connectors for CSV, PDF, Stripe, and Razorpay
- OCR fallback for scanned PDF statements (PaddleOCR + pypdfium2, optional dependency)
- PII scrubbing and transaction normalization
- UPI-aware signal extraction (VPA/UTR heuristics, P2P vs P2M hints)
- Taxonomy-driven intelligence using a project-owned taxonomy (`data/taxonomy_base.yaml`) plus Sentinel overrides
- Taxonomy-aware multiclass ML classifier path (primary), mapped to business/personal audit decisions
- Conditional LangGraph routing (skips fallback/LLM nodes when no work is routed)
- LangGraph Audit Graph with specialized nodes:
  - Data Ingestor
  - MCC Classifier (deterministic early exit)
  - ML Classifier (primary classification engine)
  - Routing Supervisor (BGE-M3 + heuristic fallback)
  - Rule-Based Classifier (fast deterministic fallback path)
  - LLM Reasoner (OpenAI with strict JSON contract + fallback)
  - Fast-mode escalation policy (ML/rule-fallback uncertain cases only)
  - Leak Detector
  - GST Sentinel
  - Cleanup Planner
- LangGraph Cleanup Graph with approval gate and write execution stage
- Leak detection for:
  - duplicate subscriptions
  - zombie subscriptions
  - price hikes
  - SaaS sprawl
  - tax miscategorization
- GST anomaly detection and missed ITC estimation
- Auto-generated client reports:
  - Markdown report
  - professional PDF report
- PostgreSQL-first persistence with Alembic migrations
- Explainability traces per transaction (`classification_decisions`)
- Active learning loop: correction ingestion + continuous retraining
- Probability calibration for ML confidence reliability (Platt scaling)
- Drift monitoring on confidence/data distribution (`/v1/runtime/stats`)
- Runtime service metrics (`/v1/runtime/stats`, includes drift signals)
- API key authentication middleware (`ENABLE_API_KEY_AUTH`, `API_KEYS_CSV`)
- Per-client sliding-window rate limiting for `/v1/*` APIs (in-memory or Redis-backed)
- Secure upload handling (UUID filenames, path-traversal-safe, max-size enforcement)
- Local statement ingestion allowlist for CSV/PDF paths (`LOCAL_INGESTION_ROOTS_CSV`)
- Async API handlers with threadpool boundaries for blocking DB/LLM/file operations
- Restart recovery for queued/running audit and ML retraining jobs persisted in the DB
- Prometheus metrics endpoint at `/metrics`
- Optional OpenTelemetry tracing (`OTEL_ENABLED=true`)
- Lifespan-based startup + readiness checks (DB + ML model availability)
- Integration and unit tests
- Dockerized API deployment
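Several of the leak detectors listed above are simple, explainable heuristics. As a minimal sketch, duplicate-subscription detection can be reduced to grouping repeated charges (the `merchant`/`amount` field names here are assumptions for illustration, not the project's actual transaction schema):

```python
from collections import defaultdict

def find_duplicate_subscriptions(transactions: list[dict]) -> dict:
    """Group transactions by (merchant, amount) and keep groups charged
    more than once -- an illustrative duplicate-subscription heuristic."""
    groups = defaultdict(list)
    for tx in transactions:
        groups[(tx["merchant"].lower(), tx["amount"])].append(tx)
    # Only groups with repeated identical charges are flagged as duplicates.
    return {key: txs for key, txs in groups.items() if len(txs) > 1}
```

A real detector would also bound the grouping to a report period and tolerate small amount differences; this sketch only shows the shape of the check.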
```text
Input (CSV/PDF/APIs)
  -> Data Ingestor (clean + scrub PII)
  -> MCC Classifier (early deterministic path)
  -> ML Classifier (primary path)
  -> Routing Supervisor (BGE-M3)
  -> Rule-Based Classifier (fallback path)
  -> LLM Reasoner (deep path + escalations)
  -> Leak Detector
  -> GST Sentinel
  -> Cleanup Planner
  -> Persist + Generate Reports

Cleanup Graph (separate, paid tier)
  -> Approval Gate
  -> Execute write actions (ledger/email/invoice/GST recon)
```
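The conditional edges in the audit flow boil down to one routing decision after classification: deterministic MCC hits and confident ML/rule results skip straight to leak detection, and only uncertain cases reach the LLM Reasoner. A minimal sketch of that policy (node names and the threshold are illustrative, not the project's tuned values):

```python
CONFIDENCE_THRESHOLD = 0.75  # illustrative cut-off, not the project's actual value

def route_after_classification(result: dict) -> str:
    """Pick the next audit-graph node for one classified transaction."""
    if result.get("source") == "mcc":
        return "leak_detector"  # deterministic early exit, no LLM needed
    if result.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD:
        return "leak_detector"  # ML/rule result is confident enough
    return "llm_reasoner"       # escalate only the uncertain cases
```

In LangGraph this kind of function would typically be wired in as a conditional edge, which is how the graph skips the fallback/LLM nodes when no work is routed to them.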
Reference integration details: docs/REFERENCE_INTEGRATION.md
- Run the sample audit end to end with `uv run python scripts/run_sample_audit.py`
- Start the API locally with `make runserver`
- Open the control room UI at http://localhost:8000/
- Inspect the generated markdown and PDF artifacts under `output/`
First-time setup with uv (creates `.venv`, installs dependencies, allocates free non-standard ports, builds and runs Docker):

```bash
bash scripts/setup_first_time.sh
```

Regular run (re-checks running ports, allocates free non-standard ports, starts the stack):

```bash
bash scripts/run_regular.sh
```

Both scripts generate `.env.runtime` with host ports:

- `HOST_API_PORT` (default preference `18000`)
- `HOST_POSTGRES_PORT` (default preference `15432`)

Install base dependencies:

```bash
uv sync
```

For scanned-statement OCR support:

```bash
uv sync --extra ocr
```

For tests/lint:

```bash
uv sync --dev --extra ocr
```
```bash
# Optional local hooks
uv run pre-commit install
```

```bash
cp .env.example .env
```

Set `OPENAI_API_KEY` if you want live LLM reasoning.
Set `DATABASE_URL` to your Postgres instance (see `.env.example`).
```bash
make migrate  # alembic upgrade head
```

```bash
make sync  # fast-forward local branch from origin/main
make runserver
```

Open control room UI: http://localhost:8000/
Open API docs: http://localhost:8000/docs
```bash
make dev
```

Frontend dev URL: http://127.0.0.1:5173 (served separately; API calls go to http://127.0.0.1:8000/v1).
Run infra only (default):

```bash
docker compose up
```

Run the full Docker stack (includes the API container):

```bash
docker compose --profile app up --build
```

Docker host ports are configurable and non-standard by default:

- API profile: `${HOST_API_PORT:-18000}`
- Postgres: `${HOST_POSTGRES_PORT:-15432}`
```bash
uv run python scripts/run_sample_audit.py
```

Output artifacts:

- `output/reports/<audit_id>.md`
- `output/pdf/<audit_id>.pdf`
```bash
uv run python scripts/evaluate_bge_m3.py --input data/upi_classifier_eval.csv --model BAAI/bge-m3
```

If the sentence-transformers/BGE model is unavailable, the script falls back to a TF-IDF baseline and still produces:

- accuracy
- confusion matrix
- misclassification table
- `output/reports/classifier_eval_results.csv`
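The TF-IDF fallback can be pictured as nearest-neighbour classification over term-weighted vectors. A self-contained, stdlib-only sketch of the idea (not the script's actual implementation, which likely uses scikit-learn):

```python
import math
from collections import Counter

def _vec(text: str, idf: dict) -> dict:
    """Term-frequency vector weighted by inverse document frequency."""
    tf = Counter(text.lower().split())
    return {t: c * idf.get(t, 0.0) for t, c in tf.items()}

def _cosine(a: dict, b: dict) -> float:
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def tfidf_baseline(train: list[str], labels: list[str], query: str) -> str:
    """1-nearest-neighbour TF-IDF classifier standing in for the BGE-M3 path."""
    n = len(train)
    df = Counter(t for doc in train for t in set(doc.lower().split()))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    train_vecs = [_vec(doc, idf) for doc in train]
    qv = _vec(query, idf)
    scores = [_cosine(qv, tv) for tv in train_vecs]
    return labels[scores.index(max(scores))]
```

The real evaluation script also reports accuracy and a confusion matrix over the whole eval CSV; this sketch only shows a single classification step.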
This training pipeline merges:
- external internet datasets from HuggingFace:
  - `Andyrasika/bank_transactions`
  - `alokkulkarni/financial_Transactions`
  - `rajeshradhakrishnan/fin-transaction-category`
- bundled local bootstrap training data (`data/training/bootstrap_transactions.jsonl`)
- local UPI seed labels
- automatic candidate model selection (`LogisticRegression`, `SGDClassifier`, `ComplementNB`)
- dataset integrity enforcement via `data/external/dataset_manifest.json` (URL + SHA-256 + size pinning)
```bash
uv run python scripts/train_ml_classifier.py --feedback-path data/feedback/corrections.jsonl
```

Outputs:

- `models/transaction_ml_classifier.joblib`
- `output/reports/ml_training_metrics.json`
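The Platt-scaling calibration mentioned in the feature list is, at inference time, just a fitted sigmoid applied to the classifier's raw decision score. A minimal sketch, with parameters `a` and `b` assumed to have been fit on held-out data during training:

```python
import math

def platt_calibrate(score: float, a: float, b: float) -> float:
    """Map a raw decision score to a calibrated probability.

    Platt scaling: p = 1 / (1 + exp(a * score + b)), where a and b are
    fit by minimizing log-loss on a held-out calibration set.
    """
    return 1.0 / (1.0 + math.exp(a * score + b))
```

Calibrated probabilities are what make confidence thresholds (for example, the escalation cut-off in the audit graph) meaningful across retrains.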
- `POST /v1/audit/upload` - upload CSV/PDF statement
- `POST /v1/audit/run` - direct audit execution for small workloads
- `POST /v1/audit/submit` - async audit job submission
- `GET /v1/audit/jobs/{job_id}` - async audit job status/result polling
- `GET /v1/audit/jobs` - list recent audit jobs
- `GET /v1/audits` - list recent audit runs
- `POST /v1/cleanup/run` - execute approved cleanup tasks for an existing audit (`audit_id`, `approved_task_ids`)
- `POST /v1/ml/feedback` - ingest corrected labels (active learning)
- `POST /v1/ml/retrain` - force retraining now
- `GET /v1/ml/status` - feedback/retraining status
- `GET /v1/admin/settings` - admin-only runtime settings view
- `PUT /v1/admin/settings` - admin-only runtime settings update
- `GET /v1/runtime/stats` - rolling runtime/quality + ML drift metrics
- `GET /healthz`
- `GET /readyz`
- `GET /` - Sentinel-Fi Control Room UI
When `ENABLE_API_KEY_AUTH=true`, all `/v1/*` endpoints require the `x-api-key` header.
`/v1/admin/*` endpoints always require an admin key from `ADMIN_API_KEYS_CSV`.
The rate-limiting backend can be set with `RATE_LIMIT_BACKEND=memory|redis|auto` and `REDIS_URL=...`.
Local CSV/PDF ingestion is restricted to `LOCAL_INGESTION_ROOTS_CSV` (default: `data/uploads,data`).
Scanned-PDF OCR fallback can be toggled via `ENABLE_PDF_OCR_FALLBACK` and the language set with `PDF_OCR_LANG`.
Cleanup execution supports real integrations when enabled (`CLEANUP_LIVE_MODE=true`):

- SMTP send for `email_draft` via `CLEANUP_EMAIL_SMTP_*`
- Webhook execution for `ledger_reclass`, `invoice_fetch`, and `gst_recon` via:
  - `CLEANUP_LEDGER_WEBHOOK_URL`
  - `CLEANUP_INVOICE_WEBHOOK_URL`
  - `CLEANUP_GST_WEBHOOK_URL`
- Public-safe copy reviewed before publication
- Full automated test suite re-run successfully
- Repository metadata polished for public presentation
```json
{
  "source_type": "csv",
  "source_path": "data/sample_transactions.csv",
  "source_config": {},
  "client_name": "Demo SMB",
  "report_period": "Nov 2025",
  "generate_pdf": true,
  "generate_markdown": true
}
```

Async flow:

- `POST /v1/audit/submit` with the same payload.
- Poll `GET /v1/audit/jobs/{job_id}` until `status` is `succeeded` or `failed`.
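A polling loop for the async flow only needs the job's `status` field. A transport-agnostic sketch, where `fetch_status` is any callable that performs `GET /v1/audit/jobs/{job_id}` and returns the decoded JSON (injected so the loop itself stays HTTP-client-free):

```python
import time

def poll_job(fetch_status, interval_s: float = 2.0, timeout_s: float = 300.0) -> dict:
    """Poll an audit job until it reaches a terminal status.

    `fetch_status` wraps the actual HTTP call to GET /v1/audit/jobs/{job_id};
    the interval and timeout defaults are illustrative, not API requirements.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = fetch_status()
        if job.get("status") in ("succeeded", "failed"):
            return job
        time.sleep(interval_s)
    raise TimeoutError("audit job did not reach a terminal status in time")
```

Injecting the fetch callable also makes the loop trivially testable with a stub instead of a live server.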
```text
src/sentinelfi/
  api/           FastAPI app + schemas
  agents/        rule-based, LLM, GST, cleanup planning
  connectors/    CSV/PDF/Stripe/Razorpay ingestion
  core/          settings + logging
  domain/        typed models + graph state
  graph/         langgraph workflows
  reports/       markdown/pdf report builders
  repositories/  SQLModel persistence
  services/      orchestration, routing, detection
scripts/
  run_sample_audit.py
  evaluate_bge_m3.py
  train_ml_classifier.py
data/
  sample_transactions.csv
  taxonomy_base.yaml
  taxonomy_overrides.yaml
  upi_classifier_eval.csv
alembic/
  env.py
  versions/
tests/
```
- Replace rule-based SLM with local Phi-4/Mistral inference service
- Add idempotency keys and audit trails for cleanup write actions
- Add RBAC, tenant isolation, encryption-at-rest, and secret manager integration
- Add job queue (Celery/Arq) for heavy PDF/API sync workloads
- Add Prometheus/OpenTelemetry instrumentation and alerts
- Add contract tests for bank/PDF parsers per institution