Open-source, offline-first Python tool to parse credit card PDF statements from Indian banks into structured data.
Indian bank credit card statements are password-protected PDFs with inconsistent formats — making expense tracking painful. StmtForge solves this:
- Parse PDF statements from Indian banks (currently implemented: HDFC, ICICI, SBI, Axis, Kotak, Yes, CSB, Federal, IDFC First)
- 100% offline — no data leaves your machine, no cloud APIs, no telemetry
- Hybrid extraction — deterministic regex parsers → table extraction → OCR → local LLM (Ollama)
- One command: `pip install stmtforge` and start analyzing your credit card spend
Built for anyone in India who wants to track credit card expenses without trusting third-party apps with their financial data.
StmtForge includes a Streamlit analytics dashboard with interactive charts, filters, and CSV export.
Analytics: total spend, monthly trends, category breakdown, top merchants, bank & card comparison, daily heatmap, drill-downs
| Feature | Description |
|---|---|
| 9 bank-specific parsers | Dedicated parsers for HDFC, ICICI, SBI, Axis, Kotak, Yes, CSB, Federal, IDFC First |
| PDF unlock & parse | Auto-decrypts password-protected statements (DOB, PAN, custom patterns) |
| Hybrid extraction pipeline | Deterministic → table → OCR → local LLM fallback chain |
| Local LLM via Ollama | Qwen / Mistral / Llama3 for unstructured statement parsing |
| Gmail auto-fetch | Read-only OAuth2 — downloads statement PDFs from Gmail automatically |
| Multi-card tracking | Track spend across multiple cards and banks |
| Auto-categorization | Rule-based merchant classification (Shopping, Food, Travel, EMI, etc.) |
| Transaction deduplication | Hash-based dedup with incremental processing |
| Streamlit dashboard | Interactive Plotly charts, sidebar filters, CSV export |
| Privacy-first design | PII redacted from logs, HMAC pseudonymization, DPDP-aligned |
| CLI interface | `stmtforge run`, `stmtforge dashboard`, `stmtforge init` |
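Hash-based deduplication can be pictured as fingerprinting each transaction and letting the database reject repeats. The sketch below is illustrative only: the `transactions` table, its columns, and the helper names `transaction_hash` and `insert_unique` are assumptions, not StmtForge's actual schema or API.

```python
import hashlib
import sqlite3


def transaction_hash(bank: str, card: str, date: str, description: str, amount: float) -> str:
    """Build a stable fingerprint from the fields that identify a transaction."""
    # Normalize case and whitespace so cosmetic differences do not defeat dedup.
    normalized = "|".join([
        bank.strip().lower(),
        card.strip().lower(),
        date.strip(),
        " ".join(description.split()).lower(),
        f"{amount:.2f}",
    ])
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def insert_unique(conn: sqlite3.Connection, rows) -> int:
    """Insert rows, skipping any whose hash already exists. Returns the number inserted."""
    # Hypothetical schema: the hash is the primary key, so repeats are rejected.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions ("
        " txn_hash TEXT PRIMARY KEY, bank TEXT, card TEXT,"
        " txn_date TEXT, description TEXT, amount REAL)"
    )
    inserted = 0
    for row in rows:
        digest = transaction_hash(*row)
        cur = conn.execute(
            "INSERT OR IGNORE INTO transactions VALUES (?, ?, ?, ?, ?, ?)",
            (digest, *row),
        )
        inserted += cur.rowcount  # 1 if the row was new, 0 if it was ignored
    conn.commit()
    return inserted
```

Because the hash is the primary key, re-running an import over the same statements inserts nothing new, which is one way to get incremental processing. A caveat of this simple scheme: two genuinely identical purchases on the same day would collapse into one row unless an extra field (such as the line's position in the statement) is added to the fingerprint.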
Install with pip:

    pip install stmtforge

Optional extras:

    pip install "stmtforge[gmail]"   # Gmail fetch support
    pip install "stmtforge[ocr]"     # OCR fallback support
    pip install "stmtforge[all]"     # Gmail + OCR extras

From source (development install):

    git clone https://github.com/madhav921/stmt-forge.git
    cd stmt-forge
    python -m venv .venv
    .venv\Scripts\activate            # Windows
    # source .venv/bin/activate       # macOS / Linux
    pip install -e ".[dev]"           # developer tools only
    pip install -e ".[dev,all]"       # developer tools + Gmail + OCR extras

| Requirement | Purpose |
|---|---|
| Python 3.11+ | Runtime |
| Ollama (optional) | Local LLM for unstructured PDF parsing |
| Google Cloud project (optional) | Gmail API — not needed for manual PDF import |
| Tesseract OCR binary (optional) | Required by OCR fallback when stmtforge[ocr] is installed |
| qpdf (optional) | Fallback PDF decryption |
Quick start:

    # 1. Set up project
    mkdir ~/my-statements && cd ~/my-statements
    stmtforge init                        # creates config.yaml, .env.example, data/
    # 2. Configure PDF passwords
    cp .env.example .env                  # then edit .env with your passwords
    # 3. (Optional) Set up local LLM
    ollama pull qwen2.5:3b
    # 4. Run the pipeline
    stmtforge run --local                 # parse local PDFs
    stmtforge run --full                  # Gmail fetch + parse
    stmtforge run --folder path/to/pdfs   # specific folder
    # 5. View insights
    stmtforge dashboard

Manual PDF import: Drop PDFs into `data/raw_pdfs/<bank>/` and run `stmtforge run --local`. No Gmail setup needed.

| Bank | Parser | Card Detection |
|---|---|---|
| HDFC Bank | `hdfc_parser` | Swiggy, Tata Neu, Millennia, etc. |
| ICICI Bank | `icici_parser` | Amazon Pay, Coral, Platinum, etc. |
| SBI Card | `sbi_parser` | Cashback, Elite, SimplyCLICK, etc. |
| Axis Bank | `axis_parser` | Neo, Flipkart, Ace, etc. |
| Kotak Mahindra | `kotak_parser` | 811, League Platinum, etc. |
| Yes Bank | `yes_parser` | Marquee, Prosperity, etc. |
| CSB Bank | `csb_parser` | Edge, etc. |
| Federal Bank | `federal_parser` | Signet, Scapia, etc. |
| IDFC First Bank | `idfc_first_parser` | First Select, Classic, WOW, etc. |
| Any other bank | `generic_parser` + LLM | Auto-detected |

Statement formats change over time. Open an issue if a parser produces incorrect results.
PDF → Unlock → Bank Parser → Table Extraction → OCR → LLM → Validate → SQLite → Dashboard
- PDF Unlock — Tries password combos (DOB, PAN, custom) via pikepdf
- Bank Parser — Bank-specific regex parser extracts transactions directly
- Fallback Chain — Table extraction (pdfplumber) → Layout text → OCR (Tesseract) → Local LLM (Ollama)
- Validation — Date normalization, amount bounds, dedup, confidence scoring
- Categorization — Rule-based merchant → category mapping
- Storage — SQLite with transaction-level deduplication and incremental processing
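The fallback chain described above can be sketched as an ordered list of extraction stages, each tried only when the previous one yields nothing usable. This is a minimal illustration under stated assumptions: `extract_with_fallback`, the stage names, and the stub stages are hypothetical and do not reflect the real interface of `hybrid_pipeline.py`.

```python
from typing import Callable, Iterable, Optional

# A stage takes a PDF path and returns a list of transaction dicts.
Extractor = Callable[[str], list[dict]]


def extract_with_fallback(
    pdf_path: str,
    stages: Iterable[tuple[str, Extractor]],
    min_rows: int = 1,
) -> tuple[Optional[str], list[dict]]:
    """Run stages in order and return (stage_name, rows) from the first that succeeds."""
    for name, extractor in stages:
        try:
            rows = extractor(pdf_path)
        except Exception:
            # A failing stage should not abort the run; fall through to the next one.
            continue
        if len(rows) >= min_rows:
            return name, rows
    return None, []


# Stub stages standing in for regex, table, and OCR extraction.
def regex_stage(path: str) -> list[dict]:
    return []  # pretend the bank-specific regex found nothing


def table_stage(path: str) -> list[dict]:
    raise ValueError("no tables found")


def ocr_stage(path: str) -> list[dict]:
    return [{"date": "2024-01-05", "description": "SWIGGY", "amount": 349.0}]


stage_name, rows = extract_with_fallback(
    "statement.pdf",
    [("regex", regex_stage), ("table", table_stage), ("ocr", ocr_stage)],
)
```

Ordering the stages from cheapest and most deterministic to slowest and least predictable means the LLM is consulted only when everything else has failed. Returning the stage name alongside the rows also gives a later validation step a natural input for confidence scoring.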
StmtForge is built around a local-first, zero-upload architecture.

| Area | Details |
|---|---|
| Processing | 100% local — no cloud, no external APIs |
| Storage | Local SQLite + local files only |
| Telemetry | None — no analytics, no phone-home |
| Log privacy | PII auto-redacted (emails, phones, PAN, card numbers) |
| PDF passwords | .env → memory only; never logged or stored in DB |
| Gmail | Optional, read-only OAuth2; revoke anytime at Google Permissions |
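To make the log-privacy and pseudonymization rows concrete, here is a small sketch of how PII redaction and HMAC pseudonymization might look using only the standard library. The regular expressions, placeholder strings, and the names `redact`, `RedactingFilter`, and `pseudonymize` are assumptions for illustration, not StmtForge's actual implementation.

```python
import hashlib
import hmac
import logging
import re

# Illustrative patterns only; real-world formats vary more than this.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"), "[PAN]"),               # 5 letters, 4 digits, 1 letter
    (re.compile(r"\b\d(?:[ -]?\d){12,18}\b"), "[CARD]"),               # 13 to 19 digits, optional separators
    (re.compile(r"(?<!\d)(?:\+91[ -]?)?[6-9]\d{9}(?!\d)"), "[PHONE]"),  # 10-digit Indian mobile
]


def redact(text: str) -> str:
    """Replace anything that looks like PII with a fixed placeholder."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


class RedactingFilter(logging.Filter):
    """Logging filter that redacts the fully formatted message before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = None  # message is already formatted; avoid re-applying args
        return True


def pseudonymize(value: str, key: bytes) -> str:
    """Return a keyed, deterministic pseudonym for a sensitive value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

Attaching the filter with `logger.addFilter(RedactingFilter())` would apply redaction to every record passing through that logger. A keyed HMAC, unlike a plain hash, cannot be reversed by brute-forcing the small space of card or phone numbers unless the key is also known, so the key should stay on the local machine.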
See SECURITY.md for vulnerability reporting and full security policy.
`stmtforge init` creates a `config.yaml` with these sections:
| Section | Purpose |
|---|---|
| `gmail` | Sender domains, search keywords, attachment filters |
| `credit_cards` | Your banks and card names |
| `pdf_passwords` | Password patterns (from `.env`) |
| `parsers` | Email/filename → bank mapping, card identifiers |
| `categories` | Merchant → category rules |
| `database` | SQLite path |
| `llm` | Ollama model, URL, temperature |

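As a rough illustration of what rule-based merchant → category mapping involves, the sketch below matches whole-word keywords in a transaction description. The rule table, its keywords, and the `categorize` function are hypothetical; the actual rule format in `config.yaml` may differ.

```python
import re

# Hypothetical rules: category -> merchant keywords. Earlier categories win on ties.
RULES = {
    "Food": ["swiggy", "zomato"],
    "Shopping": ["amazon", "flipkart"],
    "Travel": ["irctc", "uber"],
    "EMI": ["emi"],
}


def categorize(description: str, rules: dict[str, list[str]] = RULES, default: str = "Other") -> str:
    """Return the first category whose keyword appears as a whole word in the description."""
    text = description.lower()
    for category, keywords in rules.items():
        for keyword in keywords:
            # Word boundaries stop "emi" from matching inside "premium".
            if re.search(rf"\b{re.escape(keyword)}\b", text):
                return category
    return default
```

With these example rules, `categorize("SWIGGY*ORDER 1234 BANGALORE")` returns `"Food"`, while `categorize("PREMIUM STORE")` falls through to `"Other"`.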
To add a parser for a new bank, subclass `BaseParser`:

    from stmtforge.parsers.base_parser import BaseParser, parse_date, parse_amount

    class MyBankParser(BaseParser):
        BANK_NAME = "mybank"

        def parse(self, pdf_path):
            records = [...]  # Extract transactions
            return self._get_standard_df(records)

Register it in `src/stmtforge/parsers/registry.py` and add mappings in `config_template.yaml`.
See CONTRIBUTING.md for details.
Project layout:

    stmt-forge/
    ├── src/stmtforge/          # Package source
    │   ├── cli.py              # CLI entry point
    │   ├── run_pipeline.py     # Pipeline orchestrator
    │   ├── hybrid_pipeline.py  # Hybrid extraction engine
    │   ├── parsers/            # 9 bank parsers + generic + categorizer
    │   ├── dashboard/          # Streamlit analytics app
    │   ├── pdf_processing/     # PDF unlock & text extraction
    │   ├── llm/                # Ollama client & prompts
    │   ├── gmail/              # Gmail OAuth & fetcher
    │   ├── database/           # SQLite layer
    │   ├── validator/          # Transaction validation
    │   └── utils/              # Config, logging, privacy, hashing
    ├── tests/                  # Test suite
    ├── pyproject.toml          # Build config
    └── README.md

Bug reports, new bank parsers, and code fixes welcome. See CONTRIBUTING.md.