Follow-up to PR #133 (auto-categorization Phase 1).
Background
Phase 1 uses hardcoded patterns to normalize transaction names (Square, PayPal, Amazon, etc.) based on documented Plaid transaction formats. This works well for known processor prefixes and noise patterns.
The hardcoded approach has limits: novel merchant formats, regional processors, and evolving Plaid naming conventions may not be covered.
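To make the limitation concrete, here is a minimal sketch of what a hardcoded-pattern normalizer might look like. The pattern list and the `normalize` function are illustrative assumptions, not the actual code from PR #133.

```python
import re

# Illustrative patterns only; NOT the actual list shipped in PR #133.
PROCESSOR_PREFIXES = [
    re.compile(r"^SQ \*", re.IGNORECASE),      # Square-style prefix
    re.compile(r"^PAYPAL \*", re.IGNORECASE),  # PayPal-style prefix
]


def normalize(name: str) -> str:
    """Strip known processor prefixes and collapse whitespace (hypothetical helper)."""
    cleaned = name.strip()
    for pattern in PROCESSOR_PREFIXES:
        cleaned = pattern.sub("", cleaned)
    return " ".join(cleaned.split())


known = normalize("SQ *BLUE BOTTLE")        # prefix is in the list, so it is stripped
unknown = normalize("XYZPAY*BLUE BOTTLE")   # made-up processor prefix, left untouched
```

The second call shows the gap: a prefix that is not in the list passes through unchanged, so the same merchant ends up with two different normalized names.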
Proposed approach for Phase 2
Investigate and implement a lightweight ML-based or rule-learning approach to merchant normalization:
- Option A: Bayesian-style frequency analysis: track which normalization transformations correlate with high-confidence category assignments across users
- Option B: Pre-trained tokenizer or entity recognizer: use an open-source component trained on financial transaction data (e.g. spaCy NER with a financial corpus). spaCy NER is a statistical model, so this option would need an audit path to satisfy the explainability criterion below
- Option C: Pattern extraction from data: analyze real transaction names in the database to discover new noise patterns automatically, and periodically update the pattern list
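As one possible starting point for Option C, the sketch below counts how many distinct merchant remainders follow each leading token. A token that precedes many different remainders is a plausible processor prefix and can be queued for human review before it joins the pattern list. The function name `candidate_prefixes`, the delimiter set, and the `min_distinct` threshold are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed delimiters between a processor prefix and the merchant name: '*' or whitespace.
_SPLIT = re.compile(r"[\s*]+")


def candidate_prefixes(names, min_distinct=3):
    """Map each leading token to its count of distinct remainders, keeping frequent ones."""
    remainders = defaultdict(set)
    for name in names:
        parts = _SPLIT.split(name.strip().upper(), maxsplit=1)
        # Skip names with no delimiter, or with an empty prefix or remainder.
        if len(parts) == 2 and parts[0] and parts[1]:
            remainders[parts[0]].add(parts[1])
    return {
        prefix: len(rest)
        for prefix, rest in remainders.items()
        if len(rest) >= min_distinct
    }


# Made-up sample data; real input would come from stored transaction names.
sample = [
    "SQ *BLUE BOTTLE",
    "SQ *JOES PIZZA",
    "SQ *CORNER DELI",
    "TST* TACO STAND",
    "TST* NOODLE BAR",
    "NETFLIX.COM",
]
candidates = candidate_prefixes(sample, min_distinct=3)
```

Because the output is a plain table of prefixes and counts, every proposed pattern can be inspected and approved by a person, which keeps the approach auditable.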
Acceptance criteria
- Normalization accuracy improves on edge cases not covered by Phase 1 patterns (novel merchant formats, regional processors, changed Plaid naming conventions)
- No regression on existing covered patterns (test suite must still pass)
- Approach is explainable / auditable (no black-box ML)
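The no-regression criterion could be checked with a golden-case comparison along the following lines. This is a sketch under assumptions: `find_regressions`, the `GOLDEN` cases, and the stand-in `phase2_candidate` normalizer are hypothetical and do not reflect the existing test suite.

```python
from typing import Callable, Dict, List, Tuple


def find_regressions(
    normalize: Callable[[str], str],
    golden: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (raw, expected, actual) for every golden case the normalizer gets wrong."""
    failures = []
    for raw, expected in golden.items():
        actual = normalize(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    return failures


# Hypothetical golden cases standing in for inputs Phase 1 already handles.
GOLDEN = {
    "SQ *BLUE BOTTLE": "BLUE BOTTLE",
    "PAYPAL *SPOTIFY": "SPOTIFY",
}


def phase2_candidate(name: str) -> str:
    """Stand-in for a Phase 2 normalizer: keep the text after the first '*'."""
    return name.split("*", 1)[-1].strip()


regressions = find_regressions(phase2_candidate, GOLDEN)
```

An empty `regressions` list means the candidate reproduces every golden output; any entry shows the raw name, the expected result, and what the candidate produced instead.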
Dependencies
References