feat: ML-based merchant normalization for Phase 2 auto-categorization

Follow-up to PR #133 (auto-categorization Phase 1).

## Background

Phase 1 normalizes transaction names with hardcoded patterns for known processors and merchants (Square, PayPal, Amazon, etc.), based on documented Plaid transaction formats. This works well for known processor prefixes and noise patterns.

The hardcoded approach has limits: novel merchant formats, regional processors, and evolving Plaid naming conventions may not be covered.
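
To make the limitation concrete, here is a minimal sketch of the hardcoded style. The patterns, the `normalize_hardcoded` name, and the sample strings are illustrative assumptions, not the actual Phase 1 code from PR #133.

```python
import re

# Hypothetical patterns for illustration only; not the real Phase 1 list.
PROCESSOR_PREFIXES = [
    re.compile(r"^SQ \*", re.IGNORECASE),      # Square-style prefix
    re.compile(r"^PAYPAL \*", re.IGNORECASE),  # PayPal-style prefix
]
# Trailing store or reference number, e.g. " 0423" or " #0423".
TRAILING_NOISE = re.compile(r"\s+#?\d{3,}$")


def normalize_hardcoded(name: str) -> str:
    """Strip known processor prefixes and trailing numeric noise."""
    result = name.strip()
    for pattern in PROCESSOR_PREFIXES:
        result = pattern.sub("", result)
    result = TRAILING_NOISE.sub("", result)
    # Collapse repeated whitespace and title-case the remainder.
    return " ".join(result.split()).title()


covered = normalize_hardcoded("SQ *BLUE BOTTLE COFFEE 0423")
# "RGNPAY*" is a made-up regional processor prefix that no pattern covers.
uncovered = normalize_hardcoded("RGNPAY*JOES DINER")
```

In this sketch the covered name comes out clean, while the made-up `RGNPAY*` prefix passes through untouched, which is the gap Phase 2 is meant to close.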

## Proposed approach for Phase 2

Investigate and implement a lightweight ML-based or rule-learning approach to merchant normalization:

1. **Option A: Bayesian-style frequency analysis**: track which normalization transformations correlate with high-confidence category assignments across users
2. **Option B: Pre-trained tokenizer**: use an open-source tokenizer or NER model trained on financial transaction data (e.g. spaCy NER with a financial corpus) to extract the merchant name
3. **Option C: Pattern extraction from data**: analyze real transaction names in the database to discover new noise patterns automatically, and periodically update the pattern list
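
As a rough illustration of Option C, the sketch below proposes candidate processor prefixes by counting how many distinct remainders follow each leading token. The `candidate_prefixes` function, the threshold, and the sample names are assumptions made for this example; a real implementation would read names from the database and would likely need a human review step before any pattern is added.

```python
from collections import defaultdict


def candidate_prefixes(names, min_distinct_remainders=3):
    """Return (leading_token, distinct_remainder_count) pairs.

    A leading token followed by many unrelated strings is more likely a
    processor prefix than part of a merchant name.
    """
    remainders = defaultdict(set)
    for raw in names:
        tokens = raw.upper().split()
        if len(tokens) < 2:
            continue  # a single token has no prefix/remainder split
        remainders[tokens[0]].add(" ".join(tokens[1:]))
    scored = [
        (prefix, len(rest))
        for prefix, rest in remainders.items()
        if len(rest) >= min_distinct_remainders
    ]
    # Most widely shared prefixes first; ties broken alphabetically.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Made-up sample data; "XPAY*" stands in for an unknown processor prefix.
sample_names = [
    "XPAY* JOES DINER",
    "XPAY* CITY GYM",
    "XPAY* BOOK NOOK",
    "STARBUCKS STORE 123",
    "STARBUCKS STORE 456",
]
candidates = candidate_prefixes(sample_names)
```

A heuristic this simple will also flag chains with many store numbers, so its output is best treated as a list of suggestions to review. That keeps the pattern list explicit and auditable.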

## Acceptance criteria
- Normalization accuracy improves on edge cases not covered by Phase 1 patterns, measured on a labeled test set of such transaction names
- No regression on existing covered patterns (test suite must still pass)
- Approach is explainable / auditable (no black-box ML)
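
One way the no-regression criterion could be checked is to replay a set of golden cases through the candidate normalizer. This is a sketch only: `find_regressions`, the golden cases, and the stand-in normalizer are hypothetical, and the real check would reuse the Phase 1 test suite.

```python
def find_regressions(golden_cases, candidate_normalize):
    """Return (raw, expected, actual) for each golden case the candidate breaks."""
    failures = []
    for raw, expected in golden_cases:
        actual = candidate_normalize(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    return failures


# Hypothetical golden cases standing in for names Phase 1 already handles.
GOLDEN_CASES = [
    ("SQ *BLUE BOTTLE COFFEE", "Blue Bottle Coffee"),
    ("PAYPAL *SPOTIFY", "Spotify"),
]


def stand_in_candidate(name):
    """Toy Phase 2 normalizer that only knows the Square-style prefix."""
    return name.removeprefix("SQ *").title()


regressions = find_regressions(GOLDEN_CASES, stand_in_candidate)
```

Here the toy candidate mishandles the PayPal case, so `regressions` contains one entry showing the raw input, the expected output, and what the candidate produced.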

## Dependencies
- Depends on Phase 1 landing (PR #133)
- Requires sufficient transaction data to train/test

## References
- Phase 1 design: PR #133
- Hardcoded patterns reference: Plaid transaction naming conventions
