feat: ML-based merchant normalization for Phase 2 auto-categorization #189

@moltboie

Follow-up to PR #133 (auto-categorization Phase 1).

Background

Phase 1 uses hardcoded patterns to normalize transaction names (Square, PayPal, Amazon, etc.) based on documented Plaid transaction formats. This works well for known processor prefixes and noise patterns.
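
For readers who have not seen PR #133, the hardcoded approach might look roughly like the sketch below. The prefixes, noise pattern, and the `normalize_name` function are all hypothetical illustrations; they are not taken from the Phase 1 code, and the real pattern list may differ.

```python
import re

# Hypothetical processor prefixes, for illustration only; the actual
# Phase 1 pattern list lives in PR #133 and may look different.
PROCESSOR_PREFIXES = [
    re.compile(r"^SQ \*", re.IGNORECASE),      # assumed Square-style prefix
    re.compile(r"^PAYPAL \*", re.IGNORECASE),  # assumed PayPal-style prefix
]

# Hypothetical noise pattern: a trailing store number such as " #1234".
TRAILING_STORE_NUMBER = re.compile(r"\s+#\d+$")


def normalize_name(raw: str) -> str:
    """Strip known processor prefixes and noise, then lowercase."""
    name = raw.strip()
    for prefix in PROCESSOR_PREFIXES:
        name = prefix.sub("", name, count=1)
    name = TRAILING_STORE_NUMBER.sub("", name)
    name = re.sub(r"\s+", " ", name)  # collapse runs of whitespace
    return name.strip().lower()
```

Anything that does not match one of the listed patterns passes through unchanged apart from whitespace cleanup and lowercasing, which is the limitation Phase 2 is meant to address.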

The hardcoded approach has limits: novel merchant formats, regional processors, and evolving Plaid naming conventions may not be covered.

Proposed approach for Phase 2

Investigate and implement a lightweight ML-based or rule-learning approach to merchant normalization:

  1. Option A: Bayesian-style frequency analysis. Track which normalization transformations correlate with high-confidence category assignments across users.
  2. Option B: Pre-trained NER model. Use an open-source named-entity recognition model trained on financial transaction data (e.g. spaCy NER with a financial corpus) to extract the merchant name.
  3. Option C: Pattern extraction from data. Analyze real transaction names in the database to discover new noise patterns automatically, and periodically update the pattern list.
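
As a rough illustration of Option C, one possible heuristic is to treat a leading token as a candidate processor prefix when it is followed by many distinct remainders. The function name `discover_noise_prefixes`, the `min_merchants` threshold, and the heuristic itself are assumptions made for this sketch, not a decided design.

```python
from collections import defaultdict


def discover_noise_prefixes(names, min_merchants=3):
    """Return (prefix, distinct_remainder_count) pairs, most frequent first.

    Heuristic (an assumption for this sketch): a first token that is
    followed by many different remainders is more likely a processor
    prefix than part of a merchant name.
    """
    remainders = defaultdict(set)
    for raw in names:
        tokens = raw.upper().split()
        if len(tokens) < 2:
            continue  # a single token cannot be split into prefix + name
        remainders[tokens[0]].add(" ".join(tokens[1:]))

    candidates = [
        (prefix, len(rest))
        for prefix, rest in remainders.items()
        if len(rest) >= min_merchants
    ]
    # Sort by count descending, then alphabetically for stable output.
    return sorted(candidates, key=lambda item: (-item[1], item[0]))
```

The output is a plain list of candidate prefixes with counts, so a maintainer could review it before anything is added to the pattern list, which keeps the approach auditable.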

Acceptance criteria

  • Normalization accuracy measurably improves on edge cases not covered by Phase 1 patterns (e.g. novel merchant formats, regional processors)
  • No regression on existing covered patterns (test suite must still pass)
  • Approach is explainable / auditable (no black-box ML)
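
The first two criteria could be checked mechanically along the lines of the sketch below. The helpers `evaluate` and `check_acceptance`, and the idea of keeping separate labeled lists of covered and edge-case names, are assumptions made for illustration; the issue does not specify how accuracy is measured.

```python
def evaluate(normalize, cases):
    """Return (accuracy, failures) for a normalizer over (raw, expected) pairs."""
    failures = []
    for raw, expected in cases:
        actual = normalize(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    accuracy = 1 - len(failures) / len(cases) if cases else 1.0
    return accuracy, failures


def check_acceptance(phase1, phase2, covered_cases, edge_cases):
    """Compare two normalizers against labeled covered and edge-case names."""
    # No regression: Phase 2 must still get every covered case right.
    _, regressions = evaluate(phase2, covered_cases)
    # Improvement: Phase 2 must beat Phase 1 on the edge cases.
    before, _ = evaluate(phase1, edge_cases)
    after, _ = evaluate(phase2, edge_cases)
    return {
        "regressions": regressions,
        "edge_accuracy_before": before,
        "edge_accuracy_after": after,
        "passes": not regressions and after > before,
    }
```

The `regressions` list names each failing input with its expected and actual output, so a failed check points directly at the patterns that broke.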

Dependencies

References
