Modernize TF-IDF scoring: BM25, IDF smoothing, small-corpus behavior #31

@chigichan24

Why

  • scripts/knowledge-graph/tfidf.ts is a clean textbook TF-IDF: weight = log(1+tf) * log(n/df), followed by L2 normalization. It works, but session corpora are heterogeneous in length (a 2-turn session vs a 200-turn session), and that hurts both retrieval and clustering quality.
  • Specific weaknesses:
    • No IDF smoothing: log(n/df) is volatile on small corpora.
    • No length normalization beyond L2: long sessions still dominate similarity scores.
    • maxDf = max(2, floor(n*0.8)) collapses to 2 whenever n ≤ 3 and is only 3 at n = 4, which is unhealthy on a small --sessions-dir.
    • Vocabulary admission is df ≥ 2, so a single rare typo that recurs in two sessions still enters the vocabulary.
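
To make the small-corpus weaknesses concrete, here is a minimal sketch that evaluates the two formulas quoted above. The function names currentMaxDf and rawIdf are hypothetical and are not taken from tfidf.ts; only the formulas come from this issue.

```typescript
// Hypothetical helper: the maxDf bound as quoted in this issue,
// max(2, floor(n * 0.8)), where n is the number of sessions.
function currentMaxDf(n: number): number {
  return Math.max(2, Math.floor(n * 0.8));
}

// Hypothetical helper: the unsmoothed IDF as quoted in this issue, log(n / df).
function rawIdf(n: number, df: number): number {
  return Math.log(n / df);
}

// maxDf for corpora of 1 to 6 sessions.
const maxDfBySize: number[] = [1, 2, 3, 4, 5, 6].map(currentMaxDf);
// → [2, 2, 2, 3, 4, 4]

// With n = 2, any term that passes df ≥ 2 has df = 2, so its IDF is log(1) = 0.
const idfTwoSessions: number = rawIdf(2, 2);
// → 0
```

If the formulas are applied exactly as quoted, a two-session corpus therefore gives every admitted term a weight of zero, which is one way the lack of smoothing shows up in practice.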

What

  • Move to BM25 scoring (in effect, TF-IDF with a saturating TF term and length normalization controlled by k1 and b). BM25 is the de facto standard for short/long document mixes.
  • Add IDF smoothing: BM25-style log((n - df + 0.5) / (df + 0.5) + 1) so small corpora do not produce wildly unstable weights.
  • Reconsider maxDf for small n: keep ratio-based logic but introduce an absolute cap that does not collapse to 2 by default.
  • Document the rationale in docs/knowledge-graph-algorithm.md (already referenced from CLAUDE.md).
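
The scoring change proposed above can be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: the function names are hypothetical, and the defaults k1 = 1.2 and b = 0.75 are commonly used values that this issue does not specify.

```typescript
// Smoothed IDF as written in this issue: log((n - df + 0.5) / (df + 0.5) + 1).
// It stays strictly positive for any 1 <= df <= n.
function smoothedIdf(n: number, df: number): number {
  return Math.log((n - df + 0.5) / (df + 0.5) + 1);
}

// BM25 weight of one term in one document (hypothetical helper).
// k1 controls how quickly term frequency saturates.
// b controls how strongly document length is normalized (0 = off, 1 = full).
// The defaults are assumptions, not values chosen by this issue.
function bm25Weight(
  tf: number,
  df: number,
  n: number,
  docLen: number,
  avgDocLen: number,
  k1: number = 1.2,
  b: number = 0.75,
): number {
  if (tf === 0) return 0;
  const lengthNorm = 1 - b + b * (docLen / avgDocLen);
  return (smoothedIdf(n, df) * (tf * (k1 + 1))) / (tf + k1 * lengthNorm);
}
```

Two properties follow directly from the formula: the same tf scores lower in a longer-than-average document, and the TF factor never exceeds k1 + 1, so a 200-turn session cannot grow a term's weight without bound.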

How

  • Refactor buildTfidf into buildBm25 (or keep the same name and document the change in the same commit). Output remains a per-document Float64Array vector so cosine similarity downstream is unchanged.
  • Update scripts/__tests__/tfidf.test.ts to assert the new properties (length normalization, smoothed IDF behavior on tiny corpora).
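
One possible shape for the refactor is sketched below. It assumes pre-tokenized input (string[][]), skips vocabulary filtering (the df ≥ 2 and maxDf rules) entirely, and keeps the L2 normalization step so that a dot product equals cosine similarity. The name buildBm25Sketch, its signature, and the k1/b defaults are hypothetical; the real buildTfidf signature is not shown in this issue.

```typescript
type Bm25Result = { vocabulary: string[]; vectors: Float64Array[] };

// Hypothetical sketch: one L2-normalized BM25 vector per tokenized document.
function buildBm25Sketch(docs: string[][], k1: number = 1.2, b: number = 0.75): Bm25Result {
  const n = docs.length;

  // Document frequency: number of documents containing each term.
  const df = new Map<string, number>();
  docs.forEach((doc) => {
    new Set(doc).forEach((term) => df.set(term, (df.get(term) ?? 0) + 1));
  });

  // Stable term order so every vector shares the same dimensions.
  const vocabulary = Array.from(df.keys()).sort();
  const index = new Map<string, number>();
  vocabulary.forEach((term, i) => index.set(term, i));

  const totalLen = docs.reduce((sum, doc) => sum + doc.length, 0);
  const avgLen = n > 0 ? totalLen / n : 0;

  const vectors = docs.map((doc) => {
    const vec = new Float64Array(vocabulary.length);
    const tf = new Map<string, number>();
    doc.forEach((term) => tf.set(term, (tf.get(term) ?? 0) + 1));

    const lengthNorm = avgLen > 0 ? 1 - b + b * (doc.length / avgLen) : 1;
    let sumSq = 0;
    tf.forEach((freq, term) => {
      const d = df.get(term)!;
      const idf = Math.log((n - d + 0.5) / (d + 0.5) + 1);
      const w = (idf * (freq * (k1 + 1))) / (freq + k1 * lengthNorm);
      vec[index.get(term)!] = w;
      sumSq += w * w;
    });

    // L2-normalize so downstream cosine similarity is a plain dot product.
    const l2 = Math.sqrt(sumSq);
    if (l2 > 0) {
      for (let i = 0; i < vec.length; i++) vec[i] /= l2;
    }
    return vec;
  });

  return { vocabulary, vectors };
}

// Cosine similarity for vectors that are already L2-normalized.
function cosine(a: Float64Array, b: Float64Array): number {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}
```

A tiny-corpus test could then assert that two sessions sharing a single term have a cosine similarity above zero, which the unsmoothed log(n/df) cannot deliver when n = 2.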

Notes

  • Land the tokenizer issues first. Tokenization quality currently dominates score quality, especially for Japanese, where an entire session collapses into a single token. Tuning the scoring layer before tokenization is fixed is polish on a broken foundation.

Metadata

  • Labels: enhancement (New feature or request)
  • Status: Todo