Modernize TF-IDF scoring: BM25, IDF smoothing, small-corpus behavior

# Why
- `scripts/knowledge-graph/tfidf.ts` is a clean textbook TF-IDF (`log(1+tf) * log(n/df)` + L2 norm). It works, but session corpora are heterogeneous in length (a 2-turn session vs a 200-turn session) and that hurts both retrieval and clustering quality.
- Specific weaknesses:
  - **No IDF smoothing**: `log(n/df)` is volatile on small corpora.
  - **No length normalization beyond L2**: long sessions still dominate similarity scores.
  - **`maxDf = max(2, floor(n*0.8))`** collapses to `2` whenever `n ≤ 3` (and is only `3` at `n = 4`), which is unhealthy on a small `--sessions-dir`.
  - **Vocabulary admission is df ≥ 2**, so a single rare typo that recurs in two sessions still enters the vocabulary.
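
To make the small-corpus behavior concrete, here is a minimal sketch that only re-evaluates the two formulas quoted above. The function names `currentMaxDf` and `rawIdf` are hypothetical and are not taken from `tfidf.ts`.

```ts
// Hypothetical helpers: they restate the formulas quoted in this issue,
// not the actual code in scripts/knowledge-graph/tfidf.ts.
function currentMaxDf(n: number): number {
  return Math.max(2, Math.floor(n * 0.8));
}

function rawIdf(n: number, df: number): number {
  return Math.log(n / df);
}

// Print the document-frequency ceiling and the raw IDF of a term with df = 2
// for a few tiny corpus sizes.
for (let n = 2; n <= 6; n++) {
  const idf = rawIdf(n, 2).toFixed(3);
  console.log(`n=${n} maxDf=${currentMaxDf(n)} idf(df=2)=${idf}`);
}
```

With admission at df ≥ 2, a corpus of `n = 2` can only admit terms that appear in both sessions, and for those `log(n/df)` is exactly `0`, so every admitted term carries zero weight.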

# What
- Move to **BM25** scoring (in effect, TF-IDF with a saturating TF term and document-length normalization, controlled by `k1` and `b` respectively). BM25 is the de-facto standard for short/long document mixes.
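- Add IDF smoothing: BM25-style `log((n - df + 0.5) / (df + 0.5) + 1)` so small corpora do not produce wildly unstable weights. The `+ 1` inside the log keeps the weight positive even when `df = n`, where `log(n/df)` is exactly 0.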
- Reconsider `maxDf` for small `n`: keep the ratio-based logic, but pick the absolute bound so that `maxDf` does not collapse to `2` by default.
- Document the rationale in `docs/knowledge-graph-algorithm.md` (already referenced from `CLAUDE.md`).
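
The two BM25 ingredients named above can be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: the function names are hypothetical, and the defaults `k1 = 1.2` and `b = 0.75` are commonly used values that this issue does not prescribe.

```ts
// Smoothed IDF as written in this issue. The ratio is positive for df <= n,
// so the argument of the log is > 1 and the result is always > 0.
function smoothedIdf(n: number, df: number): number {
  return Math.log((n - df + 0.5) / (df + 0.5) + 1);
}

// BM25 term-frequency component. It saturates towards (k1 + 1) as tf grows,
// and `b` controls how strongly documents longer than average are penalized.
// k1 = 1.2 and b = 0.75 are assumed defaults, not values from this issue.
function bm25Tf(
  tf: number,
  docLen: number,
  avgDocLen: number,
  k1: number = 1.2,
  b: number = 0.75,
): number {
  const lengthNorm = 1 - b + b * (docLen / avgDocLen);
  return (tf * (k1 + 1)) / (tf + k1 * lengthNorm);
}

// Weight of one term in one document.
function bm25Weight(
  tf: number, df: number, n: number, docLen: number, avgDocLen: number,
): number {
  return smoothedIdf(n, df) * bm25Tf(tf, docLen, avgDocLen);
}
```

For the same `tf`, a 200-token session gets a smaller `bm25Tf` than a 50-token session when the average is 100, which is the length normalization that L2 alone does not provide.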

# How
- Refactor `buildTfidf` into `buildBm25` (or keep the same name and document the change in the same commit). Output remains a per-document `Float64Array` vector so cosine similarity downstream is unchanged.
- Update `scripts/__tests__/tfidf.test.ts` to assert the new properties (length normalization, smoothed IDF behavior on tiny corpora).
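
A rough shape for the refactor might look like the sketch below. It assumes already-tokenized input (`string[][]`), a hypothetical return shape of `{ vocab, vectors }`, and assumed defaults `k1 = 1.2`, `b = 0.75`; the real `buildTfidf` signature is not shown in this issue. Vocabulary admission (`df ≥ 2`, `maxDf`) is left out for brevity, and L2 normalization is kept so that cosine similarity stays a plain dot product.

```ts
// Hypothetical sketch of buildBm25; signature and return shape are assumptions.
function buildBm25(
  docs: string[][],
  k1: number = 1.2,
  b: number = 0.75,
): { vocab: string[]; vectors: Float64Array[] } {
  const n = docs.length;
  let totalLen = 0;
  for (const doc of docs) totalLen += doc.length;
  const avgLen = n > 0 && totalLen > 0 ? totalLen / n : 1;

  // Document frequency: number of documents containing each term.
  const df = new Map<string, number>();
  for (const doc of docs) {
    new Set(doc).forEach((term) => df.set(term, (df.get(term) || 0) + 1));
  }
  const vocab = Array.from(df.keys()).sort();
  const index = new Map<string, number>();
  vocab.forEach((term, i) => index.set(term, i));

  const vectors = docs.map((doc) => {
    const tf = new Map<string, number>();
    for (const term of doc) tf.set(term, (tf.get(term) || 0) + 1);

    const vec = new Float64Array(vocab.length);
    const lengthNorm = 1 - b + b * (doc.length / avgLen);
    tf.forEach((f, term) => {
      const d = df.get(term) as number;
      const idf = Math.log((n - d + 0.5) / (d + 0.5) + 1);
      vec[index.get(term) as number] = (idf * f * (k1 + 1)) / (f + k1 * lengthNorm);
    });

    // L2-normalize so downstream cosine similarity is a dot product.
    let sumSq = 0;
    for (let i = 0; i < vec.length; i++) sumSq += vec[i] * vec[i];
    const norm = Math.sqrt(sumSq);
    if (norm > 0) for (let i = 0; i < vec.length; i++) vec[i] /= norm;
    return vec;
  });

  return { vocab, vectors };
}

// Cosine similarity of two already L2-normalized vectors.
function cosine(x: Float64Array, y: Float64Array): number {
  let dot = 0;
  for (let i = 0; i < x.length; i++) dot += x[i] * y[i];
  return dot;
}
```

Properties that the updated tests could assert against such a function include: every non-empty document yields a unit-length vector, and a term shared by both documents of a two-document corpus still receives a non-zero weight.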

# Notes
- Land the fixes for the tokenizer issues first. Tokenization quality currently dominates score quality, especially for Japanese, where an entire session collapses into a single token. Tuning the scoring layer before tokenization is fixed is polish on a broken foundation.
