You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
scripts/knowledge-graph/tfidf.ts is a clean textbook TF-IDF (log(1+tf) * log(n/df) + L2 norm). It works, but session corpora are heterogeneous in length (a 2-turn session vs a 200-turn session) and that hurts both retrieval and clustering quality.
Specific weaknesses:
No IDF smoothing: log(n/df) is volatile on small corpora.
No length normalization beyond L2: long sessions still dominate similarity scores.
maxDf = max(2, floor(n*0.8)) collapses to 2 whenever n < 5, which is unhealthy on a small --sessions-dir.
Vocabulary admission is df ≥ 2, so a single rare typo that recurs in two sessions still enters the vocabulary.
What
Move to BM25 scoring (or equivalently TF-IDF with sublinear TF + length normalization controlled by k1, b). BM25 is the de-facto standard for short/long document mixes.
Add IDF smoothing: BM25-style log((n - df + 0.5) / (df + 0.5) + 1) so small corpora do not produce wildly unstable weights.
Reconsider maxDf for small n: keep ratio-based logic but introduce an absolute cap that does not collapse to 2 by default.
Document the rationale in docs/knowledge-graph-algorithm.md (already referenced from CLAUDE.md).
How
Refactor buildTfidf into buildBm25 (or keep the same name and document the change in the same commit). Output remains a per-document Float64Array vector so cosine similarity downstream is unchanged.
Update scripts/__tests__/tfidf.test.ts to assert the new properties (length normalization, smoothed IDF behavior on tiny corpora).
Notes
Land the tokenizer issues first. Tokenization quality currently dominates score quality — especially for Japanese, where one session collapses into a single token. Tuning the scoring layer before tokenization is fixed is polish on a broken foundation.
Why
scripts/knowledge-graph/tfidf.tsis a clean textbook TF-IDF (log(1+tf) * log(n/df)+ L2 norm). It works, but session corpora are heterogeneous in length (a 2-turn session vs a 200-turn session) and that hurts both retrieval and clustering quality.log(n/df)is volatile on small corpora.maxDf = max(2, floor(n*0.8))collapses to2whenevern < 5, which is unhealthy on a small--sessions-dir.What
k1,b). BM25 is the de-facto standard for short/long document mixes.log((n - df + 0.5) / (df + 0.5) + 1)so small corpora do not produce wildly unstable weights.maxDffor smalln: keep ratio-based logic but introduce an absolute cap that does not collapse to 2 by default.docs/knowledge-graph-algorithm.md(already referenced fromCLAUDE.md).How
buildTfidfintobuildBm25(or keep the same name and document the change in the same commit). Output remains a per-documentFloat64Arrayvector so cosine similarity downstream is unchanged.scripts/__tests__/tfidf.test.tsto assert the new properties (length normalization, smoothed IDF behavior on tiny corpora).Notes