@nlptools/distance

High-performance string distance and similarity algorithms, implemented in pure TypeScript

Features

Pure TypeScript implementation, zero native dependencies
Edit distance: Levenshtein, LCS (Myers O(ND) and DP)
Token similarity: Jaccard, Cosine, Sorensen-Dice (character multiset and n-gram variants)
Hash-based deduplication: SimHash, MinHash, LSH
Fuzzy search: FuzzySearch class and findBestMatch with multi-algorithm support
Diff: based on @algorithm.ts/diff (Myers and DP backends)
All distance algorithms include normalized similarity variants (0-1 range)

Installation

# Install with npm
npm install @nlptools/distance

# Install with yarn
yarn add @nlptools/distance

# Install with pnpm
pnpm add @nlptools/distance

Usage

Edit Distance

import { levenshtein, levenshteinNormalized } from "@nlptools/distance";

levenshtein("kitten", "sitting"); // 3
levenshteinNormalized("kitten", "sitting"); // 0.571

LCS (Longest Common Subsequence)

import { lcsDistance, lcsNormalized, lcsLength, lcsPairs } from "@nlptools/distance";

lcsDistance("abcde", "ace"); // 2  (= 5 + 3 - 2 * 3)
lcsNormalized("abcde", "ace"); // 0.75
lcsLength("abcde", "ace"); // 3
lcsPairs("abcde", "ace"); // [[0,0], [2,1], [4,2]]

By default uses Myers O(ND) algorithm. Switch to DP with algorithm: "dp".

Token Similarity (Character Multiset)

Based on character frequency maps (Counter), matching the textdistance crate semantics:

import { jaccard, cosine, sorensen } from "@nlptools/distance";

jaccard("abc", "abd"); // 0.667
cosine("hello", "hallo"); // 0.8
sorensen("test", "text"); // 0.75

N-Gram Variants

import { jaccardNgram, cosineNgram, sorensenNgram } from "@nlptools/distance";

jaccardNgram("hello", "hallo"); // 0.333  (bigram-based)
cosineNgram("hello", "hallo"); // 0.5     (bigram-based)

SimHash (Document Fingerprinting)

import { simhash, hammingDistance, SimHasher } from "@nlptools/distance";

// Function-based
const fp1 = simhash(["hello", "world"]);
const fp2 = simhash(["hello", "earth"]);
hammingDistance(fp1, fp2); // small = similar

// Class-based
const hasher = new SimHasher();
const a = hasher.hash(["hello", "world"]);
const b = hasher.hash(["hello", "earth"]);
hasher.isDuplicate(a, b); // true if hamming distance <= 3

MinHash (Jaccard Similarity Estimation)

import { MinHash } from "@nlptools/distance";

const mh1 = new MinHash({ numHashes: 128 });
mh1.update("hello");
mh1.update("world");

const mh2 = new MinHash({ numHashes: 128 });
mh2.update("hello");
mh2.update("earth");

MinHash.estimate(mh1.digest(), mh2.digest()); // ~0.67

LSH (Approximate Nearest Neighbor Search)

import { MinHash } from "@nlptools/distance";
import { LSH } from "@nlptools/distance";

const lsh = new LSH({ numBands: 16, numHashes: 128 });

const mh = new MinHash({ numHashes: 128 });
mh.update("hello");
mh.update("world");
lsh.insert("doc1", mh.digest());

// Query for similar documents
const query = lsh.query(mh.digest(), 0.5);
// => [["doc1", 0.67]]

Fuzzy Search

import { FuzzySearch, findBestMatch } from "@nlptools/distance";

// String array search
const search = new FuzzySearch(["apple", "banana", "cherry"]);
search.search("aple");
// => [{ item: "apple", score: 0.8, index: 0 }]

// Object array with weighted keys
const books = [
  { title: "Old Man's War", author: "John Scalzi" },
  { title: "The Great Gatsby", author: "F. Scott Fitzgerald" },
];
const bookSearch = new FuzzySearch(books, {
  keys: [
    { name: "title", weight: 0.7 },
    { name: "author", weight: 0.3 },
  ],
  algorithm: "cosine",
  threshold: 0.3,
});
bookSearch.search("old man");
// => [{ item: { title: "Old Man's War", ... }, score: 0.52, index: 0 }]

// One-shot best match
findBestMatch("kitten", ["sitting", "kit", "mitten"]);
// => { item: "kit", score: 0.5, index: 1 }

// With per-key details
const detailed = new FuzzySearch(books, {
  keys: [{ name: "title" }, { name: "author" }],
  includeMatchDetails: true,
});
detailed.search("gatsby");
// => [{ item: ..., score: 0.45, index: 1, matches: { title: 0.6, author: 0.1 } }]

Diff

import { diff, DiffType } from "@nlptools/distance";

const result = diff("abc", "ac");
// => [
//   { type: DiffType.COMMON, tokens: "a" },
//   { type: DiffType.REMOVED, tokens: "b" },
//   { type: DiffType.COMMON, tokens: "c" },
// ]

API Reference

Edit Distance

Function	Description	Returns
`levenshtein(a, b)`	Levenshtein edit distance	`number`
`levenshteinNormalized(a, b)`	Normalized similarity	`number` (0-1)
`lcsDistance(a, b, algorithm?)`	LCS distance	`number`
`lcsNormalized(a, b, algorithm?)`	Normalized LCS similarity	`number` (0-1)
`lcsLength(a, b, algorithm?)`	LCS length	`number`
`lcsPairs(a, b, algorithm?)`	LCS matching pairs	`[number, number][]`

Token Similarity

Function	Description	Returns
`jaccard(a, b)`	Jaccard similarity (character multiset)	`number` (0-1)
`jaccardNgram(a, b, n?)`	Jaccard on character n-grams	`number` (0-1)
`cosine(a, b)`	Cosine similarity (character multiset)	`number` (0-1)
`cosineNgram(a, b, n?)`	Cosine on character n-grams	`number` (0-1)
`sorensen(a, b)`	Sorensen-Dice coefficient (character multiset)	`number` (0-1)
`sorensenNgram(a, b, n?)`	Sorensen-Dice on character n-grams	`number` (0-1)

Hash-Based Deduplication

Function / Class	Description
`simhash(features, options?)`	Generate 64-bit fingerprint as `bigint`
`hammingDistance(a, b)`	Hamming distance between two fingerprints
`hammingSimilarity(a, b, bits?)`	Normalized Hamming similarity
`SimHasher`	Class with `hash()`, `distance()`, `similarity()`, `isDuplicate()`
`MinHash`	Class with `update()`, `digest()`, `estimate()`
`MinHash.estimate(sig1, sig2)`	Static: estimate Jaccard from signatures
`LSH`	Class with `insert()`, `query()`, `remove()`

Fuzzy Search

Function / Class	Description
`FuzzySearch<T>(collection, options?)`	Search engine with dynamic collection management
`findBestMatch(query, collection, options?)`	One-shot convenience: returns best match or `null`

FuzzySearch options:

Option	Type	Default	Description
`algorithm`	`BuiltinAlgorithm \| SimilarityFn`	`"levenshtein"`	Similarity algorithm to use
`keys`	`ISearchKey[]`	`[]`	Object fields to search on
`threshold`	`number`	`0`	Min similarity score (0-1)
`limit`	`number`	`Infinity`	Max results to return
`caseSensitive`	`boolean`	`false`	Case-insensitive by default
`includeMatchDetails`	`boolean`	`false`	Include per-key scores
`lsh`	`{ numHashes?, numBands? }`	—	Enable LSH for large datasets

Built-in algorithms: "levenshtein", "lcs", "jaccard", "jaccardNgram", "cosine", "cosineNgram", "sorensen", "sorensenNgram"

Diff

Function	Description	Returns
`diff(a, b, options?)`	Sequence diff (Myers or DP)	`IDiffItem<T>[]`

Types

Type	Description
`DiffType`	Enum: `ADDED`, `REMOVED`, `COMMON`
`IDiffItem<T>`	Diff result item with type and tokens
`IDiffOptions<T>`	Options for diff (equals, lcs algorithm)
`ISimHashOptions`	Options for SimHash (bits, hashFn)
`IMinHashOptions`	Options for MinHash (numHashes, seed)
`ILSHOptions`	Options for LSH (numBands, numHashes)
`IFuzzySearchOptions`	Options for FuzzySearch constructor
`IFindBestMatchOptions`	Options for findBestMatch function
`ISearchKey`	Searchable key config (name, weight, getter)
`ISearchResult<T>`	Search result with item, score, index
`SimilarityFn`	`(a: string, b: string) => number` in [0,1]

Performance

Benchmark: same test data across all runtimes. TS/WASM via vitest bench (V8 JIT), Rust via cargo test --release. Unit: microseconds per operation (us/op).

Edit Distance

Algorithm	Size	TS (V8 JIT)	WASM (via JS)	Rust (native)
levenshtein	Short (<10)	0.3	1.0	0.24
levenshtein	Medium (10-100)	1.3	4.8	2.00
levenshtein	Long (>200)	13.9	102.3	61.77
levenshteinNorm	Short	0.3	1.0	0.19
lcs	Short (<10)	1.7	1.9	0.69
lcs	Medium (10-100)	6.8	10.1	7.70
lcs	Long (>200)	216.0	161.8	151.84
lcsNorm	Short	1.7	1.9	0.42

Token Similarity (Character Multiset)

Algorithm	Size	TS (V8 JIT)	WASM (via JS)	Rust (native)
jaccard	Short (<10)	0.8	3.4	0.63
jaccard	Medium (10-100)	0.8	8.6	2.67
jaccard	Long (>200)	1.5	18.9	7.25
cosine	Short (<10)	1.0	2.6	0.43
cosine	Medium (10-100)	0.8	7.0	1.56
cosine	Long (>200)	1.7	17.2	6.23
sorensen	Short (<10)	0.7	2.6	0.56
sorensen	Medium (10-100)	0.7	7.0	2.27
sorensen	Long (>200)	1.4	17.4	6.48

Bigram Variants

Algorithm	Size	TS (V8 JIT)	WASM (via JS)	Rust (native)
jaccardBigram	Short (<10)	1.1	3.5	0.67
jaccardBigram	Medium (10-100)	7.5	18.1	4.80
cosineBigram	Short (<10)	0.7	2.8	0.43
cosineBigram	Medium (10-100)	5.4	14.0	4.04

TS implementations use Int32Array ASCII fast path + integer-encoded bigrams, avoiding JS-WASM boundary overhead. For compute-heavy algorithms on long strings (e.g. LCS), WASM via JS and Rust native can outperform TS due to native computation advantage outweighing the boundary cost.

Fuzzy Search: NLPTools vs Fuse.js

Benchmark: 20 items in collection, 6 queries per iteration, 1000 iterations. Unit: milliseconds per operation (ms/op). Algorithm: levenshtein (default).

Scenario	NLPTools	Fuse.js
Setup (constructor)	0.0002	0.0050
Search (string array)	0.0114	0.1077
Search (object, 1 key)	0.0176	0.3308
Search (object, 2 keys)	0.0289	0.6445

Both libraries return identical top-1 results for all test queries. NLPTools scores are normalized similarity (0-1, higher is better); Fuse.js uses Bitap error scores (0 = perfect, lower is better).

Dependencies

fastest-levenshtein — fastest JS Levenshtein implementation
@algorithm.ts/lcs — Myers and DP Longest Common Subsequence
@algorithm.ts/diff — Sequence diff built on LCS

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

@nlptools/distance

Features

Installation

Usage

Edit Distance

LCS (Longest Common Subsequence)

Token Similarity (Character Multiset)

N-Gram Variants

SimHash (Document Fingerprinting)

MinHash (Jaccard Similarity Estimation)

LSH (Approximate Nearest Neighbor Search)

Fuzzy Search

Diff

API Reference

Edit Distance

Token Similarity

Hash-Based Deduplication

Fuzzy Search

Diff

Types

Performance

Edit Distance

Token Similarity (Character Multiset)

Bigram Variants

Fuzzy Search: NLPTools vs Fuse.js

Dependencies

License

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

@nlptools/distance

Features

Installation

Usage

Edit Distance

LCS (Longest Common Subsequence)

Token Similarity (Character Multiset)

N-Gram Variants

SimHash (Document Fingerprinting)

MinHash (Jaccard Similarity Estimation)

LSH (Approximate Nearest Neighbor Search)

Fuzzy Search

Diff

API Reference

Edit Distance

Token Similarity

Hash-Based Deduplication

Fuzzy Search

Diff

Types

Performance

Edit Distance

Token Similarity (Character Multiset)

Bigram Variants

Fuzzy Search: NLPTools vs Fuse.js

Dependencies

License