Fast Arabic morphological disambiguation and stemming, implemented as a Rust engine with Python bindings. It is an almost drop-in replacement for the CAMeL Tools MLE disambiguator. Support for other disambiguators, such as sinatools, is coming soon.
Tested on the Hindawi Books dataset, Apple M1:
| Workload | fast-disambig | CAMeL Tools | Speedup |
|---|---|---|---|
| Single text | 38ms | 340ms | 9x |
| 491 book chapters (7.1M chars) | 19s | 19m 26s | 61x |
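
The Speedup column is the CAMeL Tools time divided by the fast-disambig time. The sketch below recomputes it from the durations in the table. The helper names `to_seconds` and `speedup` are made up for this illustration and are not part of fast-disambig or `benchmark.py`.

```python
import re

# Seconds per unit for the duration formats used in the table above.
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def to_seconds(duration: str) -> float:
    """Convert a string such as '38ms' or '19m 26s' to seconds."""
    # "ms" is listed before "s" and "m" so that '38ms' is not read as minutes.
    parts = re.findall(r"(\d+(?:\.\d+)?)\s*(ms|s|m)", duration)
    if not parts:
        raise ValueError(f"unrecognized duration: {duration!r}")
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def speedup(baseline: str, fast: str) -> float:
    """Return how many times faster `fast` is than `baseline`."""
    return to_seconds(baseline) / to_seconds(fast)


if __name__ == "__main__":
    print(f"Single text: {speedup('340ms', '38ms'):.1f}x")
    print(f"491 chapters: {speedup('19m 26s', '19s'):.1f}x")
```

Both ratios round to the values reported in the table (9x and 61x).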
Reproduce on your machine:
uv run benchmark.py
Share your results in an issue!
pip install fast-disambig
or with uv:
uv pip install fast-disambig
uv add fast-disambig
Requires CAMeL Tools data files in ~/.camel_tools/data/. If they are missing, they are downloaded automatically on first use. (A replacement for the camel_data CLI is coming soon.)
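
To see ahead of time whether the first use will trigger a download, you can check the data directory yourself. This is a minimal sketch using only the standard library. It assumes a non-empty `~/.camel_tools/data/` directory means the data is present; the check fast-disambig performs internally may be stricter. The function names are hypothetical and not part of the library.

```python
from pathlib import Path
from typing import Optional


def camel_data_dir(home: Optional[Path] = None) -> Path:
    """Return the expected CAMeL Tools data directory under `home`."""
    base = home if home is not None else Path.home()
    return base / ".camel_tools" / "data"


def has_camel_data(home: Optional[Path] = None) -> bool:
    """Return True if the data directory exists and is not empty.

    Assumption: any content in the directory counts as "data present".
    """
    data_dir = camel_data_dir(home)
    return data_dir.is_dir() and any(data_dir.iterdir())


if __name__ == "__main__":
    if has_camel_data():
        print(f"Found data in {camel_data_dir()}")
    else:
        print("No data found; expect a download on first use.")
```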
git clone https://github.com/hadikhamoud/fast-disambig
cd fast-disambig
pip install maturin
maturin develop --release

import fast_disambig
dis = fast_disambig.camel.MLEDisambiguator("calima-msa-r13")
results = dis.disambiguate(["والكتاب", "الجميل"])

stemmer = fast_disambig.camel.Stemmer()
# light stemming
stemmer.stem("والكتاب الجميل")
# 'و[+]ال[+]كتاب ال[+]جميل'
stemmer.stem("وَالْكِتَابُ الْجَمِيلُ", preserve_diacritics=True)
# 'وَ[+]الْ[+]كِتَابُ الْ[+]جَمِيلُ'
stemmer.stem("والكتاب الجميل", sep="_")
# 'و_ال_كتاب ال_جميل'
stemmer.stem("والكتاب الجميل", scheme="d3seg")
# 'و[+]ال[+]كتاب ال[+]جميل'
# fallback: try d3tok first, then d3seg, then bwtok if merge fails
stemmer.stem("والكتاب الجميل", fallback=["d3seg", "bwtok"])
# 'و[+]ال[+]كتاب ال[+]جميل'

Disable cache:
stemmer = fast_disambig.camel.Stemmer(cache_size=0)

fast_disambig.camel.tokenize("والكتاب الجميل", "full")
# ['والكتاب', ' ', 'الجميل']
fast_disambig.camel.tokenize("Hello عالم 123!", "full")
# ['Hello', ' ', 'عالم', ' ', '123', '!']
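
The stemmer examples above return a single string in which words are separated by spaces and morphemes by a separator (`[+]` by default, or whatever is passed as `sep`). For downstream processing it can be handy to turn that string into nested lists. The sketch below is plain string handling applied to the output format shown above. `split_morphemes` is a hypothetical helper, not part of fast-disambig.

```python
from typing import List


def split_morphemes(stemmed: str, sep: str = "[+]") -> List[List[str]]:
    """Split stemmer output into one list of morphemes per word.

    Assumes words are separated by whitespace and morphemes by `sep`,
    as in the example outputs shown above.
    """
    return [word.split(sep) for word in stemmed.split()]


if __name__ == "__main__":
    print(split_morphemes("و[+]ال[+]كتاب ال[+]جميل"))
    # [['و', 'ال', 'كتاب'], ['ال', 'جميل']]
    print(split_morphemes("و_ال_كتاب ال_جميل", sep="_"))
    # [['و', 'ال', 'كتاب'], ['ال', 'جميل']]
```

Pass the same `sep` value here that you passed to `stemmer.stem`, otherwise each word comes back as a single unsplit item.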