fast-disambig

Fast Arabic morphological disambiguation and stemming: a Rust engine with Python bindings. It is an almost drop-in replacement for the CAMeL Tools MLE disambiguator. Support for other disambiguators, such as SinaTools, is planned.

Benchmark

Tested on the Hindawi Books dataset, Apple M1:

| Workload | fast-disambig | CAMeL Tools | Speedup |
|---|---|---|---|
| Single text | 38ms | 340ms | 9x |
| 491 book chapters (7.1M chars) | 19s | 19m 26s | 61x |
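
As a quick sanity check on the speedup column, the ratios can be recomputed from the reported timings. This is only arithmetic on the numbers listed above; the `speedup` helper and the `timings` dictionary are illustrative names and are not part of fast-disambig or `benchmark.py`.

```python
# Timings copied from the benchmark figures above, converted to seconds.
# Each entry is (fast-disambig, CAMeL Tools).
timings = {
    "Single text": (0.038, 0.340),
    "491 book chapters": (19.0, 19 * 60 + 26),
}


def speedup(fast_seconds, baseline_seconds):
    """Return how many times faster the fast run is than the baseline."""
    return baseline_seconds / fast_seconds


for workload, (fast, baseline) in timings.items():
    print(f"{workload}: {speedup(fast, baseline):.1f}x")
```

Rounded to whole numbers, the two ratios match the 9x and 61x figures reported above.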

Reproduce on your machine:

uv run benchmark.py

Share your results in an issue!

Install

pip install fast-disambig

or with uv:

uv pip install fast-disambig   # install into the current environment
uv add fast-disambig           # or add it as a project dependency

Requires the CAMeL Tools data files in ~/.camel_tools/data/. If they are missing, they are downloaded automatically on first use. (A replacement for the camel_data CLI is planned.)

Build from source

git clone https://github.com/hadikhamoud/fast-disambig
cd fast-disambig
pip install maturin
maturin develop --release

Usage

Disambiguator

import fast_disambig

dis = fast_disambig.camel.MLEDisambiguator("calima-msa-r13")  

results = dis.disambiguate(["والكتاب", "الجميل"])

Stemmer

import fast_disambig

stemmer = fast_disambig.camel.Stemmer()

# Light stemming

stemmer.stem("والكتاب الجميل")
# 'و[+]ال[+]كتاب ال[+]جميل'

stemmer.stem("وَالْكِتَابُ الْجَمِيلُ", preserve_diacritics=True)
# 'وَ[+]الْ[+]كِتَابُ الْ[+]جَمِيلُ'

stemmer.stem("والكتاب الجميل", sep="_")
# 'و_ال_كتاب ال_جميل'

stemmer.stem("والكتاب الجميل", scheme="d3seg")
# 'و[+]ال[+]كتاب ال[+]جميل'

# Fallback chain: try d3tok first; if the merge fails, fall back to d3seg, then bwtok
stemmer.stem("والكتاب الجميل", fallback=["d3seg", "bwtok"])
# 'و[+]ال[+]كتاب ال[+]جميل'

Disable cache:

stemmer = fast_disambig.camel.Stemmer(cache_size=0)
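
The stemmer returns a single string in which the segments of each word are joined by the separator. If you need the segments as lists, a minimal sketch such as the following can split that string. `split_segments` is a hypothetical helper, not part of fast-disambig, and it assumes that words are separated by whitespace and that the separator does not otherwise occur inside a segment.

```python
def split_segments(stemmed, sep="[+]"):
    """Split stemmer output into one list of segments per word.

    Assumes words are whitespace-delimited and `sep` only appears
    between segments (both are assumptions about the output format).
    """
    return [word.split(sep) for word in stemmed.split()]


# Output strings copied from the stemming examples above.
default_output = "و[+]ال[+]كتاب ال[+]جميل"
underscore_output = "و_ال_كتاب ال_جميل"

print(split_segments(default_output))
print(split_segments(underscore_output, sep="_"))
```

Both calls yield the same nested list, since the two strings differ only in the separator.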

Tokenizer

fast_disambig.camel.tokenize("والكتاب الجميل", "full")
# ['والكتاب', ' ', 'الجميل']

fast_disambig.camel.tokenize("Hello عالم 123!", "full")
# ['Hello', ' ', 'عالم', ' ', '123', '!']
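
In "full" mode the tokenizer keeps whitespace as separate tokens, while the disambiguator example above is called with a plain list of words. If you want to pass tokenizer output on as a word list, one option is to drop the whitespace tokens first. This is a sketch under that assumption; `drop_whitespace` is a hypothetical helper, not part of fast-disambig.

```python
def drop_whitespace(tokens):
    """Keep only tokens that contain at least one non-whitespace character."""
    return [tok for tok in tokens if tok.strip()]


# Token list copied from the tokenizer example above.
tokens = ["Hello", " ", "عالم", " ", "123", "!"]
print(drop_whitespace(tokens))
```

Punctuation and digits are kept; only the pure-whitespace entries are removed.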
