Description
@LinguList as discussed, here's a list of algorithms I would like to compare in terms of how well they perform on small wordlists.
Baselines
- Byte-Pair Encoding (Gage, 1994; Sennrich et al., 2016)
- WordPiece (Schuster and Nakajima, 2012)
- Random
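For the random baseline, something along these lines might be enough. This is only a minimal sketch: the function name `random_segmentation`, the `boundary_prob` parameter, and its default value are my own assumptions, not taken from any of the cited papers.

```python
import random


def random_segmentation(word, boundary_prob=0.2, rng=None):
    """Split `word` by placing a boundary at each internal position
    independently with probability `boundary_prob` (assumed scheme)."""
    rng = rng or random.Random()
    morphs, start = [], 0
    for i in range(1, len(word)):
        # Decide independently for each position between two characters.
        if rng.random() < boundary_prob:
            morphs.append(word[start:i])
            start = i
    # The remainder after the last boundary is the final segment.
    morphs.append(word[start:])
    return morphs


# A fixed seed keeps the baseline reproducible across runs.
print(random_segmentation("unbelievable", rng=random.Random(42)))
```

Setting `boundary_prob` to the boundary rate observed in the gold segmentations would probably make this a fairer baseline than an arbitrary constant.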
Segmentation Algorithms
- LSV (Harris, 1955) in different variations (Hafer and Weiss, 1974; Hammarström, 2009; Çöltekin, 2010)
- Morfessor (Creutz and Lagus, 2005)
- Linguistica (Goldsmith, 2001; Lee and Goldsmith, 2016)
- MorphAGram (Eskander et al., 2020)
- "Square Entropy" (Medina-Urrea, 2007; Méndez-Cruz, 2016)
Morfessor and Linguistica are already available as Python packages which seem to be actively maintained, and there is an open Python implementation for MorphAGram as well. The other algorithms seem to be fairly easy to implement.
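To give an idea of how small such an implementation could be, here is a rough sketch of a simple cutoff-style LSV segmenter. The function names, the `threshold` default, and the choice to count the end of a word as one possible successor (marked `#`) are assumptions for illustration; the variants in the cited papers differ in exactly these details.

```python
from collections import defaultdict


def successor_varieties(wordlist):
    """Map each prefix to the set of characters that follow it in the wordlist."""
    successors = defaultdict(set)
    for word in wordlist:
        for i in range(1, len(word)):
            successors[word[:i]].add(word[i])
        # Assumption: a word ending here counts as one more successor type.
        successors[word].add("#")
    return successors


def segment_lsv(word, successors, threshold=2):
    """Insert a boundary after every prefix whose successor variety
    reaches `threshold` (a plain cutoff, no peak or plateau detection)."""
    morphs, start = [], 0
    for i in range(1, len(word)):
        if len(successors.get(word[:i], ())) >= threshold:
            morphs.append(word[start:i])
            start = i
    morphs.append(word[start:])
    return morphs


wordlist = ["walk", "walks", "walked", "walking", "talk", "talks", "talked"]
successors = successor_varieties(wordlist)
print(segment_lsv("walked", successors))   # ['walk', 'ed']
print(segment_lsv("walking", successors))  # ['walk', 'ing']
```

On a toy list like this the cutoff works well because the stems share many continuations; on real wordlists with ~1,000 items, the threshold would probably need tuning, and short prefixes tend to have high variety for reasons unrelated to morphology.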
I am especially interested in MorphAGram and the "Square Entropy" methods, since they are the only ones I could find that were actually evaluated on small wordlists of ~1,000 items. The other methods listed above are frequently mentioned in the literature and seem to be fairly well established, and Morfessor and Linguistica have the obvious advantage of already being available as Python packages. There are some other methods that could be interesting later on, but I would focus on these first.