MorSeg study scope

@LinguList as discussed, here's a list of algorithms I would like to compare in terms of how well they perform on small wordlists.

## Baselines
* **Byte-Pair Encoding** ([Gage, 1994](http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM); [Sennrich et al., 2016](https://arxiv.org/pdf/1508.07909))
* **WordPiece** ([Schuster and Nakajima, 2012](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf))
* **Random**
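
To make the BPE baseline concrete, here is a minimal sketch of how merges could be learned from a plain wordlist and then applied to segment a word. It assumes every word counts once (no token frequencies) and uses no end-of-word marker, which differs from the setup in Sennrich et al.; the function names `learn_bpe`, `merge_pair`, and `segment_bpe` are made up for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by its concatenation."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words.

    Assumption: each word counts once, since a wordlist has no frequencies.
    """
    corpus = [tuple(word) for word in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += 1
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself to stay deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


def segment_bpe(word, merges):
    """Segment `word` by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    wordlist = ["walked", "talked", "walks", "talks"]
    merges = learn_bpe(wordlist, num_merges=2)
    print(segment_bpe("talked", merges))
```

The number of merges is the only hyperparameter here, and on a wordlist of ~1,000 items it would probably have to be tuned per dataset.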

## Segmentation Algorithms
* **LSV** ([Harris, 1955](https://www.jstor.org/stable/411036?origin=crossref)) in different variations ([Hafer and Weiss, 1974](https://www.sciencedirect.com/science/article/pii/0020027174900448); [Hammarström, 2009](https://d1wqtxts1xzle7.cloudfront.net/31050581/10.1.1.156.7898-libre.pdf?1392204575=&response-content-disposition=inline%3B+filename%3DPoor_man_s_word_segmentation_Unsupervise.pdf&Expires=1718617369&Signature=ZBUCO3h1GkLI9eFJ~c0Dw4KKHI-bQeRQHqbrDQtNOD5oIR~kCQq5cPQ1QucuDz1RGUcvqxBtCcERLVKv6YUbi5N~hEnFc6Kf3CytYdz6Vx6nU1bpFGIS7TV9E47XMMJIKn-bT7yGRcKFp4GSoQOu5c2Qh5PKuzQTweLfTcp4AShy81vAtxmaZ6PnOwlQ398ZBKR5D22YnH1XJGaguhVX8n1Pjwpy-qPXu9N0wELmIpThxGCaow5vdq~bE7pKwKK6wz~kDvrQs9ZE74f7ASjPxs7rvX6~fNRohxel7DrjX2PUG2r4GOep0XaZPtId0CCREw8XV5Zh60H5Tzt2US~3MQ__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA); [Çöltekin, 2010](https://coltekin.net/cagri/papers/coltekin2010clin.pdf))
* **Morfessor** ([Creutz and Lagus, 2005](https://tuhat.helsinki.fi/ws/portalfiles/portal/77193625/Creutz05akrr.pdf))
* **Linguistica** ([Goldsmith, 2001](https://watermark.silverchair.com/089120101750300490.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAA2AwggNcBgkqhkiG9w0BBwagggNNMIIDSQIBADCCA0IGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMeSFqpqKpdL2BYYDAAgEQgIIDE48VUYn47i_5Qf5z91LNwMSKOimWTLPiuS39lDFUgDnjHDPZjBhsG8VcVGnTOkWVChJ23f_uz98aH52GkIMqONosUvsRgqbD1mGKsn4mwrWshxG0Qkkh3pd1VY60NecAWotjMYwdha2iMFa9qGx9ytRyAmJC1sIacn-q5oualVk47rpcx_Pvp6wvXbvEv7QjSMQhDB5jUg59ZJ8wZ4Z6607k6GMlVrjUiNLvOnAhhnVHh3LtQu7iZnB59waDHxo9ojzjDFR02xbEfsdrgPhXK5E0PE5NCCeaiA9prNjGOL3wcZZtmVqZQ2BqyxxGgk0I5LP7S5eXHAq6xj9tz1G0t9Nrc6O5vLtcsercKS0AqhLi0AwIOUwlETg613lMLVEGcrmQFHbWTs9p5PIsgbguegXmBRo6TPRU2jI6ijd-VHaEaaE7_S2ry2ffxijwQ2dLWbZYwKdw6_hwRq-6lQc5khqOyoVDfrSsqD2RvF0p_MT2cDD-Mw49u6e1qQDVXlWw_5MHEjL0trjKZ9KA_Uz-A7MrdDCtLHLDVjOXMvaLTZkuIyhAdT_zDkfF3AwdCKvjRa3sveST70rBMpw5vYl2zGOklJGYN2NIv1D52HHfxgt2-5wE2Pm4YZ7kLC3oHRbZjuZcRDCxOTHHoD_CasIfJT9xaaNejT6yYO76wBglJySxPhWMUHbgggJ9k6AABbD7PAE4IQG-XrnTYygdxSDr7cYyZAPql1AG531d5HNU6m0QnHmsk5GgABmaRMf1mxqgeM9NMdD2XxzQa9oORB7V1Fq5QaY2nC8-Xij8TRNJY7Q0M1kpzY89P-LiOhZhEeZERNFFNIfjOzNxRyJHOvEjWh9DwyhckZpO39YY2jZqRHExlQ_Rt1OVQclfvG-u-fY4aeNBT2TUbuTWLIZ8-eCYqM351OpTMmTkF7r9HB6z2RcT2bDi65T9hNL8NczFL19TThFn0lz7mBEGdFmS_SDEz9aiGsQhwDvfceCCihP1ub5FQjdSeeRK8RoscYza2fYgCHL4PKJtxfX4of9WvcV2CgQqf2E); [Lee and Goldsmith, 2016](https://aclanthology.org/N16-3005.pdf))
* **MorphAGram** ([Eskander et al., 2020](https://aclanthology.org/2020.lrec-1.879.pdf))
* **"Square Entropy"** ([Medina-Urrea, 2007](https://link.springer.com/chapter/10.1007/978-3-540-37522-7_13); [Méndez-Cruz, 2016](https://www.sciencedirect.com/science/article/pii/S0167865516302343))
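
As a rough illustration of why LSV looks easy to implement, the sketch below counts, for every word-initial substring, how many distinct letters can follow it in the wordlist, and cuts a word wherever that count forms a local peak. The peak rule is only one of several possible cut criteria in this family and not necessarily the variant we would end up using; the `END` marker and the function names are assumptions made for this example.

```python
from collections import defaultdict

END = "#"  # hypothetical end-of-word marker, counted as one possible successor


def successor_varieties(words):
    """Map each word-initial substring (including "") to its number of distinct successors."""
    successors = defaultdict(set)
    for word in words:
        for i in range(len(word) + 1):
            nxt = word[i] if i < len(word) else END
            successors[word[:i]].add(nxt)
    return {prefix: len(letters) for prefix, letters in successors.items()}


def segment_lsv(word, sv):
    """Cut `word` after every position whose successor variety is a strict local peak."""
    values = [sv.get(word[:i], 0) for i in range(len(word) + 1)]
    cuts = [
        i
        for i in range(1, len(word))
        if values[i] > values[i - 1] and values[i] > values[i + 1]
    ]
    bounds = [0] + cuts + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    wordlist = ["walk", "walks", "walked", "walking",
                "talk", "talks", "talked", "talking"]
    sv = successor_varieties(wordlist)
    for word in ("walked", "talks", "walk"):
        print(word, segment_lsv(word, sv))
```

This version only looks at successors from the left; a predecessor-variety pass from the right would be a natural extension for suffix-heavy data.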

Morfessor and Linguistica are already available as Python packages which seem to be actively maintained, and there is an open Python implementation for [MorphAGram](https://github.com/rnd2110/MorphAGram) as well. The other algorithms seem to be fairly easy to implement.

I am especially interested in MorphAGram and the "Square Entropy" methods, since they are the only ones I could find that were actually evaluated on small wordlists with ~1,000 items. The other methods listed above are frequently mentioned in the literature and seem to be fairly established, and Morfessor and Linguistica have the obvious advantage of already coming as Python packages. There are some other methods that could be interesting later on, but I would focus on these first.
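
For the comparison itself, one option would be boundary-level precision, recall, and F1 against gold segmentations. The sketch below assumes that gold and predicted segmentations are given as parallel lists of segment lists; that format and the function names are assumptions for this example, and the final choice of metric is still open.

```python
def boundaries(segments):
    """Return the set of internal boundary offsets of a segmented word."""
    cuts = set()
    position = 0
    for segment in segments[:-1]:
        position += len(segment)
        cuts.add(position)
    return cuts


def boundary_prf(gold, predicted):
    """Micro-averaged boundary precision, recall, and F1 over parallel segmentations."""
    tp = fp = fn = 0
    for gold_segments, pred_segments in zip(gold, predicted):
        gold_cuts = boundaries(gold_segments)
        pred_cuts = boundaries(pred_segments)
        tp += len(gold_cuts & pred_cuts)
        fp += len(pred_cuts - gold_cuts)
        fn += len(gold_cuts - pred_cuts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [["walk", "ed"], ["talk", "s"]]
    predicted = [["walk", "ed"], ["tal", "ks"]]
    print(boundary_prf(gold, predicted))
```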
