While implementing three methods for automated morpheme segmentation for which no implementations have been published so far (Benden 2005, Bordag 2008, Kirschenbaum 2013), I noticed that it is not possible to exactly mirror the workflow of the latter two, since they make use of word frequencies in the corpus the method is applied to. This is mostly done to assess semantic similarity by means of co-occurrence distributions, a measure which is obviously not available for wordlists.
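For illustration, here is a minimal sketch (plain Python, with invented names and a toy corpus, not taken from any of the cited papers) of the kind of corpus-based measure these workflows rely on: co-occurrence count vectors within a symmetric window, compared by cosine similarity. This is exactly what a bare wordlist cannot provide, since there is no running text to count from.

```python
from collections import Counter, defaultdict
from math import sqrt

def cooccurrence_vectors(tokens, window=2):
    """Map each word to a Counter of words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

corpus = "the dog chased the cat and the dog barked".split()  # toy corpus
vecs = cooccurrence_vectors(corpus)
print(cosine(vecs["dog"], vecs["cat"]))
```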
Since including this semantic information seems to be a crucial part of both workflows, I suggest we find a way to emulate it. The most obvious choice, in my opinion, would be to simply use pre-trained word embeddings (e.g. GloVe or Word2Vec), since they have proven useful for a wide range of NLP tasks and are conceptually the same kind of object as the distributional vectors obtained by the models in the study. An alternative approach could be to quantify the similarity between two concepts by means of colexification, essentially exploiting CLICS.
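As a hedged sketch of the embedding variant, the snippet below uses gensim's downloader to load pre-trained GloVe vectors and scores concept pairs by cosine similarity. It assumes the concepts are glossed by single English words present in the embedding vocabulary; the model name and the concept list are just examples, not a proposal for the actual mapping from wordlist concepts to gloss tokens.

```python
import gensim.downloader as api

# Pre-trained GloVe vectors (illustrative choice of model).
model = api.load("glove-wiki-gigaword-50")

# Hypothetical concept glosses; in practice these would come from the wordlist.
concepts = ["hand", "arm", "tree", "forest"]

for a in concepts:
    for b in concepts:
        if a < b and a in model and b in model:
            print(f"{a:>8} ~ {b:<8} {model.similarity(a, b):.3f}")
```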
Furthermore, Bordag (2008) employs a compound splitter that also relies on the frequencies of the individual constituents. I have no idea so far how this could be emulated in the context of (small) wordlists.
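One possible, admittedly naive emulation would be to score candidate split points by the (pseudo-)frequencies of constituents attested in the wordlist itself. The sketch below is not Bordag's splitter, just a placeholder showing where external frequency information would have to be plugged in; with a type-only wordlist all counts degenerate to 1, which is exactly the problem.

```python
from math import sqrt

def best_split(word, counts, min_len=3):
    """Return the best-scoring (score, left, right) split, or None.

    Each split point is scored by the geometric mean of the constituents'
    counts; `counts` would have to come from some frequency source.
    """
    candidates = []
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if left in counts and right in counts:
            score = sqrt(counts[left] * counts[right])
            candidates.append((score, left, right))
    return max(candidates) if candidates else None

counts = {"haus": 10, "tuer": 7, "haustuer": 2}  # toy pseudo-frequencies
print(best_split("haustuer", counts))  # -> (8.36..., 'haus', 'tuer')
```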
Feel free to discuss and drop your ideas! (especially @LinguList of course)