This work is dedicated to building a morphological analyzer for the Bezhta language (< Tsezic < Avar-Andic-Tsezic < Nakh-Dagestan; Glottolog: bezh1248). This repository contains a prototype of the analyzer. It is part of a larger project by students of the School of Linguistics and the Linguistic Convergence Laboratory at NRU HSE that aims to provide digital tools for endangered languages.
The project is distributed under the GNU General Public License v3.0.
The parser follows the descriptions of Bezhta Proper in (Comrie et al., 2015) and (Madieva, 1965), with the lexicon gathered from the (Khalilov, 2015) dictionary. The digitized version of the dictionary is available at bezhta_dict.
For evaluation, I use the Bezhta translations of the Gospel of Luke and the Book of Proverbs, a text from Madieva's grammar (1964), and two annotated texts. The texts are available in the corpora directory.
The project requires lexd and hfst. You can install them with the following commands:
curl -sS https://apertium.projectjj.com/apt/install-nightly.sh | sudo bash
apt install lexd
apt install hfst
Build the analyzer:
make
Analyze a word:
echo 'соралила' | hfst-lookup bezhta.analyzer.hfst
The transliterator converts Bezhta words from Cyrillic to Latin script. Build it:
make cy2lat.transliterator.disam.hfst
Transliterate a word:
echo 'соралила' | hfst-lookup cy2lat.transliterator.disam.hfst
Build the transliterated analyzer:
make bezhta.tr.analyzer.hfst
Look up a word in Latin script:
echo 'soralila' | hfst-lookup bezhta.tr.analyzer.hfst
The segmenter identifies the morpheme boundaries in the input word. Build it:
make bezhta.segm.hfst
Segment a word:
echo 'нисойо' | hfst-lookup bezhta.segm.hfst
Result:
нисойо нисо>йо
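The lookup output above can be post-processed in a script. The following is a minimal Python sketch, assuming the default hfst-lookup output format (input, output, and an optional weight, separated by tabs) and assuming that `>` marks a morpheme boundary, as in the result shown above. The function names `parse_lookup_output` and `split_morphemes` are hypothetical and not part of this repository; the weight in the sample line is also illustrative.

```python
def parse_lookup_output(text):
    """Collect hfst-lookup results as {input word: [outputs]}.

    Assumes tab-separated lines: input, output, optional weight.
    """
    results = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # Blank lines separate the results for different inputs.
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            # Skip anything that is not an input/output pair.
            continue
        word, output = fields[0], fields[1]
        results.setdefault(word, []).append(output)
    return results


def split_morphemes(segmented, boundary=">"):
    """Split a segmented form into morphemes on the boundary symbol."""
    return segmented.split(boundary)


# Sample line modelled on the segmenter result above (weight is illustrative).
sample = "нисойо\tнисо>йо\t0.000000\n"
for word, outputs in parse_lookup_output(sample).items():
    for output in outputs:
        print(word, split_morphemes(output))
```

The same parser should work for the analyzer and transliterator output, since only the second column differs.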
Analyzer:
make bezhta.analyzer.hfstol
mv bezhta.analyzer.hfstol coverage
cd coverage
make check-coverage
Additionally, make-check-unrecog can be used to get a list of unrecognized tokens. Note that the names of all text files should start with "text-".
Current performance: ~75% naive coverage
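Naive coverage is usually understood as the share of tokens that receive at least one analysis, regardless of whether that analysis is correct. The sketch below shows one way to compute it from hfst-lookup output, assuming that unrecognized tokens are reported with an output ending in `+?` and that results for different tokens are separated by blank lines. The `check-coverage` target in this repository may compute the figure differently. The function name `naive_coverage` and the `<TAG>` placeholder in the sample are hypothetical.

```python
def naive_coverage(lookup_output):
    """Return (coverage, unrecognized tokens) for hfst-lookup output.

    Assumes one blank-line-separated block per input token and
    an output ending in "+?" for tokens without an analysis.
    """
    total = 0
    recognized = 0
    unrecognized = []
    for block in lookup_output.strip().split("\n\n"):
        lines = [line for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        total += 1
        fields = lines[0].split("\t")
        token = fields[0]
        if len(fields) > 1 and fields[1].endswith("+?"):
            unrecognized.append(token)
        else:
            recognized += 1
    coverage = recognized / total if total else 0.0
    return coverage, unrecognized


# Illustrative input: "<TAG>" is a placeholder, not a real analysis.
sample = (
    "соралила\tсоралила<TAG>\t0.000000\n"
    "\n"
    "ххх\tххх+?\tinf\n"
)
coverage, unknown = naive_coverage(sample)
print(f"coverage: {coverage:.0%}, unrecognized: {unknown}")
```

Counting per block rather than per line keeps a token with several analyses from being counted more than once.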
Transliterator:
make bezhta.tr.analyzer.hfst
mv bezhta.tr.analyzer.hfst transliterator
make check-coverage
Note: some symbols may be recognized incorrectly, so I recommend using transliterator_coverage.ipynb instead.
make bezhta.analyzer.hfstol
mv bezhta.analyzer.hfstol accuracy
cd accuracy
To analyze texts with the parser, use:
hfst-proc bezhta.analyzer.hfstol text-annotated-1.txt > FILENAME-1.txt
hfst-proc bezhta.analyzer.hfstol text-annotated-2.txt > FILENAME-2.txt
Then compute accuracy:
python3 accuracy.py FILENAME-1.txt text-1-gold.txt
python3 accuracy.py FILENAME-2.txt text-2-gold.txt
cd guesser
make bezhta.guesser.hfst
Guess a token:
echo 'войъис' | hfst-lookup bezhta.guesser.hfst
For evaluation, see guesser_evaluation.ipynb.