
Preprocessing Endangered Languages

Adam Nalley edited this page May 6, 2026 · 1 revision

Widely spoken languages such as English, Spanish, and French already have lexicons readily available for use in any language learning app. Endangered and indigenous languages often do not, so extensive work is needed to prepare even a subset of words, and a full lexicon can demand a prohibitive amount of labor. Our software would let linguistic scholars rapidly develop a lexicon by supplying samples of the language along with an IPA (International Phonetic Alphabet) representation of it. The resulting lexicon could then be used in apps like Chronos.

CS194 - Preprocessing Idea

What is a lexicon?

A lexicon is a collection, usually arranged alphabetically, of the words and terms used in a specific language. You can think of it as a dictionary that serves as a list of all the words in a language rather than a list of definitions. A language learning app needs words to teach, and those words would be stored in some type of table. Each row would hold the word in the target language, its translation in the primary learning environment (in this case, English), and possibly the word in IPA format for use in audio playback. IPA is a standardized notation, used primarily by linguists, for transcribing the sounds of a language accurately so that words can be pronounced properly.
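As a rough illustration of the table described above, one row could be modeled as follows. This is only a sketch: the class name `LexiconRow`, the field names, and the helper `as_word_list` are hypothetical, and the sample rows are invented placeholders rather than words from any real language.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexiconRow:
    """One row of the lexicon table (field names are illustrative)."""
    term: str                  # the word as written in the target language
    english: str               # its translation in the learning environment
    ipa: Optional[str] = None  # IPA transcription for audio playback, if known


def as_word_list(rows):
    """Return just the terms, sorted: the 'list of all words' view."""
    return sorted(row.term for row in rows)


# Invented placeholder rows; these are not words from any real language.
rows = [
    LexiconRow(term="tala", english="water", ipa="ˈta.la"),
    LexiconRow(term="miko", english="house"),
]

print(as_word_list(rows))
```

Making the IPA field optional reflects the "possibly" in the description: a row can be stored before a transcription is available.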

Process

  1. The program would take as input text or audio segments containing vocabulary with English translations, along with a phonology table in IPA that lists all the sounds of the language.
  2. Where a word has many spellings, the program would derive a representative ("average") spelling from the variants; outliers would be either discarded or saved as synonyms.
  3. A threshold could be used to control the granularity of this spelling comparison, that is, how far a variant may differ before it is treated as an outlier.
  4. The lexicon would be a map data structure in which each entry has the form { languageTerm : (englishTranslation, ipaRepresentation) }, saved as a portable file.
  5. The lexicon would then be loaded into an app like Chronos, where it would back different features.
  6. The languageTerm would then be displayed on screen along with its associated englishTranslation, while an LLM like Claude would read the ipaRepresentation for that languageTerm aloud.
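Steps 2 through 4 could be sketched along the following lines. Several things here are assumptions rather than part of the proposal: the "average" spelling is taken to be the medoid (the attested variant most similar to all the others), similarity is measured with the standard library's `difflib.SequenceMatcher`, the default threshold of 0.6 is arbitrary, and JSON is used as the portable file format. Because JSON has no tuple type, each entry is stored as a small object instead of a pair. All function names and sample words are hypothetical placeholders.

```python
import json
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity between two spellings, from 0.0 (unrelated) to 1.0 (identical)."""
    return SequenceMatcher(None, a, b).ratio()


def consolidate(spellings, threshold=0.6):
    """Choose a representative spelling and separate out the outliers.

    Assumption: the "average" is the medoid, the attested spelling with the
    highest total similarity to the other variants. Variants whose similarity
    to it falls below `threshold` are returned as outliers.
    """
    variants = sorted(set(spellings))  # sorted so ties resolve deterministically
    canonical = max(variants, key=lambda s: sum(similarity(s, t) for t in variants))
    outliers = [s for s in variants
                if s != canonical and similarity(canonical, s) < threshold]
    return canonical, outliers


def build_lexicon(samples, threshold=0.6):
    """Build {languageTerm: {...}} from (spellings, english, ipa) samples."""
    lexicon = {}
    for spellings, english, ipa in samples:
        term, outliers = consolidate(spellings, threshold)
        lexicon[term] = {
            "englishTranslation": english,
            "ipaRepresentation": ipa,
            "synonyms": outliers,  # here outliers are kept rather than discarded
        }
    return lexicon


def to_portable_json(lexicon):
    """Serialize the lexicon; ensure_ascii=False keeps IPA symbols readable."""
    return json.dumps(lexicon, ensure_ascii=False, indent=2, sort_keys=True)


# Invented placeholder data; not words from any real language.
samples = [
    (["tala", "talla", "tahla", "kwee"], "water", "ˈta.la"),
    (["miko"], "house", "ˈmi.ko"),
]
lexicon = build_lexicon(samples)
print(to_portable_json(lexicon))
```

Raising `threshold` makes the grouping stricter, so more variants are split off as outliers; lowering it folds more variants into the representative spelling. The string returned by `to_portable_json` can be written to a UTF-8 file and loaded by any app with a JSON parser.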

Impact

This portion of the project would benefit researchers working on endangered and indigenous languages, whose work is made difficult by a lack of resources, a lack of fluent speakers, and incomplete data. The program would help fill in those gaps and make the job more efficient. It would also benefit tribal communities seeking to strengthen their cultural and language revitalization efforts, since combining digital technology with language revitalization can make the process less laborious overall. Many language learning materials are still created in print form, and extensive work would be needed to create digital environments comparable to those of language learning apps like Duolingo.
