Pulls two public Hugging Face corpora (a Spanish↔Quechua parallel corpus and a Quechua monolingual corpus), cleans and tokenises the text, then extracts a structured lexicon: lemmas with frequency ranks, estimated CEFR difficulty, semantic field tags, and Spanish gloss candidates. Also exports aligned sentence pairs.
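To make the frequency-rank step concrete, here is a minimal sketch of how ranks could be derived from cleaned sentences. It is not the pipeline's actual code: `tokenise` and `frequency_ranks` are hypothetical names, the tokeniser is a simple regex that keeps word-internal apostrophes (as in `p'unchay`), and surface forms stand in for lemmas because no lemmatiser is assumed.

```python
import re
from collections import Counter

# Assumption: a word is a run of letters, optionally joined by apostrophes.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def tokenise(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return TOKEN_RE.findall(sentence.lower())


def frequency_ranks(sentences):
    """Return one record per distinct form, ranked by descending count."""
    counts = Counter(tok for s in sentences for tok in tokenise(s))
    # Ties are broken alphabetically so the ranking is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        {"form": form, "count": n, "rank": rank}
        for rank, (form, n) in enumerate(ordered, start=1)
    ]


if __name__ == "__main__":
    for row in frequency_ranks(["Allin p'unchay", "allin tuta", "Allin"]):
        print(row)
```

Ranks produced this way could then be cut into frequency bands; the band boundaries the pipeline actually uses are not stated here.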
Install:
pip install -r requirements.txt
Run:
python -m pipeline.run # full run (~20 min, downloads ~500 MB on first run)
python -m pipeline.run --dev # fast dev run (5K parallel / 10K mono sentences)
Output:
quechua/vocabulary/extracted/ # one JSON file per frequency band
quechua/examples/parallel.json # aligned sentence pairs
pipeline/reports/ # run stats and filter counts
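A quick way to sanity-check a run is to count the entries in each exported JSON file. The sketch below assumes only that every file contains valid JSON; the internal schema of the lexicon files is not documented here, so the helper (`summarise_json_dir`, a hypothetical name) reports top-level sizes and nothing more.

```python
import json
from pathlib import Path


def summarise_json_dir(directory):
    """Map each *.json file name in `directory` to its top-level entry count."""
    summary = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Lists and objects report their length; any other JSON value counts as 1.
        summary[path.name] = len(data) if isinstance(data, (list, dict)) else 1
    return summary


if __name__ == "__main__":
    # Path taken from the output list above; run from the repository root.
    for name, size in summarise_json_dir("quechua/vocabulary/extracted").items():
        print(f"{name}: {size} entries")
```

If the directory does not exist yet, the helper returns an empty mapping rather than raising, so it is safe to run before the first full pipeline run.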