Chronos

Wiki
PRD
Nestor Adam Andrew Dante

Quechua Ingestion Pipeline

Pulls two public Hugging Face corpora (a Spanish↔Quechua parallel corpus and a Quechua monolingual corpus), cleans and tokenises the text, then extracts a structured lexicon: lemmas with frequency ranks, estimated CEFR difficulty, semantic field tags, and Spanish gloss candidates. Also exports aligned sentence pairs.
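To make the "tokenise, then rank by frequency" step concrete, here is a minimal sketch. It is not the pipeline's actual code: `tokenise`, `frequency_ranks`, and the token regex are hypothetical, and it ranks surface tokens rather than lemmas (lemmatisation is left out entirely).

```python
import re
from collections import Counter

# Hypothetical token pattern: runs of letters, optionally joined by an
# apostrophe so that ejectives such as "t'anta" stay in one token.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def tokenise(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return TOKEN_RE.findall(sentence.lower())


def frequency_ranks(sentences):
    """Map each token to its frequency rank (1 = most frequent)."""
    counts = Counter(tok for s in sentences for tok in tokenise(s))
    # most_common() sorts by count, descending; ties keep first-seen order.
    return {tok: rank
            for rank, (tok, _) in enumerate(counts.most_common(), start=1)}


if __name__ == "__main__":
    sample = ["Allin p'unchay", "allin tuta", "Allin p'unchay"]
    print(frequency_ranks(sample))
```

A real run would feed in the cleaned corpus sentences instead of the three-line sample, and would likely normalise tokens to lemmas before counting.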

Setup

pip install -r requirements.txt

Run:

python -m pipeline.run        # full run (~20 min, downloads ~500 MB on first run)
python -m pipeline.run --dev  # fast dev run (5K parallel / 10K mono sentences)
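
One way a `--dev` switch like the one above could cap the corpus sizes is sketched below. This is an assumption about the design, not the contents of `pipeline/run.py`: `parse_args`, `sentence_limits`, `take`, and `DEV_LIMITS` are hypothetical names, and only the 5K / 10K figures come from this README.

```python
import argparse
from itertools import islice

# Sentence caps quoted in the README for the dev run.
DEV_LIMITS = {"parallel": 5_000, "mono": 10_000}


def parse_args(argv=None):
    """Parse the command line; --dev is the only flag sketched here."""
    parser = argparse.ArgumentParser(prog="pipeline.run")
    parser.add_argument("--dev", action="store_true",
                        help="process a small sample of each corpus")
    return parser.parse_args(argv)


def sentence_limits(dev):
    """Return per-corpus caps; None means 'no cap' (full run)."""
    if dev:
        return dict(DEV_LIMITS)
    return {"parallel": None, "mono": None}


def take(rows, limit):
    """Yield at most `limit` rows, or all rows when limit is None."""
    return rows if limit is None else islice(rows, limit)


if __name__ == "__main__":
    args = parse_args()
    print(sentence_limits(args.dev))
```

Capping with `islice` keeps the dev run lazy, so a streamed corpus would not need to be fully downloaded or held in memory before truncation.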

Output:

quechua/vocabulary/extracted/   # one JSON file per frequency band
quechua/examples/parallel.json  # aligned sentence pairs
pipeline/reports/               # run stats and filter counts
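
For downstream use, the per-band files can be loaded with the standard library alone. The sketch below assumes only that the directory holds `*.json` files; the file names and the structure inside each file are not documented here, so `load_bands` (a hypothetical helper) returns whatever each file contains, keyed by file name.

```python
import json
from pathlib import Path


def load_bands(directory="quechua/vocabulary/extracted"):
    """Load every JSON file in `directory`, keyed by file name without extension."""
    bands = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            bands[path.stem] = json.load(fh)
    return bands


if __name__ == "__main__":
    for name, content in load_bands().items():
        # The schema is unknown here, so only report the number of top-level items.
        print(name, len(content))
```

Reading with `encoding="utf-8"` matters on platforms whose default encoding is not UTF-8, since the lexicon may contain non-ASCII characters.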

