I build data pipelines and storage systems for large-scale speech and NLP datasets, specializing in low-resource African languages. Data contributor on four published research papers. Currently pursuing the GCP Associate Cloud Engineer certification.
Current focus
- Automated ETL pipeline processing 12,000+ hours of speech data across 51 African languages
- Scaling text data preparation for LLM training from 40 to 75 African languages
Stack: Python · SQL · GCP (BigQuery, Cloud Storage) · HuggingFace Datasets · YAML · pandas · Docker
Domain: Speech data · NLP · Low-resource African languages · Audio processing · ASR
| # | Paper | Venue | Year |
|---|---|---|---|
| 01 | How Much Speech Data Is Necessary for ASR in African Languages? | arXiv | 2025 |
| 02 | Sunflower: Expanding Coverage of African Languages in LLMs | arXiv | 2025 |
| 03 | SALT-31: A Machine Translation Benchmark for 31 Ugandan Languages | OpenReview | 2026 |
| 04 | Noise Mapping and Ambient Sound Recordings in Urban Uganda | ResearchGate | 2026 |


