A production-ready, comprehensive Python package for text normalization across multiple Indian languages, built on Weighted Finite-State Transducers (WFST) using Pynini.
Note: This project is an extension of NVIDIA NeMo Text Processing, focused on providing comprehensive support for Indic languages. It maintains the same architecture and a similar API while adding specialized language modules for 19 Indian languages.
indic-text-normalization is a production-ready, WFST-based library that provides deterministic, low-latency, and explainable text normalization for TTS, ASR, and NLP pipelines, converting semiotic entities (numbers, dates, currency) into spoken form across 19 Indian languages.
- Hindi (hi) - हिन्दी
- Bengali (bn) - বাংলা
- Marathi (mr) - मराठी
- Telugu (te) - తెలుగు
- Kannada (kn) - ಕನ್ನಡ
- Bhojpuri (bho) - भोजपुरी
- Magahi (mag) - मगही
- Chhattisgarhi (hne) - छत्तीसगढ़ी
- Maithili (mai) - मैथिली
- Assamese (as) - অসমীয়া
- Bodo (brx) - बड़ो
- Dogri (doi) - डोगरी
- Gujarati (gu) - ગુજરાતી
- Malayalam (ml) - മലയാളം
- Punjabi (pa) - ਪੰਜਾਬੀ
- Tamil (ta) - தமிழ்
- English (Indian) (en) - English
- Nepali (ne) - नेपाली
- Sanskrit (sa) - संस्कृत
Each language module supports normalization of:
- Cardinal Numbers - Both native script and Arabic numerals
- Ordinal Numbers - First, second, third, etc.
- Decimal Numbers - Decimal point verbalization
- Fractions - Proper fraction handling
- Date - Multiple date formats (DD/MM/YYYY, YYYY-MM-DD, etc.)
- Time - 12-hour and 24-hour formats
- Telephone Numbers - Digit-by-digit verbalization
- Measurements - Units with proper verbalization
- Money - Currency symbols and amounts
- Electronic - Email addresses, URLs, hashtags
- Roman Numerals - Conversion to spoken form
- Abbreviations - Common abbreviations expansion
- Whitelist - Custom word mappings
- Python 3.10, 3.11, or 3.12
- For macOS: Homebrew (for installing OpenFST)
uv is a fast Python package installer. This is the recommended method for macOS users.
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# or with Homebrew: brew install uv
# Install OpenFST (required for pynini)
brew install openfst
# Create virtual environment
uv venv
source .venv/bin/activate # On macOS/Linux
# Install pynini with proper compiler flags
CPLUS_INCLUDE_PATH=/opt/homebrew/include LIBRARY_PATH=/opt/homebrew/lib uv pip install -e .Note: The project uses pynini>=2.1.6.post1 which is compatible with the latest OpenFST (1.8.4+).
conda create -n indic_tn python=3.10
conda activate indic_tn
conda install -c conda-forge pynini
pip install -r requirements.txt# Install OpenFST
brew install openfst
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install pynini with proper compiler flags
CPLUS_INCLUDE_PATH=/opt/homebrew/include LIBRARY_PATH=/opt/homebrew/lib pip install pynini
# Install all dependencies
pip install -r requirements.txtmacOS ARM (M1/M2/M3) Users: If you encounter compilation errors with pynini, ensure:
- OpenFST is installed via Homebrew:
brew install openfst - Use the compiler flags when installing pynini:
CPLUS_INCLUDE_PATH=/opt/homebrew/include LIBRARY_PATH=/opt/homebrew/lib
Version Compatibility: This package requires pynini>=2.1.7 which is compatible with OpenFST 1.8.4+. Older versions of pynini (2.1.6.post1 and below) may have compatibility issues with newer OpenFST installations.
from indic_text_normalization import Normalizer
# Initialize normalizer for a specific language
normalizer = Normalizer(input_case='cased', lang='hi')
# Normalize text
text = "मैं 25 साल का हूं और मेरा फोन नंबर 9876543210 है।"
normalized = normalizer.normalize(text)
print(normalized)
# Output: मैं पच्चीस साल का हूं और मेरा फोन नंबर नौ आठ सात छह पांच चार तीन दो एक शून्य है।Tamil:
normalizer = Normalizer(input_case='cased', lang='ta')
text = "எனக்கு 1,234 ரூபாய் வேண்டும்"
normalized = normalizer.normalize(text)Malayalam:
normalizer = Normalizer(input_case='cased', lang='ml')
text = "സമയം 10:30 ആണ്"
normalized = normalizer.normalize(text)Gujarati:
normalizer = Normalizer(input_case='cased', lang='gu')
text = "તારીખ 15/08/2024 છે"
normalized = normalizer.normalize(text)indic-text-normalization/
├── text_normalization/
│ ├── hi/ # Hindi
│ ├── ta/ # Tamil
│ ├── ml/ # Malayalam
│ ├── gu/ # Gujarati
│ ├── as/ # Assamese
│ ├── mai/ # Maithili
│ ├── mag/ # Magadhi
│ ├── hne/ # Chhattisgarhi
│ ├── brx/ # Bodo
│ ├── doi/ # Dogri
│ ├── pa/ # Punjabi
│ └── sa/ # Sanskrit
├── tests/ # Test notebooks and scripts
└── README.md
Each language directory contains:
taggers/- FST-based taggers for each semiotic classverbalizers/- FST-based verbalizers for converting tagged text to spoken formdata/- TSV files with language-specific terminologytokenize_and_classify.py- Main classification logic
Tests are available in the tests/ directory for each language:
python tests/run_tests.py --lang hi
python tests/run_tests.py --lang ta
python tests/run_tests.py --lang mr
python tests/run_tests.py --lang te
python tests/run_tests.py --lang kn
python tests/run_tests.py --lang gu
python tests/run_tests.py --lang as
python tests/run_tests.py --lang brx
python tests/run_tests.py --lang doi
python tests/run_tests.py --lang pu
python tests/run_tests.py --lang mai
python tests/run_tests.py --lang mag
python tests/run_tests.py --lang hne
# ... etc.- Create a new directory under
text_normalization/with the language code - Replicate the structure from an existing language (e.g., Hindi)
- Update all TSV data files with native terminology
- Modify taggers and verbalizers for language-specific rules
- Update
normalizer.pyto include the new language - Create test cases
Contributions are welcome! Please ensure:
- All data uses authentic native terminology
- Code follows existing patterns and structure
- Test cases are provided for new features
- Documentation is updated
The normalization pipeline uses Weighted Finite-State Transducers (WFST) implemented with Pynini:
- Tokenization - Input text is tokenized
- Classification - Each token is classified into semiotic classes
- Verbalization - Classified tokens are converted to spoken form
- Post-processing - Final cleanup and formatting
- Arabic Digits (0-9) - Converted to native script or verbalized directly
- Native Script Digits - Devanagari (०-९), Tamil (௦-௯), etc.
- Mixed Input - Handles both digit systems in the same text
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
This project is built upon NVIDIA NeMo Text Processing, extending it with comprehensive support for 19 Indian languages. We are grateful to NVIDIA and Google for their foundational work on WFST-based text normalization.
For questions, bug reports, or feature requests, please open an issue on the GitHub repository.