Indic Text Normalization

A production-ready, comprehensive Python package for text normalization across multiple Indian languages, built on Weighted Finite-State Transducers (WFST) using Pynini.

Note: This project is an extension of NVIDIA NeMo Text Processing, focused on providing comprehensive support for Indic languages. It maintains the same architecture and a similar API while adding specialized language modules for 19 Indian languages.

Overview

indic-text-normalization is a production-ready, WFST-based library that provides deterministic, low-latency, and explainable text normalization for TTS, ASR, and NLP pipelines, converting semiotic entities (numbers, dates, currency) into spoken form across 19 Indian languages.

Supported Languages

Hindi (hi) - हिन्दी
Bengali (bn) - বাংলা
Marathi (mr) - मराठी
Telugu (te) - తెలుగు
Kannada (kn) - ಕನ್ನಡ
Bhojpuri (bho) - भोजपुरी
Magahi (mag) - मगही
Chhattisgarhi (hne) - छत्तीसगढ़ी
Maithili (mai) - मैथिली
Assamese (as) - অসমীয়া
Bodo (brx) - बड़ो
Dogri (doi) - डोगरी
Gujarati (gu) - ગુજરાતી
Malayalam (ml) - മലയാളം
Punjabi (pa) - ਪੰਜਾਬੀ
Tamil (ta) - தமிழ்
English (Indian) (en) - English
Nepali (ne) - नेपाली
Sanskrit (sa) - संस्कृत

Features

Semiotic Classes

Each language module supports normalization of:

Cardinal Numbers - Both native script and Arabic numerals
Ordinal Numbers - First, second, third, etc.
Decimal Numbers - Decimal point verbalization
Fractions - Proper fraction handling
Date - Multiple date formats (DD/MM/YYYY, YYYY-MM-DD, etc.)
Time - 12-hour and 24-hour formats
Telephone Numbers - Digit-by-digit verbalization
Measurements - Units with proper verbalization
Money - Currency symbols and amounts
Electronic - Email addresses, URLs, hashtags
Roman Numerals - Conversion to spoken form
Abbreviations - Common abbreviations expansion
Whitelist - Custom word mappings

Installation

Prerequisites

Python 3.10, 3.11, or 3.12
For macOS: Homebrew (for installing OpenFST)

Method 1: Using `uv` (Recommended for macOS)

uv is a fast Python package installer. This is the recommended method for macOS users.

# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# or with Homebrew: brew install uv

# Install OpenFST (required for pynini)
brew install openfst

# Create virtual environment
uv venv
source .venv/bin/activate  # On macOS/Linux

# Install pynini with proper compiler flags
CPLUS_INCLUDE_PATH=/opt/homebrew/include LIBRARY_PATH=/opt/homebrew/lib uv pip install -e .

Note: The project uses pynini>=2.1.6.post1 which is compatible with the latest OpenFST (1.8.4+).

Method 2: Using Conda

conda create -n indic_tn python=3.10
conda activate indic_tn
conda install -c conda-forge pynini
pip install -r requirements.txt

Method 3: Using pip (macOS)

# Install OpenFST
brew install openfst

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install pynini with proper compiler flags
CPLUS_INCLUDE_PATH=/opt/homebrew/include LIBRARY_PATH=/opt/homebrew/lib pip install pynini

# Install all dependencies
pip install -r requirements.txt

Troubleshooting

macOS ARM (M1/M2/M3) Users: If you encounter compilation errors with pynini, ensure:

OpenFST is installed via Homebrew: brew install openfst
Use the compiler flags when installing pynini: CPLUS_INCLUDE_PATH=/opt/homebrew/include LIBRARY_PATH=/opt/homebrew/lib

Version Compatibility: This package requires pynini>=2.1.7 which is compatible with OpenFST 1.8.4+. Older versions of pynini (2.1.6.post1 and below) may have compatibility issues with newer OpenFST installations.

Usage

Basic Usage

from indic_text_normalization import Normalizer

# Initialize normalizer for a specific language
normalizer = Normalizer(input_case='cased', lang='hi')

# Normalize text
text = "मैं 25 साल का हूं और मेरा फोन नंबर 9876543210 है।"
normalized = normalizer.normalize(text)
print(normalized)
# Output: मैं पच्चीस साल का हूं और मेरा फोन नंबर नौ आठ सात छह पांच चार तीन दो एक शून्य है।

Language-Specific Examples

Tamil:

normalizer = Normalizer(input_case='cased', lang='ta')
text = "எனக்கு 1,234 ரூபாய் வேண்டும்"
normalized = normalizer.normalize(text)

Malayalam:

normalizer = Normalizer(input_case='cased', lang='ml')
text = "സമയം 10:30 ആണ്"
normalized = normalizer.normalize(text)

Gujarati:

normalizer = Normalizer(input_case='cased', lang='gu')
text = "તારીખ 15/08/2024 છે"
normalized = normalizer.normalize(text)

Project Structure

indic-text-normalization/
├── text_normalization/
│   ├── hi/          # Hindi
│   ├── ta/          # Tamil
│   ├── ml/          # Malayalam
│   ├── gu/          # Gujarati
│   ├── as/          # Assamese
│   ├── mai/         # Maithili
│   ├── mag/         # Magadhi
│   ├── hne/         # Chhattisgarhi
│   ├── brx/         # Bodo
│   ├── doi/         # Dogri
│   ├── pa/          # Punjabi
│   └── sa/          # Sanskrit
├── tests/           # Test notebooks and scripts
└── README.md

Each language directory contains:

taggers/ - FST-based taggers for each semiotic class
verbalizers/ - FST-based verbalizers for converting tagged text to spoken form
data/ - TSV files with language-specific terminology
tokenize_and_classify.py - Main classification logic

Testing

Tests are available in the tests/ directory for each language:

python tests/run_tests.py --lang hi
python tests/run_tests.py --lang ta
python tests/run_tests.py --lang mr
python tests/run_tests.py --lang te
python tests/run_tests.py --lang kn
python tests/run_tests.py --lang gu
python tests/run_tests.py --lang as
python tests/run_tests.py --lang brx
python tests/run_tests.py --lang doi
python tests/run_tests.py --lang pu
python tests/run_tests.py --lang mai
python tests/run_tests.py --lang mag
python tests/run_tests.py --lang hne
# ... etc.

Development

Adding a New Language

Create a new directory under text_normalization/ with the language code
Replicate the structure from an existing language (e.g., Hindi)
Update all TSV data files with native terminology
Modify taggers and verbalizers for language-specific rules
Update normalizer.py to include the new language
Create test cases

Contributing

Contributions are welcome! Please ensure:

All data uses authentic native terminology
Code follows existing patterns and structure
Test cases are provided for new features
Documentation is updated

Technical Details

WFST Architecture

The normalization pipeline uses Weighted Finite-State Transducers (WFST) implemented with Pynini:

Tokenization - Input text is tokenized
Classification - Each token is classified into semiotic classes
Verbalization - Classified tokens are converted to spoken form
Post-processing - Final cleanup and formatting

Digit Handling

Arabic Digits (0-9) - Converted to native script or verbalized directly
Native Script Digits - Devanagari (०-९), Tamil (௦-௯), etc.
Mixed Input - Handles both digit systems in the same text

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.

Acknowledgments

This project is built upon NVIDIA NeMo Text Processing, extending it with comprehensive support for 19 Indian languages. We are grateful to NVIDIA and Google for their foundational work on WFST-based text normalization.

Support

For questions, bug reports, or feature requests, please open an issue on the GitHub repository.

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
data/test_cases		data/test_cases
indic_text_normalization		indic_text_normalization
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
NOTICE		NOTICE
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Indic Text Normalization

Overview

Supported Languages

Features

Semiotic Classes

Installation

Prerequisites

Method 1: Using `uv` (Recommended for macOS)

Method 2: Using Conda

Method 3: Using pip (macOS)

Troubleshooting

Usage

Basic Usage

Language-Specific Examples

Project Structure

Testing

Development

Adding a New Language

Contributing

Technical Details

WFST Architecture

Digit Handling

License

Acknowledgments

Support

About

Uh oh!

Releases

Packages

Contributors 3

Languages

License

Kenpath/indic-text-normalization

Folders and files

Latest commit

History

Repository files navigation

Indic Text Normalization

Overview

Supported Languages

Features

Semiotic Classes

Installation

Prerequisites

Method 1: Using uv (Recommended for macOS)

Method 2: Using Conda

Method 3: Using pip (macOS)

Troubleshooting

Usage

Basic Usage

Language-Specific Examples

Project Structure

Testing

Development

Adding a New Language

Contributing

Technical Details

WFST Architecture

Digit Handling

License

Acknowledgments

Support

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Method 1: Using `uv` (Recommended for macOS)

Packages