Skip to content

Kenpath/indic-text-normalization

Repository files navigation

Indic Text Normalization

A production-ready, comprehensive Python package for text normalization across multiple Indian languages, built on Weighted Finite-State Transducers (WFST) using Pynini.

Note: This project is an extension of NVIDIA NeMo Text Processing, focused on providing comprehensive support for Indic languages. It maintains the same architecture and a similar API while adding specialized language modules for 19 Indian languages.

Overview

indic-text-normalization is a production-ready, WFST-based library that provides deterministic, low-latency, and explainable text normalization for TTS, ASR, and NLP pipelines, converting semiotic entities (numbers, dates, currency) into spoken form across 19 Indian languages.

Supported Languages

  • Hindi (hi) - हिन्दी
  • Bengali (bn) - বাংলা
  • Marathi (mr) - मराठी
  • Telugu (te) - తెలుగు
  • Kannada (kn) - ಕನ್ನಡ
  • Bhojpuri (bho) - भोजपुरी
  • Magahi (mag) - मगही
  • Chhattisgarhi (hne) - छत्तीसगढ़ी
  • Maithili (mai) - मैथिली
  • Assamese (as) - অসমীয়া
  • Bodo (brx) - बड़ो
  • Dogri (doi) - डोगरी
  • Gujarati (gu) - ગુજરાતી
  • Malayalam (ml) - മലയാളം
  • Punjabi (pa) - ਪੰਜਾਬੀ
  • Tamil (ta) - தமிழ்
  • English (Indian) (en) - English
  • Nepali (ne) - नेपाली
  • Sanskrit (sa) - संस्कृत

Features

Semiotic Classes

Each language module supports normalization of:

  • Cardinal Numbers - Both native script and Arabic numerals
  • Ordinal Numbers - First, second, third, etc.
  • Decimal Numbers - Decimal point verbalization
  • Fractions - Proper fraction handling
  • Date - Multiple date formats (DD/MM/YYYY, YYYY-MM-DD, etc.)
  • Time - 12-hour and 24-hour formats
  • Telephone Numbers - Digit-by-digit verbalization
  • Measurements - Units with proper verbalization
  • Money - Currency symbols and amounts
  • Electronic - Email addresses, URLs, hashtags
  • Roman Numerals - Conversion to spoken form
  • Abbreviations - Common abbreviations expansion
  • Whitelist - Custom word mappings

Installation

Prerequisites

  • Python 3.10, 3.11, or 3.12
  • For macOS: Homebrew (for installing OpenFST)

Method 1: Using uv (Recommended for macOS)

uv is a fast Python package installer. This is the recommended method for macOS users.

# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# or with Homebrew: brew install uv

# Install OpenFST (required for pynini)
brew install openfst

# Create virtual environment
uv venv
source .venv/bin/activate  # On macOS/Linux

# Install pynini with proper compiler flags
CPLUS_INCLUDE_PATH=/opt/homebrew/include LIBRARY_PATH=/opt/homebrew/lib uv pip install -e .

Note: The project uses pynini>=2.1.6.post1 which is compatible with the latest OpenFST (1.8.4+).

Method 2: Using Conda

conda create -n indic_tn python=3.10
conda activate indic_tn
conda install -c conda-forge pynini
pip install -r requirements.txt

Method 3: Using pip (macOS)

# Install OpenFST
brew install openfst

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install pynini with proper compiler flags
CPLUS_INCLUDE_PATH=/opt/homebrew/include LIBRARY_PATH=/opt/homebrew/lib pip install pynini

# Install all dependencies
pip install -r requirements.txt

Troubleshooting

macOS ARM (M1/M2/M3) Users: If you encounter compilation errors with pynini, ensure:

  • OpenFST is installed via Homebrew: brew install openfst
  • Use the compiler flags when installing pynini: CPLUS_INCLUDE_PATH=/opt/homebrew/include LIBRARY_PATH=/opt/homebrew/lib

Version Compatibility: This package requires pynini>=2.1.7 which is compatible with OpenFST 1.8.4+. Older versions of pynini (2.1.6.post1 and below) may have compatibility issues with newer OpenFST installations.

Usage

Basic Usage

from indic_text_normalization import Normalizer

# Initialize normalizer for a specific language
normalizer = Normalizer(input_case='cased', lang='hi')

# Normalize text
text = "मैं 25 साल का हूं और मेरा फोन नंबर 9876543210 है।"
normalized = normalizer.normalize(text)
print(normalized)
# Output: मैं पच्चीस साल का हूं और मेरा फोन नंबर नौ आठ सात छह पांच चार तीन दो एक शून्य है।

Language-Specific Examples

Tamil:

normalizer = Normalizer(input_case='cased', lang='ta')
text = "எனக்கு 1,234 ரூபாய் வேண்டும்"
normalized = normalizer.normalize(text)

Malayalam:

normalizer = Normalizer(input_case='cased', lang='ml')
text = "സമയം 10:30 ആണ്"
normalized = normalizer.normalize(text)

Gujarati:

normalizer = Normalizer(input_case='cased', lang='gu')
text = "તારીખ 15/08/2024 છે"
normalized = normalizer.normalize(text)

Project Structure

indic-text-normalization/
├── text_normalization/
│   ├── hi/          # Hindi
│   ├── ta/          # Tamil
│   ├── ml/          # Malayalam
│   ├── gu/          # Gujarati
│   ├── as/          # Assamese
│   ├── mai/         # Maithili
│   ├── mag/         # Magadhi
│   ├── hne/         # Chhattisgarhi
│   ├── brx/         # Bodo
│   ├── doi/         # Dogri
│   ├── pa/          # Punjabi
│   └── sa/          # Sanskrit
├── tests/           # Test notebooks and scripts
└── README.md

Each language directory contains:

  • taggers/ - FST-based taggers for each semiotic class
  • verbalizers/ - FST-based verbalizers for converting tagged text to spoken form
  • data/ - TSV files with language-specific terminology
  • tokenize_and_classify.py - Main classification logic

Testing

Tests are available in the tests/ directory for each language:

python tests/run_tests.py --lang hi
python tests/run_tests.py --lang ta
python tests/run_tests.py --lang mr
python tests/run_tests.py --lang te
python tests/run_tests.py --lang kn
python tests/run_tests.py --lang gu
python tests/run_tests.py --lang as
python tests/run_tests.py --lang brx
python tests/run_tests.py --lang doi
python tests/run_tests.py --lang pu
python tests/run_tests.py --lang mai
python tests/run_tests.py --lang mag
python tests/run_tests.py --lang hne
# ... etc.

Development

Adding a New Language

  1. Create a new directory under text_normalization/ with the language code
  2. Replicate the structure from an existing language (e.g., Hindi)
  3. Update all TSV data files with native terminology
  4. Modify taggers and verbalizers for language-specific rules
  5. Update normalizer.py to include the new language
  6. Create test cases

Contributing

Contributions are welcome! Please ensure:

  • All data uses authentic native terminology
  • Code follows existing patterns and structure
  • Test cases are provided for new features
  • Documentation is updated

Technical Details

WFST Architecture

The normalization pipeline uses Weighted Finite-State Transducers (WFST) implemented with Pynini:

  1. Tokenization - Input text is tokenized
  2. Classification - Each token is classified into semiotic classes
  3. Verbalization - Classified tokens are converted to spoken form
  4. Post-processing - Final cleanup and formatting

Digit Handling

  • Arabic Digits (0-9) - Converted to native script or verbalized directly
  • Native Script Digits - Devanagari (०-९), Tamil (௦-௯), etc.
  • Mixed Input - Handles both digit systems in the same text

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.

Acknowledgments

This project is built upon NVIDIA NeMo Text Processing, extending it with comprehensive support for 19 Indian languages. We are grateful to NVIDIA and Google for their foundational work on WFST-based text normalization.

Support

For questions, bug reports, or feature requests, please open an issue on the GitHub repository.

About

Text Normalization utilities for normalizing text for TTS

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages