WiMarka

WiMarka is a comprehensive Python library and CLI tool designed for evaluating machine translations with advanced syntactic and semantic analysis, providing detailed interpretability for Philippine Languages.




πŸ” Overview

WiMarka addresses the critical need for accurate machine translation evaluation in Philippine languages. It goes beyond simple metrics by providing:

  • Error Detection: Identifies specific translation errors between source and target texts
  • Multi-dimensional Scoring: Evaluates translations across fluency, adequacy, and overall quality
  • Explainability: Generates human-readable explanations for detected errors
  • Correction Suggestions: Provides corrected translation alternatives
  • Philippine Language Focus: Specialized support for Cebuano (CEB), Ilocano (ILO), and Tagalog (TGT)

✨ Features

  • πŸ” Error Detection: Advanced algorithms to identify translation inconsistencies and errors
  • πŸ“Š Multi-dimensional Scoring:
    • Fluency Score: Measures how natural the translation reads
    • Adequacy Score: Evaluates semantic completeness and accuracy
    • Overall Quality Score: Comprehensive translation quality assessment
  • πŸ’‘ Explainable Results: Detailed explanations for each detected error
  • πŸ”§ Correction Suggestions: AI-powered suggestions for improving translations
  • πŸ–₯️ Dual Interface: Both Python library and CLI for flexible integration
  • 🌏 Philippine Language Support: Specialized models for CEB, ILO, and TGT
  • πŸ“ Batch Processing: Evaluate multiple sentence pairs efficiently

🌐 Supported Languages

Code   Language   Role
EN     English    Source/Target
CEB    Cebuano    Target
ILO    Ilocano    Target
TGT    Tagalog    Target
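For scripting purposes, the code-to-language mapping above can be captured as a small lookup. The dictionary mirrors the table; the helper function itself is hypothetical and not part of the WiMarka API:

```python
# Illustrative lookup of the language codes WiMarka uses.
# The mapping mirrors the table above; language_name() is a hypothetical helper.
LANGUAGE_CODES = {
    'EN': 'English',
    'CEB': 'Cebuano',
    'ILO': 'Ilocano',
    'TGT': 'Tagalog',
}

def language_name(code: str) -> str:
    """Return the full language name for a WiMarka code (case-insensitive)."""
    try:
        return LANGUAGE_CODES[code.upper()]
    except KeyError:
        raise ValueError(f'Unsupported language code: {code!r}')

print(language_name('ceb'))  # Cebuano
```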

📦 Prerequisites

Before installing WiMarka, ensure you have:

  • Python >= 3.12
  • Microsoft Visual Studio with CMake installed

🚀 Installation

Using pip (Recommended)

Install directly from GitHub:

pip install git+https://github.com/wimarka-uic/WiMarka.git

From Source

# Clone the repository
git clone https://github.com/wimarka-uic/WiMarka.git
cd WiMarka

# Install in development mode
pip install -e .

💻 Usage

Python Library

Use WiMarka programmatically in your Python projects:

from wimarka.main import wmk_eval

# Evaluate translations
wmk_eval(
    src_file_path='source_file.txt',  # Path to source text file
    src_lang='EN',                     # Source language code
    tgt_file_path='target_file.txt',  # Path to target translation file
    tgt_lang='CEB'                     # Target language code
)

Input File Format

Both source and target files should be plain text files with:

  • One sentence per line
  • UTF-8 encoding
  • Equal number of lines in both files

Example source_file.txt:

Good morning!
How are you today?

Example target_file.txt:

Maayong buntag!
Kumusta ka karon?
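Because the evaluator expects parallel, line-aligned files, a quick pre-flight check can catch mismatches before a long evaluation run. This helper is a sketch, not part of WiMarka:

```python
# Hypothetical pre-flight check for WiMarka input files: both files must be
# UTF-8 encoded, one sentence per line, with an equal number of lines.
from pathlib import Path

def check_parallel_files(src_path: str, tgt_path: str) -> int:
    """Return the number of sentence pairs, or raise ValueError on mismatch."""
    src_lines = Path(src_path).read_text(encoding='utf-8').splitlines()
    tgt_lines = Path(tgt_path).read_text(encoding='utf-8').splitlines()
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f'Line count mismatch: {len(src_lines)} source vs {len(tgt_lines)} target'
        )
    return len(src_lines)

# Recreate the example files shown above, then validate them:
Path('source_file.txt').write_text('Good morning!\nHow are you today?\n', encoding='utf-8')
Path('target_file.txt').write_text('Maayong buntag!\nKumusta ka karon?\n', encoding='utf-8')
print(check_parallel_files('source_file.txt', 'target_file.txt'))  # 2
```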

Command-Line Interface (CLI)

Evaluate translations directly from the terminal:

wimarka --src_file_path source_file.txt \
        --src_lang EN \
        --tgt_file_path target_file.txt \
        --tgt_lang CEB

CLI Options

Option            Description                                Required
--src_file_path   Path to the source text file               Yes
--src_lang        Source language code (EN, CEB, ILO, TGT)   Yes
--tgt_file_path   Path to the target text file               Yes
--tgt_lang        Target language code (CEB, ILO, TGT)       Yes
-h, --help        Show help message                          No

📖 Documentation

For comprehensive documentation, visit WiMarka Documentation on ReadtheDocs.

The documentation includes:

  • User Manual: Installation, usage guides, examples, and best practices
  • Technical Manual: Architecture, API reference, and development guides

οΏ½πŸ“ Project Structure

WiMarka/
├── wimarka/                   # Main package directory
│   ├── __init__.py            # Package initialization
│   ├── main.py                # Core evaluation logic
│   ├── cli.py                 # Command-line interface
│   ├── config.py              # Configuration settings
│   ├── tasks/                 # Task modules
│   │   ├── error_detection.py     # Error detection logic
│   │   ├── scoring.py             # Translation scoring
│   │   ├── explanation.py         # Error explanation generation
│   │   └── correction.py          # Correction suggestion generation
│   └── utils/                 # Utility modules
│       ├── helper.py              # Helper functions
│       ├── logger.py              # Logging utilities
│       ├── model.py               # Model loading and management
│       ├── cache.py               # Caching utilities
│       └── torch.py               # PyTorch utilities
├── test/                      # Test files and examples
│   ├── main.py                # Test script
│   ├── source_file.txt        # Sample source file
│   └── target_file.txt        # Sample target file
├── setup.py                   # Package installation configuration
├── requirements.txt           # Python dependencies
├── LICENSE                    # MIT License
└── README.md                  # This file

βš™οΈ How It Works

WiMarka follows a four-stage evaluation pipeline:

  1. Error Detection 🔍

    • Analyzes source and target sentences
    • Identifies syntactic and semantic errors
    • Categorizes error types
  2. Scoring 📊

    • Calculates fluency score (0-100)
    • Calculates adequacy score (0-100)
    • Computes overall quality score (0-100)
  3. Explanation Generation 💡

    • Provides human-readable explanations for each error
    • Contextualizes issues in terms of linguistic quality
    • Highlights specific problematic segments
  4. Correction Suggestion 🔧

    • Generates improved translation alternatives
    • Addresses identified errors
    • Maintains semantic integrity
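The four stages above can be sketched as a toy pipeline. Everything in this sketch is illustrative: the function names are not WiMarka internals, and aggregating the overall score as the mean of fluency and adequacy is only an assumption consistent with the sample output in the Example section (95 and 40 averaging to 67.5):

```python
# Toy sketch of the four-stage pipeline described above. None of these
# function names come from WiMarka, and the overall score being the mean
# of fluency and adequacy is an assumption, not documented behavior.

def detect_errors(src: str, tgt: str) -> list[str]:
    # Placeholder: a real detector would perform syntactic/semantic analysis.
    return ['Semantic mismatch'] if 'morning' in src and 'gabi' in tgt else []

def overall_score(fluency: float, adequacy: float) -> float:
    # Assumed aggregation: simple mean of the two 0-100 scores.
    return (fluency + adequacy) / 2

def evaluate(src: str, tgt: str, fluency: float, adequacy: float) -> dict:
    errors = detect_errors(src, tgt)                       # stage 1
    overall = overall_score(fluency, adequacy)             # stage 2
    explanation = ('Possible error detected.' if errors    # stage 3
                   else 'No issues found.')
    return {'errors': errors, 'overall': overall, 'explanation': explanation}
    # stage 4 (correction suggestion) is omitted from this sketch

result = evaluate('Good morning!', 'Magandang gabi!', fluency=95, adequacy=40)
print(result['overall'])  # 67.5
```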

πŸ“ Example

Here's a complete example demonstrating WiMarka's capabilities:

Input Files:

en_source.txt:

Good morning!
How are you today?

ceb_translation.txt:

Magandang gabi!
Kamusta ka na ngayon?

Python Code:

from wimarka.main import wmk_eval

wmk_eval(
    src_file_path='en_source.txt',
    src_lang='EN',
    tgt_file_path='ceb_translation.txt',
    tgt_lang='CEB'
)

Sample Output:

Evaluating line 1/2
Detecting errors...
Scoring translation...
Generating explanation...
Correcting translation...

Evaluating line 2/2
Detecting errors...
Scoring translation...
Generating explanation...
Correcting translation...

=== Evaluation Results ===
----------------------------------------
Line 1:
  Source: Good morning!
  Target: Magandang gabi!
  Errors: [Semantic mismatch: "morning" vs "gabi" (evening)]
  Fluency Score: 95/100
  Adequacy Score: 40/100
  Overall Score: 67.5/100
  Explanation: The translation has incorrect time reference...
  Suggested Correction: Maayong buntag!
----------------------------------------

Evaluation completed.

πŸ› οΈ Development

Setting Up Development Environment

# Clone the repository
git clone https://github.com/wimarka-uic/WiMarka.git
cd WiMarka

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install in editable mode
pip install -e .

Running Tests

cd test
python main.py

🤝 Contributing

We welcome contributions from the community! To contribute:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Please ensure your code follows the project's coding standards and includes appropriate tests.


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Copyright 2025 University of the Immaculate Conception - College of Computer Studies

👥 Authors

University of the Immaculate Conception - College of Computer Studies


📚 Citation

If you use WiMarka in your research, please cite:

@software{wimarka2025,
  title={WiMarka: A Reference-free Evaluation Metric for Machine Translation of Philippine Languages},
  author={University of the Immaculate Conception},
  year={2025},
  url={https://github.com/wimarka-uic/WiMarka}
}

πŸ™ Acknowledgments


📧 Contact & Support

For questions, issues, or suggestions:


Made with ❤️ for Philippine Languages
