
📄 PDF Translator - Multi-Language PDF Translation Tool


A powerful Python tool to translate PDF documents between 100+ languages with support for both regular and scanned PDFs using OCR technology.

✨ Features

  • 🌍 100+ Languages - Translate between any language pair (Sinhala, English, French, Spanish, Chinese, Arabic, etc.)
  • πŸ“‘ Regular PDFs - Extract and translate text-based PDFs
  • πŸ“· Scanned PDFs - OCR support for image-based/scanned documents
  • 🎨 Smart Formatting - Preserves document structure and formatting
  • πŸ“ Exam Paper Mode - Special formatting for exams/tests (keeps questions together)
  • πŸ” Language Search - Easy-to-use searchable language selector
  • πŸ–±οΈ GUI Interface - User-friendly file and folder pickers
  • πŸ“¦ Auto-Install - Automatically installs Python dependencies
  • πŸ’― Free - Uses Google Translate API (no API key required)
  • πŸ”„ Retry Logic - Automatically retries failed translations


🚀 Quick Start

# Clone the repository
git clone https://github.com/Monster-ZeroX/PDF-Translator.git
cd PDF-Translator

# Run the translator (Python dependencies auto-install on first run!)
python pdf_translator.py

The script will:

  1. ✅ Auto-install any missing Python packages
  2. 📂 Open a file picker to select your PDF
  3. 🌍 Ask for source language (the PDF's current language)
  4. 🌍 Ask for target language (what you want to translate to)
  5. 📝 Ask if it's an exam paper (for better formatting)
  6. ✨ Process and save the translated PDF!
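
The auto-install step can be pictured roughly as follows. This is a minimal sketch, not the script's actual code: the package list is taken from the example output further down, the helper names (`missing_packages`, `pip_install_command`) are hypothetical, and the mapping from pip names to import names is an assumption about how such a check is usually written.

```python
import importlib.util
import sys

# pip distribution name -> importable module name.
# Package names come from the README's example output; the import names
# are the ones these libraries normally expose.
REQUIRED = {
    "pdfplumber": "pdfplumber",
    "pdf2image": "pdf2image",
    "Pillow": "PIL",
    "pytesseract": "pytesseract",
    "deep-translator": "deep_translator",
    "reportlab": "reportlab",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose top-level module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


def pip_install_command(packages):
    """Build a pip command that targets the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", *packages]
```

Calling `pip_install_command(missing_packages())` gives the command a script could pass to `subprocess.run`. Using `sys.executable -m pip` rather than a bare `pip` keeps the install tied to the same Python that runs the translator.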

📥 Installation

Windows

Step 1: Install Python

  1. Download Python from python.org
  2. ⚠️ Important: Check "Add Python to PATH" during installation
  3. Verify installation:
    python --version

Step 2: Get the Code

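# Clone this repository (or download and extract it), then enter its folder
git clone https://github.com/Monster-ZeroX/PDF-Translator.git
cd PDF-Translator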

# That's it! Python packages will auto-install when you run the script
# (Optional) Manually install: pip install -r requirements.txt

Step 3: (Optional) Install OCR Dependencies for Scanned PDFs

⚠️ Only needed if you want to translate scanned/image PDFs!
Regular text-based PDFs work perfectly fine without these!

Tesseract OCR:

  1. Download from UB Mannheim Tesseract
  2. Install the .exe file
  3. During installation, select language data for your source language (e.g., Sinhala, Arabic, etc.)
  4. Add the Tesseract install folder to PATH; if Tesseract is not found, the script prints instructions

Poppler:

  1. Download from Poppler Windows
  2. Extract to C:\poppler (or any location)
  3. Add C:\poppler\Library\bin to your system PATH:
    • Right-click "This PC" → Properties → Advanced System Settings
    • Environment Variables → System Variables → Path → Edit
    • Add new entry: C:\poppler\Library\bin

Step 4: Run the Translator

# Simple way - just run it!
python pdf_translator.py

# Or use the batch file
run_translator.bat

macOS

Step 1: Install Homebrew (if not installed)

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Step 2: Install Python (if not installed)

brew install python
python3 --version

Step 3: Get the Code

git clone https://github.com/Monster-ZeroX/PDF-Translator.git
cd PDF-Translator

# That's it! Python packages will auto-install when you run the script
# (Optional) Manually install: pip3 install -r requirements.txt

Step 4: (Optional) Install OCR Dependencies for Scanned PDFs

⚠️ Only needed for scanned/image PDFs!
Regular text-based PDFs work perfectly fine without these!

# Install Tesseract OCR
brew install tesseract

# Install Poppler
brew install poppler

# (Optional) Install additional language data for Tesseract: brew install tesseract-lang
# The script will show you instructions if specific language data is missing

Step 5: Run the Translator

python3 pdf_translator.py

Linux

Step 1: Install Python (usually pre-installed)

# Check Python version
python3 --version

# If not installed (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install python3 python3-pip

# For Fedora/RHEL
sudo dnf install python3 python3-pip

Step 2: Get the Code

git clone https://github.com/Monster-ZeroX/PDF-Translator.git
cd PDF-Translator

# That's it! Python packages will auto-install when you run the script
# (Optional) Manually install: pip3 install -r requirements.txt

Step 3: (Optional) Install OCR Dependencies for Scanned PDFs

⚠️ Only needed for scanned/image PDFs!
Regular text-based PDFs work perfectly fine without these!

Ubuntu/Debian:

# Install Tesseract OCR
sudo apt-get update
sudo apt-get install tesseract-ocr

# Install Poppler
sudo apt-get install poppler-utils

# Install language data (example for Sinhala)
sudo apt-get install tesseract-ocr-sin

Fedora/RHEL:

# Install Tesseract
sudo dnf install tesseract

# Install Poppler
sudo dnf install poppler-utils

# Install language data (example for Sinhala)
sudo dnf install tesseract-langpack-sin

Arch Linux:

# Install Tesseract
sudo pacman -S tesseract

# Install Poppler
sudo pacman -S poppler

# Install language data (example for Sinhala)
sudo pacman -S tesseract-data-sin

Step 4: Run the Translator

python3 pdf_translator.py

🎯 Usage

Basic Usage (GUI Mode - Easiest!)

python pdf_translator.py

This will:

  1. ✅ Automatically install missing Python packages
  2. 📂 Open file picker for PDF selection
  3. 🌍 Show language selector for source language
  4. 🌍 Show language selector for target language
  5. 📝 Ask about document type (exam vs regular)
  6. ✨ Process and save the translation!

Command Line Options

# Specify PDF file directly
python pdf_translator.py document.pdf

# Specify output directory
python pdf_translator.py document.pdf -o /path/to/output

# Force OCR mode (for scanned PDFs)
python pdf_translator.py document.pdf --ocr

# Specify languages directly (skip language selector)
python pdf_translator.py document.pdf --source-lang si --target-lang en

# Format as exam paper (skip dialog)
python pdf_translator.py document.pdf --exam

# Save as both PDF and TXT
python pdf_translator.py document.pdf --format both

# Skip dependency checks (if you know what you're doing)
python pdf_translator.py document.pdf --skip-checks

Full Command Line Example

# Translate Sinhala exam to English, save as PDF
python pdf_translator.py exam.pdf -o ./translated --source-lang si --target-lang en --exam --format pdf

Available Arguments

Argument        Description
pdf_path        Path to PDF file (optional; opens file picker if not provided)
-o, --output    Output directory (default: same as input)
--ocr           Force OCR mode for scanned PDFs
--no-auto       Disable auto-detection of OCR need
--gui           Force GUI file picker even if path is provided
--format        Output format: pdf (default), txt, or both
--exam          Format as exam paper (skip dialog)
--no-exam       Format as regular document (skip dialog)
--source-lang   Source language code (e.g., si, fr, de)
--target-lang   Target language code (e.g., en, es, zh-CN)
--skip-checks   Skip dependency checks (not recommended)
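
For readers who want to see how such an interface maps onto `argparse`, here is a sketch reconstructed only from the table above. It is not the script's real parser: the function name `build_parser`, the help strings, and the choice to model `--exam`/`--no-exam` as one tri-state value (None meaning "ask in a dialog") are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(prog="pdf_translator.py")
    parser.add_argument("pdf_path", nargs="?",
                        help="PDF to translate; a file picker opens if omitted")
    parser.add_argument("-o", "--output",
                        help="output directory (default: same as input)")
    parser.add_argument("--ocr", action="store_true",
                        help="force OCR mode for scanned PDFs")
    parser.add_argument("--no-auto", action="store_true",
                        help="disable auto-detection of OCR need")
    parser.add_argument("--gui", action="store_true",
                        help="force the file picker even if a path is given")
    parser.add_argument("--format", choices=["pdf", "txt", "both"],
                        default="pdf", help="output format")
    # --exam and --no-exam share one destination; None means "ask the user".
    exam = parser.add_mutually_exclusive_group()
    exam.add_argument("--exam", dest="exam", action="store_const",
                      const=True, default=None, help="format as exam paper")
    exam.add_argument("--no-exam", dest="exam", action="store_const",
                      const=False, help="format as regular document")
    parser.add_argument("--source-lang", help="source language code, e.g. si")
    parser.add_argument("--target-lang", help="target language code, e.g. en")
    parser.add_argument("--skip-checks", action="store_true",
                        help="skip dependency checks")
    return parser
```

Parsing the full command line example from above with this sketch yields `pdf_path="exam.pdf"`, `source_lang="si"`, `target_lang="en"`, and `exam=True`.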

🌍 Supported Languages

The tool supports 100+ languages including:

Language              Code   Language               Code   Language    Code
Afrikaans             af     Arabic                 ar     Bengali     bn
Chinese (Simplified)  zh-CN  Chinese (Traditional)  zh-TW  Czech       cs
Danish                da     Dutch                  nl     English     en
Finnish               fi     French                 fr     German      de
Greek                 el     Hebrew                 he     Hindi       hi
Hungarian             hu     Indonesian             id     Italian     it
Japanese              ja     Korean                 ko     Malay       ms
Norwegian             no     Polish                 pl     Portuguese  pt
Romanian              ro     Russian                ru     Sinhala     si
Spanish               es     Swedish                sv     Tamil       ta
Telugu                te     Thai                   th     Turkish     tr
Ukrainian             uk     Urdu                   ur     Vietnamese  vi

And 70+ more languages! Use the built-in search feature to find your language.

📤 Output Files

The script creates the following files:

original_document.pdf
├── original_document_si.txt    (Extracted source text - for reference)
├── original_document_en.txt    (Translated text - backup)
└── original_document_en.pdf    (Translated PDF - main output)

Features:

  • ✅ Maintains exact structure and order of original
  • ✅ Questions remain in same order
  • ✅ Line breaks preserved
  • ✅ Page markers maintained
  • 📝 Exam Mode: Tight spacing, question detection (1., 2., a), b), etc.)
  • 📄 Regular Mode: Standard paragraph formatting
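
Question detection of the kind Exam Mode describes (markers such as 1., 2., a), b)) is typically done with a regular expression. The sketch below shows one possible approach; the patterns and the function names are assumptions, and the script's real rules may differ.

```python
import re

# "1. ", "12. " and similar: a number, a dot, then whitespace and text.
NUMBERED = re.compile(r"^\s*\d+\.\s+\S")
# "a) ", "B) " and similar: a single letter, a closing parenthesis, then text.
SUB_ITEM = re.compile(r"^\s*[A-Za-z]\)\s+\S")


def is_question_marker(line):
    """True if the line starts a numbered question or a lettered sub-item."""
    return bool(NUMBERED.match(line) or SUB_ITEM.match(line))


def group_questions(lines):
    """Group lines into blocks; each numbered question starts a new block."""
    blocks = []
    for line in lines:
        if NUMBERED.match(line) or not blocks:
            blocks.append([line])
        else:
            # Sub-items and continuation lines stay with the current question.
            blocks[-1].append(line)
    return blocks
```

Each block could then be laid out as a unit so it does not break across pages; reportlab's `KeepTogether` flowable is one way to do that, though whether the tool uses it is not stated here.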

Output Format Options:

  • --format pdf (default) - Creates PDF + TXT files
  • --format txt - Creates only TXT files
  • --format both - Creates both formats explicitly

📚 Examples

Example 1: Translate Sinhala Exam to English (GUI Mode)

python pdf_translator.py
# Select file → Select Sinhala → Select English → Select YES for exam paper

Example 2: Translate French Document to Spanish

python pdf_translator.py french_doc.pdf --source-lang fr --target-lang es

Example 3: Translate Scanned Arabic PDF to English

python pdf_translator.py scanned_arabic.pdf --source-lang ar --target-lang en --ocr

Example 4: Batch Process with Command Line

# Translate multiple PDFs
python pdf_translator.py doc1.pdf --source-lang si --target-lang en --exam
python pdf_translator.py doc2.pdf --source-lang si --target-lang en --exam
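
For more than a handful of files, the same commands can be generated in a loop. The wrapper below is a sketch, not a feature of the tool: `build_command` and `translate_folder` are hypothetical helpers that only use the flags documented above, and skipping files that already end in `_<target>.pdf` is an assumption meant to avoid re-translating earlier output.

```python
import subprocess
import sys
from pathlib import Path


def build_command(pdf, source_lang, target_lang, exam=False):
    """Build one pdf_translator.py invocation using the documented flags."""
    cmd = [sys.executable, "pdf_translator.py", str(pdf),
           "--source-lang", source_lang, "--target-lang", target_lang]
    if exam:
        cmd.append("--exam")
    return cmd


def translate_folder(folder, source_lang, target_lang, exam=False):
    """Run the translator once per PDF in a folder; return exit codes by file name."""
    results = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # Skip files that look like output of a previous run.
        if pdf.stem.endswith(f"_{target_lang}"):
            continue
        command = build_command(pdf, source_lang, target_lang, exam)
        results[pdf.name] = subprocess.run(command, check=False).returncode
    return results
```

`translate_folder("./exams", "si", "en", exam=True)` would then process every PDF in `./exams` in turn, assuming it is run from the repository folder.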

Example Output:

============================================================
Processing: exam_paper.pdf
============================================================

Checking Python packages...
✓ pdfplumber
✓ pdf2image
✓ Pillow
✓ pytesseract
✓ deep-translator
✓ reportlab
✓ All Python packages installed!

============================================================
Checking system dependencies...
============================================================
✓ Tesseract OCR found
✓ Poppler found
============================================================

Attempting to extract text from PDF...
  ✓ Extracted text from page 1
  ✓ Extracted text from page 2
  ...

✓ Saved Sinhala text to: exam_paper_si.txt

Translating text from Sinhala to English...
(Language codes: si → en)
  Translating lines 1-10/415... (attempt 1/3)
    ✓ Success!
  ...

✓ Saved English text to: exam_paper_en.txt

Creating PDF document...
  ✓ PDF created successfully!
✓ Saved English PDF to: exam_paper_en.pdf

============================================================
✅ Translation completed successfully!
============================================================

🔧 Troubleshooting

"No text extracted from PDF"

Problem: The PDF might be scanned/image-based.

Solution:

  1. Install Tesseract and Poppler (see Installation section above)
  2. Run with --ocr flag: python pdf_translator.py --ocr
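
To see why a scanned PDF produces this message, it helps to look at how an OCR fallback is commonly decided. The heuristic below is an assumption for illustration, not the script's documented logic: it takes the text extracted from each page (with pdfplumber this would be `page.extract_text()` for every page) and treats the document as scanned when almost nothing comes back. The threshold of 20 characters per page is arbitrary.

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is scanned from the text extracted per page.

    page_texts holds one entry per page; an entry may be None or empty
    when the page contains only an image.
    """
    if not page_texts:
        return True
    # Count non-whitespace characters across all pages.
    total = sum(len("".join((text or "").split())) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page
```

A document whose pages all yield `None` or blank strings returns True here, which corresponds to the case where the `--ocr` flag is needed.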

"Tesseract OCR not found" or "Poppler not found"

Problem: OCR dependencies not installed or not in PATH.

Solution:

  • Check the terminal output: the script reports exactly what is missing
  • Follow the installation steps for your OS (Windows/macOS/Linux sections above)
  • Windows: install Tesseract and Poppler as described in Step 3 of the Windows installation, and make sure both are on PATH
  • macOS: brew install tesseract poppler
  • Linux: sudo apt-get install tesseract-ocr poppler-utils
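
You can also check PATH yourself from Python. The snippet is a small sketch built on `shutil.which`; it assumes that Poppler is detected through its `pdftoppm` executable (the program pdf2image calls), and the function names are hypothetical.

```python
import shutil


def find_system_tools(which=shutil.which):
    """Map each OCR dependency to its executable path, or None if not on PATH."""
    return {
        "Tesseract OCR": which("tesseract"),
        "Poppler": which("pdftoppm"),
    }


def report(tools):
    """Turn the lookup result into one status line per tool."""
    return [
        f"{name} found" if path else f"{name} not found"
        for name, path in tools.items()
    ]
```

Running `print("\n".join(report(find_system_tools())))` in the same terminal you use for the translator shows whether that shell can see both programs.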

"Language data not found for [language]"

Problem: Tesseract doesn't have the language data file.

Solution:

  • Windows: Download from tessdata and copy to C:\Program Files\Tesseract-OCR\tessdata\
  • macOS: brew install tesseract-lang
  • Linux: sudo apt-get install tesseract-ocr-[langcode] (e.g., tesseract-ocr-sin for Sinhala)
  • Verify with: tesseract --list-langs
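
Note that Tesseract names its language data differently from the translation codes used on the command line: Sinhala is `si` for translation but `sin` for Tesseract. The sketch below checks the output of `tesseract --list-langs` against a small mapping. The mapping is a partial, illustrative subset and the function names are hypothetical.

```python
import subprocess

# Translation code -> Tesseract language data name (illustrative subset).
TESSERACT_CODES = {
    "si": "sin",
    "ar": "ara",
    "en": "eng",
    "fr": "fra",
    "zh-CN": "chi_sim",
}


def parse_list_langs(output):
    """Parse the text printed by `tesseract --list-langs` into a set of names."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first line is a header such as 'List of available languages ... (3):'.
    return set(lines[1:])


def installed_languages():
    """Ask the local Tesseract install which language data files it has."""
    result = subprocess.run(["tesseract", "--list-langs"],
                            capture_output=True, text=True, check=True)
    # Some Tesseract versions print the list to stderr, so fall back to it.
    return parse_list_langs(result.stdout or result.stderr)


def has_language_data(translation_code, installed):
    """True if the Tesseract data for a translation code is installed."""
    return TESSERACT_CODES.get(translation_code) in installed
```

With this, `has_language_data("si", installed_languages())` tells you whether the Sinhala data file is available before you start an OCR run.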

"Translation failed" or API errors

Problem: Rate limiting or network issues.

Solution:

  • The script has built-in retry logic (3 attempts with delays)
  • Wait a few minutes and try again
  • Check your internet connection
  • For very large documents, the script automatically batches translations
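
The retry and batching behaviour described above can be sketched as follows. The batch size of 10 lines and the 3 attempts are taken from the example output; the delay values, the function names, and the idea of joining a batch with newlines are assumptions. The `translate` argument stands for any callable that translates a string; with deep-translator it might be `GoogleTranslator(source="si", target="en").translate`.

```python
import time


def batches(lines, size=10):
    """Yield (start_index, chunk) pairs of at most `size` lines."""
    for start in range(0, len(lines), size):
        yield start, lines[start:start + size]


def translate_with_retry(translate, text, attempts=3, delay=2.0, sleep=time.sleep):
    """Call translate(text), retrying with a growing delay on failure."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return translate(text)
        except Exception as error:  # network errors, rate limits, etc.
            last_error = error
            if attempt < attempts:
                sleep(delay * attempt)
    raise RuntimeError(f"Translation failed after {attempts} attempts") from last_error


def translate_lines(translate, lines, size=10, **retry_options):
    """Translate lines batch by batch, keeping the original order."""
    translated = []
    for _, chunk in batches(lines, size):
        result = translate_with_retry(translate, "\n".join(chunk), **retry_options)
        translated.extend(result.split("\n"))
    return translated
```

Sending several lines per request keeps the number of calls low, which is what makes rate limits less likely on large documents. If a batch still fails after the last attempt, waiting a few minutes before rerunning is the practical fallback.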

"Questions split across pages" (Exam papers)

Problem: Questions breaking across page boundaries.

Solution:

  • Select "YES" when asked if it's an exam paper
  • Or use --exam flag: python pdf_translator.py document.pdf --exam

Missing Python packages

Problem: ImportError or module not found.

Solution:

# The script auto-installs on first run, but you can manually install:
pip install -r requirements.txt

# Or install individually:
pip install pdfplumber pdf2image Pillow pytesseract deep-translator reportlab

Permission errors on Linux/macOS

Problem: Permission denied when installing packages.

Solution:

# Use pip with --user flag
pip install --user -r requirements.txt

# Or use virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Poor OCR quality

Problem: OCR text has many errors.

Solution:

  • Use higher quality PDF scans
  • Try adjusting DPI in the code (change dpi=300 to dpi=400)
  • Make sure correct language data is installed
  • Preprocess images (enhance contrast, remove noise)

⚠️ Important Notes

  • Internet Required: Translation requires internet connection
  • OCR Optional: Only needed for scanned PDFs - regular text PDFs work without it
  • Free Service: Uses Google Translate (may have rate limits for very large documents)
  • Best Results: High-quality PDF scans produce best OCR results

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

📧 Support

If you encounter any issues:

  1. Check the Troubleshooting section above
  2. Look at the terminal output - it shows exactly what's missing
  3. Follow the installation guide for your OS
  4. Create an issue with:
    • Your operating system
    • Python version (python --version)
    • Error message (copy from terminal)
    • Steps to reproduce

⭐ Star History

If this tool helped you, please consider giving it a star! ⭐


Made with ❀️ for the community
