📄 PDF Translator - Multi-Language PDF Translation Tool

A Python tool that translates PDF documents between 100+ languages. It handles text-based PDFs directly and scanned PDFs through OCR.

✨ Features

🌍 100+ Languages - Translate between any language pair (Sinhala, English, French, Spanish, Chinese, Arabic, etc.)
📑 Regular PDFs - Extract and translate text-based PDFs
📷 Scanned PDFs - OCR support for image-based/scanned documents
🎨 Smart Formatting - Preserves document structure and formatting
📝 Exam Paper Mode - Special formatting for exams/tests (keeps questions together)
🔍 Language Search - Easy-to-use searchable language selector
🖱️ GUI Interface - User-friendly file and folder pickers
📦 Auto-Install - Automatically installs Python dependencies
💯 Free - Uses Google Translate through the deep-translator package (no API key required)
🔄 Retry Logic - Automatically retries failed translations

🚀 Quick Start

```
# Clone the repository
git clone https://github.com/Monster-ZeroX/PDF-Translator.git
cd PDF-Translator

# Run the translator (Python dependencies auto-install on first run!)
python pdf_translator.py
```

The script will:

✅ Auto-install any missing Python packages
📂 Open a file picker to select your PDF
🌍 Ask for source language (the PDF's current language)
🌍 Ask for target language (what you want to translate to)
📝 Ask if it's an exam paper (for better formatting)
✨ Process and save the translated PDF!

📥 Installation

Windows

Step 1: Install Python

Download Python from python.org
⚠️ Important: Check "Add Python to PATH" during installation
Verify installation:
```
python --version
```

Step 2: Get the Code

# Clone or download this repository
cd PDF-Translator

# That's it! Python packages will auto-install when you run the script
# (Optional) Manually install: pip install -r requirements.txt

Step 3: (Optional) Install OCR Dependencies for Scanned PDFs

⚠️ Only needed if you want to translate scanned/image PDFs!
Regular text-based PDFs work perfectly fine without these!

Tesseract OCR:

Download from UB Mannheim Tesseract
Install the .exe file
During installation, select language data for your source language (e.g., Sinhala, Arabic, etc.)
Add the Tesseract install folder to PATH; if it cannot be found, the script shows instructions

Poppler:

Download from Poppler Windows
Extract to C:\poppler (or any location)
Add C:\poppler\Library\bin to your system PATH:
- Right-click "This PC" → Properties → Advanced System Settings
- Environment Variables → System Variables → Path → Edit
- Add new entry: C:\poppler\Library\bin

Step 4: Run the Translator

# Simple way - just run it!
python pdf_translator.py

# Or use the batch file
run_translator.bat

macOS

Step 1: Install Homebrew (if not installed)

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Step 2: Install Python (if not installed)

brew install python
python3 --version

Step 3: Get the Code

cd PDF-Translator

# That's it! Python packages will auto-install when you run the script
# (Optional) Manually install: pip3 install -r requirements.txt

Step 4: (Optional) Install OCR Dependencies for Scanned PDFs

⚠️ Only needed for scanned/image PDFs!
Regular text-based PDFs work perfectly fine without these!

# Install Tesseract OCR
brew install tesseract

# Install Poppler
brew install poppler

# Install additional language data for Tesseract (if needed)
brew install tesseract-lang
# The script will show you instructions if specific language data is still missing

Step 5: Run the Translator

python3 pdf_translator.py

Linux

Step 1: Install Python (usually pre-installed)

# Check Python version
python3 --version

# If not installed (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install python3 python3-pip

# For Fedora/RHEL
sudo dnf install python3 python3-pip

Step 2: Get the Code

cd PDF-Translator

# That's it! Python packages will auto-install when you run the script
# (Optional) Manually install: pip3 install -r requirements.txt

Step 3: (Optional) Install OCR Dependencies for Scanned PDFs

⚠️ Only needed for scanned/image PDFs!
Regular text-based PDFs work perfectly fine without these!

Ubuntu/Debian:

# Install Tesseract OCR
sudo apt-get update
sudo apt-get install tesseract-ocr

# Install Poppler
sudo apt-get install poppler-utils

# Install language data (example for Sinhala)
sudo apt-get install tesseract-ocr-sin

Fedora/RHEL:

# Install Tesseract
sudo dnf install tesseract

# Install Poppler
sudo dnf install poppler-utils

# Install language data (example for Sinhala)
sudo dnf install tesseract-langpack-sin

Arch Linux:

# Install Tesseract
sudo pacman -S tesseract

# Install Poppler
sudo pacman -S poppler

# Install language data (example for Sinhala)
sudo pacman -S tesseract-data-sin

Step 4: Run the Translator

python3 pdf_translator.py

🎯 Usage

Basic Usage (GUI Mode - Easiest!)

python pdf_translator.py

This will:

✅ Automatically install missing Python packages
📂 Open file picker for PDF selection
🌍 Show language selector for source language
🌍 Show language selector for target language
📝 Ask about document type (exam vs regular)
✨ Process and save the translation!

Command Line Options

```
# Specify PDF file directly
python pdf_translator.py document.pdf

# Specify output directory
python pdf_translator.py document.pdf -o /path/to/output

# Force OCR mode (for scanned PDFs)
python pdf_translator.py document.pdf --ocr

# Specify languages directly (skip language selector)
python pdf_translator.py document.pdf --source-lang si --target-lang en

# Format as exam paper (skip dialog)
python pdf_translator.py document.pdf --exam

# Save as both PDF and TXT
python pdf_translator.py document.pdf --format both

# Skip dependency checks (if you know what you're doing)
python pdf_translator.py document.pdf --skip-checks
```

Full Command Line Example

```
# Translate Sinhala exam to English, save as PDF
python pdf_translator.py exam.pdf -o ./translated --source-lang si --target-lang en --exam --format pdf
```

Available Arguments

| Argument | Description |
|---|---|
| `pdf_path` | Path to PDF file (optional - opens file picker if not provided) |
| `-o, --output` | Output directory (default: same as input) |
| `--ocr` | Force OCR mode for scanned PDFs |
| `--no-auto` | Disable auto-detection of OCR need |
| `--gui` | Force GUI file picker even if path is provided |
| `--format` | Output format: `pdf` (default), `txt`, or `both` |
| `--exam` | Format as exam paper (skip dialog) |
| `--no-exam` | Format as regular document (skip dialog) |
| `--source-lang` | Source language code (e.g., `si`, `fr`, `de`) |
| `--target-lang` | Target language code (e.g., `en`, `es`, `zh-CN`) |
| `--skip-checks` | Skip dependency checks (not recommended) |
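
For readers who want to script around the tool, the argument table above can be mirrored with `argparse`. This is only a sketch built from the table: the function name `build_parser`, the help strings, and the choice to model `--exam`/`--no-exam` as one three-state option are assumptions, and the actual parser in `pdf_translator.py` may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented arguments (illustrative, not the script's own code)."""
    parser = argparse.ArgumentParser(prog="pdf_translator.py")
    # Optional positional: when omitted, the tool is documented to open a file picker.
    parser.add_argument("pdf_path", nargs="?", default=None, help="Path to PDF file")
    parser.add_argument("-o", "--output", default=None, help="Output directory")
    parser.add_argument("--ocr", action="store_true", help="Force OCR mode")
    parser.add_argument("--no-auto", action="store_true", help="Disable OCR auto-detection")
    parser.add_argument("--gui", action="store_true", help="Force GUI file picker")
    parser.add_argument("--format", choices=["pdf", "txt", "both"], default="pdf")
    # Assumption: exam is True / False / None, where None means "ask in a dialog".
    exam = parser.add_mutually_exclusive_group()
    exam.add_argument("--exam", dest="exam", action="store_const", const=True, default=None)
    exam.add_argument("--no-exam", dest="exam", action="store_const", const=False, default=None)
    parser.add_argument("--source-lang", default=None, help="Source language code")
    parser.add_argument("--target-lang", default=None, help="Target language code")
    parser.add_argument("--skip-checks", action="store_true", help="Skip dependency checks")
    return parser


# Parse the same arguments as the full command line example.
args = build_parser().parse_args(
    ["exam.pdf", "-o", "./translated", "--source-lang", "si",
     "--target-lang", "en", "--exam", "--format", "pdf"]
)
print(args.pdf_path, args.source_lang, args.target_lang, args.exam, args.format)
```

Leaving `exam` as `None` when neither flag is given keeps the "ask the user" case distinct from an explicit choice.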

🌍 Supported Languages

The tool supports 100+ languages including:

| Language | Code | Language | Code | Language | Code |
|---|---|---|---|---|---|
| Afrikaans | `af` | Arabic | `ar` | Bengali | `bn` |
| Chinese (Simplified) | `zh-CN` | Chinese (Traditional) | `zh-TW` | Czech | `cs` |
| Danish | `da` | Dutch | `nl` | English | `en` |
| Finnish | `fi` | French | `fr` | German | `de` |
| Greek | `el` | Hebrew | `he` | Hindi | `hi` |
| Hungarian | `hu` | Indonesian | `id` | Italian | `it` |
| Japanese | `ja` | Korean | `ko` | Malay | `ms` |
| Norwegian | `no` | Polish | `pl` | Portuguese | `pt` |
| Romanian | `ro` | Russian | `ru` | Sinhala | `si` |
| Spanish | `es` | Swedish | `sv` | Tamil | `ta` |
| Telugu | `te` | Thai | `th` | Turkish | `tr` |
| Ukrainian | `uk` | Urdu | `ur` | Vietnamese | `vi` |

And 70+ more languages! Use the built-in search feature to find your language.

📤 Output Files

The script creates the following files:

original_document.pdf
├── original_document_si.txt    (Extracted source text - for reference)
├── original_document_en.txt    (Translated text - backup)
└── original_document_en.pdf    (Translated PDF - main output)
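
The naming pattern in the listing above (input name plus a language code suffix) can be reproduced with `pathlib`. This is a sketch inferred from the file names shown; the helper `output_paths` and its dictionary keys are hypothetical and not part of the tool.

```python
from pathlib import Path


def output_paths(pdf_path, source_lang, target_lang, output_dir=None):
    """Return the three output paths implied by the listing above."""
    pdf = Path(pdf_path)
    # The documented default output directory is the input file's directory.
    out_dir = Path(output_dir) if output_dir else pdf.parent
    stem = pdf.stem  # file name without the .pdf extension
    return {
        "source_txt": out_dir / f"{stem}_{source_lang}.txt",
        "translated_txt": out_dir / f"{stem}_{target_lang}.txt",
        "translated_pdf": out_dir / f"{stem}_{target_lang}.pdf",
    }


for label, path in output_paths("original_document.pdf", "si", "en").items():
    print(f"{label}: {path}")
```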

Features:

✅ Maintains exact structure and order of original
✅ Questions remain in same order
✅ Line breaks preserved
✅ Page markers maintained
📝 Exam Mode: Tight spacing, question detection (1., 2., a), b), etc.)
📄 Regular Mode: Standard paragraph formatting
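
To make the exam-mode idea concrete, here is one way question detection on markers such as `1.`, `2.`, `a)`, `b)` could work. The regular expressions and the `group_by_question` helper are assumptions for illustration; the script's actual detection rules are not documented here and may differ.

```python
import re

# Assumed patterns: "1. " style top-level questions and "a) " style sub-items.
NUMBERED = re.compile(r"^\s*\d+\.\s")
SUB_ITEM = re.compile(r"^\s*[a-z]\)\s")


def group_by_question(lines):
    """Group lines so each numbered question stays together with its sub-items."""
    groups = []
    for line in lines:
        # A numbered line starts a new group; so does the very first line.
        if NUMBERED.match(line) or not groups:
            groups.append([line])
        else:
            groups[-1].append(line)
    return groups


sample = ["Answer all questions.", "1. Define osmosis.", "a) Give one example.",
          "b) Name the membrane type.", "2. Explain diffusion."]
for group in group_by_question(sample):
    print(group)
```

A layout step could then treat each group as a unit that should not be broken across a page.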

Output Format Options:

--format pdf (default) - Creates PDF + TXT files
--format txt - Creates only TXT files
--format both - Creates both formats explicitly

📚 Examples

Example 1: Translate Sinhala Exam to English (GUI Mode)

python pdf_translator.py
# Select file → Select Sinhala → Select English → Select YES for exam paper

Example 2: Translate French Document to Spanish

python pdf_translator.py french_doc.pdf --source-lang fr --target-lang es

Example 3: Translate Scanned Arabic PDF to English

python pdf_translator.py scanned_arabic.pdf --source-lang ar --target-lang en --ocr

Example 4: Batch Process with Command Line

# Translate multiple PDFs
python pdf_translator.py doc1.pdf --source-lang si --target-lang en --exam
python pdf_translator.py doc2.pdf --source-lang si --target-lang en --exam
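
For more than a handful of files, the repeated commands above can be driven from a short Python script. This is a sketch, not part of the repository: `build_command` and `translate_folder` are hypothetical helpers, and it assumes the script exits with a non-zero status on failure, which is not confirmed here.

```python
import subprocess
import sys
from pathlib import Path


def build_command(pdf, source_lang, target_lang, exam=False):
    """Build the command line for one PDF using the documented flags."""
    cmd = [sys.executable, "pdf_translator.py", str(pdf),
           "--source-lang", source_lang, "--target-lang", target_lang]
    # Passing --exam or --no-exam explicitly skips the document-type dialog.
    cmd.append("--exam" if exam else "--no-exam")
    return cmd


def translate_folder(folder, source_lang, target_lang, exam=False):
    """Run the translator on every PDF in a folder; return the ones that failed."""
    failed = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        result = subprocess.run(build_command(pdf, source_lang, target_lang, exam))
        if result.returncode != 0:  # assumption: non-zero exit code means failure
            failed.append(pdf)
    return failed


print(" ".join(build_command("doc1.pdf", "si", "en", exam=True)[1:]))
```

Calling `translate_folder("./exams", "si", "en", exam=True)` from the repository folder would then process each PDF in turn.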

Example Output:

============================================================
Processing: exam_paper.pdf
============================================================

Checking Python packages...
✓ pdfplumber
✓ pdf2image
✓ Pillow
✓ pytesseract
✓ deep-translator
✓ reportlab
✓ All Python packages installed!

============================================================
Checking system dependencies...
============================================================
✓ Tesseract OCR found
✓ Poppler found
============================================================

Attempting to extract text from PDF...
  ✓ Extracted text from page 1
  ✓ Extracted text from page 2
  ...

✓ Saved Sinhala text to: exam_paper_si.txt

Translating text from Sinhala to English...
(Language codes: si → en)
  Translating lines 1-10/415... (attempt 1/3)
    ✓ Success!
  ...

✓ Saved English text to: exam_paper_en.txt

Creating PDF document...
  ✓ PDF created successfully!
✓ Saved English PDF to: exam_paper_en.pdf

============================================================
✅ Translation completed successfully!
============================================================

🔧 Troubleshooting

"No text extracted from PDF"

Problem: The PDF might be scanned/image-based.

Solution:

Install Tesseract and Poppler (see Installation section above)
Run with --ocr flag: python pdf_translator.py --ocr
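
The tool is documented as auto-detecting when OCR is needed, but the rule it uses is not described. One plausible heuristic, shown purely as an assumption, is to fall back to OCR when text extraction yields almost nothing per page. The function name `needs_ocr` and the 20-character threshold are made up for this sketch.

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is scanned, given the text extracted from each page.

    page_texts holds one entry per page; an entry may be None or empty
    when the extractor found no text layer on that page.
    """
    if not page_texts:
        return True
    total_chars = sum(len((text or "").strip()) for text in page_texts)
    # Hypothetical threshold: very little text per page suggests an image-only PDF.
    return total_chars / len(page_texts) < min_chars_per_page


print(needs_ocr(["", None, "   "]))
print(needs_ocr(["This page has a real text layer with plenty of characters."]))
```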

"Tesseract OCR not found" or "Poppler not found"

Problem: OCR dependencies not installed or not in PATH.

Solution:

Check terminal output - The script will show you exactly what's missing
See the installation guide for your OS (Windows/macOS/Linux sections above)
Windows:
- Tesseract: Download from UB Mannheim
- Poppler: Download from Poppler Windows
- Add to PATH or place in common locations
macOS: brew install tesseract poppler
Linux (Ubuntu/Debian): sudo apt-get install tesseract-ocr poppler-utils (see the Linux section above for Fedora/RHEL and Arch)
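
To confirm whether the OCR tools are visible on PATH before running the translator, a quick check with `shutil.which` can help. This is a sketch rather than the script's own check: `missing_tools` is a hypothetical helper, and it looks for `pdftoppm` because that is one of the executables Poppler installs.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the executables from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Tesseract and Poppler executables found.")
```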

"Language data not found for [language]"

Problem: Tesseract doesn't have the language data file.

Solution:

Windows: Download from tessdata and copy to C:\Program Files\Tesseract-OCR\tessdata\
macOS: brew install tesseract-lang
Linux (Ubuntu/Debian): sudo apt-get install tesseract-ocr-[langcode] (e.g., tesseract-ocr-sin for Sinhala); Fedora and Arch package names are listed in the Linux section above
Verify with: tesseract --list-langs
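
The same verification can be done from Python by parsing the output of `tesseract --list-langs`. This is a minimal sketch: `parse_langs` and `installed_langs` are hypothetical helpers, and the header line format is assumed from typical Tesseract output. Both output streams are read because the listing does not always go to standard output.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line, which starts with "List of available languages".
    return {line for line in lines
            if not line.lower().startswith("list of available languages")}


def installed_langs():
    """Run Tesseract and return the set of installed language codes."""
    result = subprocess.run(["tesseract", "--list-langs"],
                            capture_output=True, text=True, check=True)
    return parse_langs(result.stdout + result.stderr)


sample_output = 'List of available languages in "/usr/share/tessdata/" (3):\neng\nosd\nsin\n'
print(sorted(parse_langs(sample_output)))
```

With Tesseract installed, `"sin" in installed_langs()` tells you whether Sinhala data is available.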

"Translation failed" or API errors

Problem: Rate limiting or network issues.

Solution:

The script has built-in retry logic (3 attempts with delays)
Wait a few minutes and try again
Check your internet connection
For very large documents, the script automatically batches translations
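
The retry and batching behavior described above can be pictured roughly as follows. This is a sketch under stated assumptions, not the script's implementation: the batch size of 10 and the 3 attempts come from the example output and the notes above, while the function name, the delay schedule, and the injectable `translate_batch` callable are invented for illustration.

```python
import time


def translate_in_batches(lines, translate_batch, batch_size=10,
                         attempts=3, delay=2.0, sleep=time.sleep):
    """Translate lines in fixed-size batches, retrying each failed batch.

    translate_batch is any callable that takes a list of strings and
    returns the translated list; it stands in for the real translator.
    """
    translated = []
    for start in range(0, len(lines), batch_size):
        batch = lines[start:start + batch_size]
        for attempt in range(1, attempts + 1):
            try:
                translated.extend(translate_batch(batch))
                break  # this batch succeeded, move on to the next one
            except Exception:
                if attempt == attempts:
                    raise  # give up after the last attempt
                sleep(delay * attempt)  # assumed: wait longer after each failure
    return translated


# Demo with a stand-in translator that just upper-cases its input.
print(translate_in_batches(["one", "two", "three"],
                           lambda batch: [s.upper() for s in batch],
                           batch_size=2))
```

Passing the translator in as a callable keeps the retry logic testable without a network connection.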

"Questions split across pages" (Exam papers)

Problem: Questions breaking across page boundaries.

Solution:

Select "YES" when asked if it's an exam paper
Or use --exam flag: python pdf_translator.py document.pdf --exam

Missing Python packages

Problem: ImportError or module not found.

Solution:

# The script auto-installs on first run, but you can manually install:
pip install -r requirements.txt

# Or install individually:
pip install pdfplumber pdf2image Pillow pytesseract deep-translator reportlab
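
To see which of these packages are actually missing before installing anything, the import names can be probed with `importlib`. This is a sketch and not the script's auto-install code: `missing_packages` and `install` are hypothetical helpers, and the mapping reflects the fact that some pip names differ from their import names (for example Pillow is imported as `PIL`).

```python
import importlib.util
import subprocess
import sys

# pip package name -> module name used in `import`
PACKAGES = {
    "pdfplumber": "pdfplumber",
    "pdf2image": "pdf2image",
    "Pillow": "PIL",
    "pytesseract": "pytesseract",
    "deep-translator": "deep_translator",
    "reportlab": "reportlab",
}


def missing_packages(packages=PACKAGES):
    """Return the pip names of packages whose module cannot be found."""
    return [pip_name for pip_name, module in packages.items()
            if importlib.util.find_spec(module) is None]


def install(pip_names):
    """Install packages with the pip that belongs to the running interpreter."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", *pip_names])


print("Missing:", missing_packages() or "none")
```

Using `sys.executable -m pip` avoids installing into a different Python than the one that runs the translator.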

Permission errors on Linux/macOS

Problem: Permission denied when installing packages.

Solution:

# Use pip with --user flag
pip install --user -r requirements.txt

# Or use virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Poor OCR quality

Problem: OCR text has many errors.

Solution:

Use higher quality PDF scans
Try adjusting DPI in the code (change dpi=300 to dpi=400)
Make sure correct language data is installed
Preprocess images (enhance contrast, remove noise)

⚠️ Important Notes

Internet Required: Translation requires internet connection
OCR Optional: Only needed for scanned PDFs - regular text PDFs work without it
Free Service: Uses Google Translate (may have rate limits for very large documents)
Best Results: High-quality PDF scans produce best OCR results

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Google Translate for the translation API
Tesseract OCR for OCR capabilities
pdfplumber for PDF text extraction
All the amazing open-source libraries used in this project

📧 Support

If you encounter any issues:

Check the Troubleshooting section above
Look at the terminal output - it shows exactly what's missing
Follow the installation guide for your OS
Create an issue with:
- Your operating system
- Python version (python --version)
- Error message (copy from terminal)
- Steps to reproduce

⭐ Star History

If this tool helped you, please consider giving it a star! ⭐

Made with ❤️ for the community
