A powerful Python tool to translate PDF documents between 100+ languages with support for both regular and scanned PDFs using OCR technology.
- 100+ Languages - Translate between any language pair (Sinhala, English, French, Spanish, Chinese, Arabic, etc.)
- Regular PDFs - Extract and translate text-based PDFs
- Scanned PDFs - OCR support for image-based/scanned documents
- Smart Formatting - Preserves document structure and formatting
- Exam Paper Mode - Special formatting for exams/tests (keeps questions together)
- Language Search - Easy-to-use searchable language selector
- GUI Interface - User-friendly file and folder pickers
- Auto-Install - Automatically installs Python dependencies
- Free - Uses Google Translate API (no API key required)
- Retry Logic - Automatically retries failed translations
# Clone the repository
git clone https://github.com/Monster-ZeroX/PDF-Translator.git
cd PDF-Translator
# Run the translator (Python dependencies auto-install on first run!)
python pdf_translator.py
The script will:
- Auto-install any missing Python packages
- Open a file picker to select your PDF
- Ask for source language (the PDF's current language)
- Ask for target language (what you want to translate to)
- Ask if it's an exam paper (for better formatting)
- Process and save the translated PDF!
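The auto-install step can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the `REQUIRED` mapping and the function names `missing_packages` and `install` are assumptions, with the package names taken from the package list shown in this README.

```python
import importlib.util
import subprocess
import sys

# Maps import name -> pip package name. The pairs follow the packages named
# in this README; the mapping the real script uses is an assumption.
REQUIRED = {
    "pdfplumber": "pdfplumber",
    "pdf2image": "pdf2image",
    "PIL": "Pillow",
    "pytesseract": "pytesseract",
    "deep_translator": "deep-translator",
    "reportlab": "reportlab",
}


def missing_packages(required):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


def install(pip_names):
    """Install packages using the same interpreter that runs this script."""
    if pip_names:
        subprocess.check_call([sys.executable, "-m", "pip", "install", *pip_names])


if __name__ == "__main__":
    todo = missing_packages(REQUIRED)
    print("Missing:", todo or "nothing")
    # install(todo)  # uncomment to actually install the missing packages
```

Calling pip through `sys.executable -m pip` keeps the installation tied to the interpreter that is running the script, which avoids installing into a different Python on systems with several versions.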
- Download Python from python.org
  - Important: Check "Add Python to PATH" during installation
- Verify installation:
python --version
# Clone or download this repository
cd DocTrans
# That's it! Python packages will auto-install when you run the script
# (Optional) Manually install: pip install -r requirements.txt
Tesseract OCR and Poppler are only needed for scanned PDFs. Regular text-based PDFs work perfectly fine without them!
Tesseract OCR:
- Download from UB Mannheim Tesseract
- Install the .exe file
- During installation, select language data for your source language (e.g., Sinhala, Arabic, etc.)
- Add to PATH or the script will show you instructions if needed
Poppler:
- Download from Poppler Windows
- Extract to C:\poppler (or any location)
- Add C:\poppler\Library\bin to your system PATH:
  - Right-click "This PC" → Properties → Advanced System Settings
  - Environment Variables → System Variables → Path → Edit
  - Add new entry: C:\poppler\Library\bin
# Simple way - just run it!
python pdf_translator.py
# Or use the batch file
run_translator.bat
macOS:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install python
python3 --version
cd DocTrans
# That's it! Python packages will auto-install when you run the script
# (Optional) Manually install: pip3 install -r requirements.txt
Tesseract OCR and Poppler are only needed for scanned PDFs. Regular text-based PDFs work perfectly fine without them!
# Install Tesseract OCR
brew install tesseract
# Install Poppler
brew install poppler
# Install additional language data for Tesseract (if needed)
# The script will show you instructions if specific language data is missing
python3 pdf_translator.py
Linux:
# Check Python version
python3 --version
# If not installed (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install python3 python3-pip
# For Fedora/RHEL
sudo dnf install python3 python3-pip
cd DocTrans
# That's it! Python packages will auto-install when you run the script
# (Optional) Manually install: pip3 install -r requirements.txt
Tesseract OCR and Poppler are only needed for scanned PDFs. Regular text-based PDFs work perfectly fine without them!
Ubuntu/Debian:
# Install Tesseract OCR
sudo apt-get update
sudo apt-get install tesseract-ocr
# Install Poppler
sudo apt-get install poppler-utils
# Install language data (example for Sinhala)
sudo apt-get install tesseract-ocr-sin
Fedora/RHEL:
# Install Tesseract
sudo dnf install tesseract
# Install Poppler
sudo dnf install poppler-utils
# Install language data (example for Sinhala)
sudo dnf install tesseract-langpack-sin
Arch Linux:
# Install Tesseract
sudo pacman -S tesseract
# Install Poppler
sudo pacman -S poppler
# Install language data (example for Sinhala)
sudo pacman -S tesseract-data-sin
python3 pdf_translator.py
python pdf_translator.py
This will:
- Automatically install missing Python packages
- Open file picker for PDF selection
- Show language selector for source language
- Show language selector for target language
- Ask about document type (exam vs regular)
- Process and save the translation!
# Specify PDF file directly
python pdf_translator.py document.pdf
# Specify output directory
python pdf_translator.py document.pdf -o /path/to/output
# Force OCR mode (for scanned PDFs)
python pdf_translator.py document.pdf --ocr
# Specify languages directly (skip language selector)
python pdf_translator.py document.pdf --source-lang si --target-lang en
# Format as exam paper (skip dialog)
python pdf_translator.py document.pdf --exam
# Save as both PDF and TXT
python pdf_translator.py document.pdf --format both
# Skip dependency checks (if you know what you're doing)
python pdf_translator.py document.pdf --skip-checks
# Translate Sinhala exam to English, save as PDF
python pdf_translator.py exam.pdf -o ./translated --source-lang si --target-lang en --exam --format pdf

| Argument | Description |
|---|---|
| pdf_path | Path to PDF file (optional - opens file picker if not provided) |
| -o, --output | Output directory (default: same as input) |
| --ocr | Force OCR mode for scanned PDFs |
| --no-auto | Disable auto-detection of OCR need |
| --gui | Force GUI file picker even if path is provided |
| --format | Output format: pdf (default), txt, or both |
| --exam | Format as exam paper (skip dialog) |
| --no-exam | Format as regular document (skip dialog) |
| --source-lang | Source language code (e.g., si, fr, de) |
| --target-lang | Target language code (e.g., en, es, zh-CN) |
| --skip-checks | Skip dependency checks (not recommended) |
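For readers who want to see how such a command line could be declared, here is a minimal `argparse` sketch that mirrors the table above. The flag names and defaults come from the table; the function name `build_parser`, the help strings, and the choice to model `--exam`/`--no-exam` as one three-state option are assumptions and may differ from the actual script.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pdf_translator.py")
    # Optional positional: when omitted, the tool is documented to open a file picker.
    parser.add_argument("pdf_path", nargs="?", help="Path to PDF file")
    parser.add_argument("-o", "--output", help="Output directory")
    parser.add_argument("--ocr", action="store_true", help="Force OCR mode")
    parser.add_argument("--no-auto", action="store_true",
                        help="Disable auto-detection of OCR need")
    parser.add_argument("--gui", action="store_true", help="Force GUI file picker")
    parser.add_argument("--format", choices=["pdf", "txt", "both"], default="pdf",
                        help="Output format")
    # --exam / --no-exam share one destination; None means "ask in a dialog".
    exam = parser.add_mutually_exclusive_group()
    exam.add_argument("--exam", dest="exam", action="store_true", default=None)
    exam.add_argument("--no-exam", dest="exam", action="store_false", default=None)
    parser.add_argument("--source-lang", help="Source language code, e.g. si")
    parser.add_argument("--target-lang", help="Target language code, e.g. en")
    parser.add_argument("--skip-checks", action="store_true",
                        help="Skip dependency checks")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Keeping `exam` as `None` when neither flag is given lets the caller distinguish "user chose regular mode" from "user has not chosen yet", which matches the documented behaviour of skipping the dialog only when a flag is present.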
The tool supports 100+ languages including:
| Language | Code | Language | Code | Language | Code |
|---|---|---|---|---|---|
| Afrikaans | af | Arabic | ar | Bengali | bn |
| Chinese (Simplified) | zh-CN | Chinese (Traditional) | zh-TW | Czech | cs |
| Danish | da | Dutch | nl | English | en |
| Finnish | fi | French | fr | German | de |
| Greek | el | Hebrew | he | Hindi | hi |
| Hungarian | hu | Indonesian | id | Italian | it |
| Japanese | ja | Korean | ko | Malay | ms |
| Norwegian | no | Polish | pl | Portuguese | pt |
| Romanian | ro | Russian | ru | Sinhala | si |
| Spanish | es | Swedish | sv | Tamil | ta |
| Telugu | te | Thai | th | Turkish | tr |
| Ukrainian | uk | Urdu | ur | Vietnamese | vi |
And 70+ more languages! Use the built-in search feature to find your language.
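A searchable language selector can be approximated with a simple case-insensitive match over names and codes. The sketch below is an assumption about how such a search might work, not the tool's actual implementation; `LANGUAGES` holds only a few entries from the table above and `search_languages` is a hypothetical name.

```python
# A small subset of the table above; the real tool supports many more languages.
LANGUAGES = {
    "Arabic": "ar",
    "Chinese (Simplified)": "zh-CN",
    "Chinese (Traditional)": "zh-TW",
    "English": "en",
    "French": "fr",
    "Sinhala": "si",
    "Spanish": "es",
}


def search_languages(query, languages=LANGUAGES):
    """Return (name, code) pairs whose name contains the query or whose code equals it."""
    q = query.strip().lower()
    return sorted(
        (name, code)
        for name, code in languages.items()
        if q in name.lower() or q == code.lower()
    )


if __name__ == "__main__":
    for name, code in search_languages("chin"):
        print(f"{name} ({code})")
```

Note that an empty query matches every entry, which is convenient for showing the full list before the user starts typing.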
The script creates the following files:
original_document.pdf
├── original_document_si.txt (Extracted source text - for reference)
├── original_document_en.txt (Translated text - backup)
└── original_document_en.pdf (Translated PDF - main output)
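The naming pattern above (input stem plus a language-code suffix) can be expressed with `pathlib`. This is a sketch under the assumption that the suffix is always the language code passed on the command line; `output_paths` is a hypothetical helper, not a function from the script.

```python
from pathlib import Path


def output_paths(pdf_path, source_lang, target_lang, output_dir=None):
    """Derive output file paths from the input PDF name and language codes."""
    pdf = Path(pdf_path)
    # Default to the input file's directory, as documented for -o/--output.
    out_dir = Path(output_dir) if output_dir else pdf.parent
    stem = pdf.stem
    return {
        "source_txt": out_dir / f"{stem}_{source_lang}.txt",
        "translated_txt": out_dir / f"{stem}_{target_lang}.txt",
        "translated_pdf": out_dir / f"{stem}_{target_lang}.pdf",
    }


if __name__ == "__main__":
    for label, path in output_paths("original_document.pdf", "si", "en").items():
        print(f"{label}: {path}")
```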
Features:
- Maintains exact structure and order of original
- Questions remain in same order
- Line breaks preserved
- Page markers maintained
- Exam Mode: Tight spacing, question detection (1., 2., a), b), etc.)
- Regular Mode: Standard paragraph formatting
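Question detection of the kind mentioned for Exam Mode could be done with two small regular expressions. The sketch below is an assumption about the approach, not the script's actual logic: it treats lines starting with `1.`, `2.`, ... as new questions and keeps every following line, including sub-parts such as `a)` and `b)`, in the same block. The names `NUMBERED`, `LETTERED` and `group_questions` are hypothetical.

```python
import re

# "1. ", "12. " at the start of a line (a digit run, a dot, then whitespace).
NUMBERED = re.compile(r"^\s*\d+\.\s")
# "a) ", "B) " at the start of a line.
LETTERED = re.compile(r"^\s*[a-z]\)\s", re.IGNORECASE)


def group_questions(lines):
    """Group lines into blocks, starting a new block at each numbered question."""
    blocks = []
    for line in lines:
        if NUMBERED.match(line) or not blocks:
            blocks.append([line])
        else:
            # Sub-parts and continuation lines stay with the current question.
            blocks[-1].append(line)
    return blocks


if __name__ == "__main__":
    sample = [
        "1. What is OCR?",
        "a) A file format",
        "b) Text recognition from images",
        "2. Name one PDF library.",
    ]
    for block in group_questions(sample):
        print(block)
```

Once lines are grouped this way, each block can be handed to the PDF layout step as a unit so that a question and its options are not split across pages; in reportlab, for example, the `KeepTogether` flowable serves that purpose.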
Output Format Options:
- --format pdf (default) - Creates PDF + TXT files
- --format txt - Creates only TXT files
- --format both - Creates both formats explicitly
python pdf_translator.py
# Select file → Select Sinhala → Select English → Select YES for exam paper
python pdf_translator.py french_doc.pdf --source-lang fr --target-lang es
python pdf_translator.py scanned_arabic.pdf --source-lang ar --target-lang en --ocr
# Translate multiple PDFs
python pdf_translator.py doc1.pdf --source-lang si --target-lang en --exam
python pdf_translator.py doc2.pdf --source-lang si --target-lang en --exam
============================================================
Processing: exam_paper.pdf
============================================================
Checking Python packages...
✓ pdfplumber
✓ pdf2image
✓ Pillow
✓ pytesseract
✓ deep-translator
✓ reportlab
✓ All Python packages installed!
============================================================
Checking system dependencies...
============================================================
✓ Tesseract OCR found
✓ Poppler found
============================================================
Attempting to extract text from PDF...
✓ Extracted text from page 1
✓ Extracted text from page 2
...
✓ Saved Sinhala text to: exam_paper_si.txt
Translating text from Sinhala to English...
(Language codes: si → en)
Translating lines 1-10/415... (attempt 1/3)
✓ Success!
...
✓ Saved English text to: exam_paper_en.txt
Creating PDF document...
✓ PDF created successfully!
✓ Saved English PDF to: exam_paper_en.pdf
============================================================
✅ Translation completed successfully!
============================================================
Problem: The PDF might be scanned/image-based.
Solution:
- Install Tesseract and Poppler (see Installation section above)
- Run with --ocr flag: python pdf_translator.py --ocr
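One plausible way to decide automatically whether a PDF needs OCR is to look at how much text normal extraction returns. The heuristic below is purely an assumption for illustration; the function name `needs_ocr` and the threshold of 20 characters per page are invented here and are not taken from the script.

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is scanned, based on text extracted from each page.

    page_texts holds one entry per page: the extracted string, or None when
    the extractor found nothing on that page.
    """
    if not page_texts:
        return True
    total = sum(len((text or "").strip()) for text in page_texts)
    # Very little text per page suggests the pages are images.
    return total / len(page_texts) < min_chars_per_page


if __name__ == "__main__":
    print(needs_ocr(["", None, "   "]))
    print(needs_ocr(["This page has plenty of extractable text."]))
```

The per-page strings would typically come from a text extractor such as pdfplumber. A heuristic like this can misfire on pages that are legitimately almost empty, which is one reason a manual override such as `--ocr` is useful.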
Problem: OCR dependencies not installed or not in PATH.
Solution:
- Check terminal output - The script will show you exactly what's missing
- See README installation guide for your OS (Windows/macOS/Linux sections above)
- Windows:
  - Tesseract: Download from UB Mannheim
  - Poppler: Download from Poppler Windows
  - Add to PATH or place in common locations
- macOS: brew install tesseract poppler
- Linux: sudo apt-get install tesseract-ocr poppler-utils
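To confirm from Python whether the OCR tools are reachable on PATH, a check along these lines can help. It is a sketch, not the script's own dependency check: `find_missing_tools` is a hypothetical name, and `pdftoppm` is used as the Poppler indicator because it is one of the command-line programs Poppler ships.

```python
import shutil


def find_missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Tesseract and Poppler found")
```

If a tool is installed but still reported as missing, the directory that contains it has most likely not been added to PATH, or the terminal was opened before PATH was changed.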
Problem: Tesseract doesn't have the language data file.
Solution:
- Windows: Download from tessdata and copy to C:\Program Files\Tesseract-OCR\tessdata\
- macOS: brew install tesseract-lang
- Linux: sudo apt-get install tesseract-ocr-[langcode] (e.g., tesseract-ocr-sin for Sinhala)
- Verify with: tesseract --list-langs
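The same verification can be scripted. The sketch below runs `tesseract --list-langs` and parses the result; the helper names `parse_langs` and `installed_langs` are hypothetical, and the parsing assumes the usual output shape of one header line followed by one language code per line.

```python
import shutil
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Drop the header line, e.g. 'List of available languages (3):'.
    return [
        line for line in lines
        if not line.lower().startswith("list of available languages")
    ]


def installed_langs():
    """Return installed Tesseract languages, or None if Tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Some versions print the list to stderr, so both streams are read here.
    return parse_langs(result.stdout + result.stderr)


if __name__ == "__main__":
    langs = installed_langs()
    print("Tesseract not found" if langs is None else langs)
```

Keep in mind that Tesseract uses its own three-letter codes (for example `sin` for Sinhala), which differ from the two-letter translation codes such as `si` used with `--source-lang`.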
Problem: Rate limiting or network issues.
Solution:
- The script has built-in retry logic (3 attempts with delays)
- Wait a few minutes and try again
- Check your internet connection
- For very large documents, the script automatically batches translations
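The retry and batching behaviour described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the script's code: the helper names `batches` and `with_retries`, the batch size of 10 (suggested by the "lines 1-10" message in the sample output), and the growing delay are all choices made for this example.

```python
import time


def batches(items, size=10):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def with_retries(func, attempts=3, delay=2.0, sleep=time.sleep):
    """Call func(); on failure wait and retry, up to `attempts` tries in total."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as error:  # a broad catch keeps the sketch simple
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt.
                sleep(delay * attempt)
    raise last_error


if __name__ == "__main__":
    lines = [f"line {n}" for n in range(1, 26)]
    for batch in batches(lines, 10):
        # str.upper stands in for a real translation call here.
        result = with_retries(lambda: [text.upper() for text in batch])
        print(len(result), "lines processed")
```

In real use, the function passed to `with_retries` would wrap the network call to the translation service, so a temporary rate limit or connection error only delays one batch instead of aborting the whole document.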
Problem: Questions breaking across page boundaries.
Solution:
- Select "YES" when asked if it's an exam paper
- Or use --exam flag: python pdf_translator.py document.pdf --exam
Problem: ImportError or module not found.
Solution:
# The script auto-installs on first run, but you can manually install:
pip install -r requirements.txt
# Or install individually:
pip install pdfplumber pdf2image Pillow pytesseract deep-translator reportlab
Problem: Permission denied when installing packages.
Solution:
# Use pip with --user flag
pip install --user -r requirements.txt
# Or use virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Problem: OCR text has many errors.
Solution:
- Use higher quality PDF scans
- Try adjusting DPI in the code (change dpi=300 to dpi=400)
- Make sure correct language data is installed
- Preprocess images (enhance contrast, remove noise)
- Internet Required: Translation requires internet connection
- OCR Optional: Only needed for scanned PDFs - regular text PDFs work without it
- Free Service: Uses Google Translate (may have rate limits for very large documents)
- Best Results: High-quality PDF scans produce best OCR results
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Google Translate for the translation API
- Tesseract OCR for OCR capabilities
- pdfplumber for PDF text extraction
- All the amazing open-source libraries used in this project
If you encounter any issues:
- Check the Troubleshooting section above
- Look at the terminal output - it shows exactly what's missing
- Follow the installation guide for your OS
- Create an issue with:
  - Your operating system
  - Python version (python --version)
  - Error message (copy from terminal)
  - Steps to reproduce
If this tool helped you, please consider giving it a star! ⭐
Made with ❤️ for the community