OCR Text Extractor

An OCR (Optical Character Recognition) tool that uses the Google Drive API to extract text from images, then cleans the extracted text and optionally combines it into a single file.

Version 1.0.0: Modular architecture for better maintainability and extensibility!

Features

Multiple Image Format Support: JPG, JPEG, PNG, GIF, BMP, and TIFF
Automatic Text Cleaning: Removes metadata and cleans extracted text
Flexible Text Combination: Combine texts with or without file headers
Comprehensive Logging: Detailed processing logs with colored output
Error Handling: Robust error handling with detailed reporting
Configurable Processing: Command-line options for customization
Modular Architecture: Clean, maintainable code structure with 8 focused modules
No Duplicate Logging: Clean output without repetitive messages
Progress Tracking: Real-time progress indicators during processing

Prerequisites

Python 3.6+ (recommended: Python 3.8+)
Google Drive API credentials
Internet connection for API access

Setup

Install Python Dependencies

pip3 install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib oauth2client

Get Google Drive API Credentials

- Follow the Google Drive API Python Quickstart guide
- Download the credentials.json file
- Place credentials.json in the project directory

Prepare Images

- Create an images folder in the project directory
- Place your images in the images folder
- Supported formats: JPG, JPEG, PNG, GIF, BMP, TIFF
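
To check which files in the images folder would be picked up, a short sketch like the one below can help. It is not taken from the project's code: the function name find_images and the constant SUPPORTED_EXTENSIONS are hypothetical, and the extension set simply mirrors the supported formats listed above.

```python
from pathlib import Path

# Mirrors the supported formats listed above; the constant name is illustrative.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff"}


def find_images(folder="images", extensions=SUPPORTED_EXTENSIONS):
    """Return the supported image files in `folder`, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        return []
    # Compare extensions case-insensitively so that "scan.JPG" is accepted too.
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        path for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in wanted
    )


# Print the files that would be processed from the default images folder.
for image in find_images():
    print(image.name)
```

An empty result usually means the folder is missing or the files use an extension outside the supported list.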

Preparing Images

Converting PDF Files to Images

If you need to convert PDF documents to images before processing:

PDF-XChange Editor (Free): Recommended free solution for converting PDF pages to supported image formats
- Additional feature: Crop unwanted sections (headers, footers, page numbers) from all images
Online PDF Converters: Various web-based conversion tools available
Desktop Applications: Adobe Acrobat, other PDF utilities
Alternative Tools: Any reliable PDF-to-image conversion software

Image Placement

After conversion, place all resulting images in the images folder within your project directory.

Usage

Basic Usage

python main.py

Advanced Usage with Options

# Skip combining processed texts (by default, processed texts are combined and raw texts are not)
python main.py --no-combine-texts

# Include raw text combination
python main.py --combine-raw

# Include file headers in combined files
python main.py --include-headers

# Combine both raw and processed texts with headers
python main.py --combine-raw --include-headers

# Specify custom credentials file
python main.py --credentials my_credentials.json

# Support only specific image formats
python main.py --extensions .jpg .jpeg .png

# Enable verbose output for detailed logging
python main.py --verbose

# Enable file logging (creates ocr_processing.log)
python main.py --enable-file-logging

# Check version information
python main.py --version

# Combination example: verbose mode with raw text combination and headers
python main.py --verbose --combine-raw --include-headers

Command Line Options

--credentials PATH: Path to Google credentials JSON file (default: credentials.json)
--no-combine-texts: Do not combine processed text files
--combine-raw: Also combine raw text files
--include-headers: Include file headers in combined files
--extensions LIST: Supported image file extensions
--verbose: Enable verbose logging output with detailed information
--enable-file-logging: Enable logging to file (creates ocr_processing.log)
--version: Show version information and exit
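
For readers who want to see how options like these are typically wired up, here is a minimal argparse sketch. It is an assumption about structure, not a copy of the project's cli.py: the function name build_parser, the help strings, and the default extension list are illustrative, while the flag names and the credentials default come from the list above.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="OCR Text Extractor")
    parser.add_argument("--credentials", metavar="PATH", default="credentials.json",
                        help="Path to Google credentials JSON file")
    parser.add_argument("--no-combine-texts", action="store_true",
                        help="Do not combine processed text files")
    parser.add_argument("--combine-raw", action="store_true",
                        help="Also combine raw text files")
    parser.add_argument("--include-headers", action="store_true",
                        help="Include file headers in combined files")
    # The default list is an assumption based on the supported formats.
    parser.add_argument("--extensions", nargs="+", metavar="EXT",
                        default=[".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff"],
                        help="Supported image file extensions")
    parser.add_argument("--verbose", action="store_true",
                        help="Enable verbose logging output")
    parser.add_argument("--enable-file-logging", action="store_true",
                        help="Enable logging to file")
    parser.add_argument("--version", action="version", version="%(prog)s 1.0.0")
    return parser


# Parse an explicit argument list instead of sys.argv to keep the demo predictable.
args = build_parser().parse_args(["--combine-raw", "--include-headers"])
print(args.combine_raw, args.include_headers, args.no_combine_texts)
```

Boolean flags use store_true, so each one is off unless it is passed on the command line.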

Output Structure

project/
├── images/                # Input images
├── raw_texts/             # Raw OCR output
├── texts/                 # Cleaned OCR output
├── credentials.json       # Google API credentials (user-provided)
├── token.json             # OAuth token (auto-generated)
└── main.py                # Main script

Text Combination

The tool offers two combination modes:

Without Headers: Simple text concatenation with file separators
With Headers: Detailed file information and structured output

Combined files are saved with timestamps: combined_cleaned_TIMESTAMP.txt or combined_raw_TIMESTAMP.txt
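
The two modes can be pictured with the sketch below. It is a simplified illustration, not the project's text_processor.py: the function name combine_texts, the separator line, the header layout, and the timestamp format are all assumptions. Only the combined_cleaned_TIMESTAMP.txt and combined_raw_TIMESTAMP.txt naming pattern comes from the text above.

```python
from datetime import datetime
from pathlib import Path


def combine_texts(text_dir, output_dir, kind="cleaned", include_headers=False):
    """Concatenate the .txt files in `text_dir` into one timestamped file."""
    # Skip earlier combined files in case output_dir and text_dir are the same.
    files = sorted(
        path for path in Path(text_dir).glob("*.txt")
        if not path.name.startswith("combined_")
    )
    parts = []
    for index, path in enumerate(files, start=1):
        body = path.read_text(encoding="utf-8").strip()
        if include_headers:
            # Hypothetical header layout: position, file name, and length.
            header = f"=== File {index}/{len(files)}: {path.name} ({len(body)} characters) ==="
            parts.append(f"{header}\n{body}")
        else:
            parts.append(body)
    separator = "\n\n" + "-" * 40 + "\n\n"  # assumed separator between files
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
    out_path = Path(output_dir) / f"combined_{kind}_{timestamp}.txt"
    out_path.write_text(separator.join(parts) + "\n", encoding="utf-8")
    return out_path
```

Calling combine_texts("texts", ".", include_headers=True) would write one combined file for the cleaned texts, and passing kind="raw" with the raw_texts folder would do the same for the raw output.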

Error Handling

Comprehensive error logging
Graceful handling of API failures
Automatic retry mechanisms
Detailed error reporting
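
The README does not say how the retry mechanism works, so the following is only a generic sketch of one common approach, retrying with exponential backoff. The function name call_with_retry and all of its parameters are hypothetical.

```python
import time


def call_with_retry(func, max_attempts=3, base_delay=1.0,
                    retry_on=(Exception,), sleep=time.sleep):
    """Call `func`, retrying on failure with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on as exc:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the original error
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
```

An API call would be wrapped as call_with_retry(lambda: some_api_call(path)), where some_api_call stands in for whatever function performs the request. Passing a narrower retry_on tuple keeps programming errors from being retried.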

Project Structure

The application has been refactored into a modular architecture:

OCR/
├── images/                # Input images directory
├── raw_texts/             # Raw OCR output directory
├── texts/                 # Cleaned OCR output directory
├── __init__.py            # Package initialization
├── auth.py                # Google Drive authentication and service setup
├── cli.py                 # Command line interface and argument parsing
├── config.py              # Configuration classes and constants
├── credentials.json       # Google API credentials (user-provided)
├── logger.py              # Logging utilities with colored output
├── main.py                # Main entry point
├── ocr_processor.py       # Core OCR processing logic
├── PROJECT_STRUCTURE.md   # Architecture documentation
├── README.md              # User documentation
├── text_processor.py      # Text cleaning and combination utilities
└── token.json             # OAuth token (auto-generated)

For detailed information about the modular architecture, see PROJECT_STRUCTURE.md.

Authentication

On first run, the tool will:

Open a browser for Google OAuth
Request permission to access Google Drive
Save authentication token for future use

Troubleshooting

Credentials Error: Ensure credentials.json is in the project directory, or pass its location with --credentials
No Images Found: Check that the images folder exists and that the files use a supported extension
API Quota Exceeded: Wait and retry, or check your quotas in the Google Cloud Console
Permission Denied: Delete token.json and run the tool again to repeat the OAuth flow
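
Two of these checks are easy to script. The helpers below are a hypothetical sketch and not part of the tool: check_credentials reports whether the credentials file exists and parses as JSON, and reset_token removes the cached token so the next run repeats the OAuth flow.

```python
import json
from pathlib import Path


def check_credentials(path="credentials.json"):
    """Return a description of the problem, or None if the file looks usable."""
    file = Path(path)
    if not file.is_file():
        return f"{path} not found; place it in the project directory"
    try:
        json.loads(file.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return f"{path} is not valid JSON: {exc}"
    return None


def reset_token(path="token.json"):
    """Delete the cached OAuth token; return True if a file was removed."""
    token = Path(path)
    if token.exists():
        token.unlink()
        return True
    return False


problem = check_credentials()
print(problem or "credentials.json looks fine")
```

The JSON check only confirms that the file is well formed. It does not verify that Google will accept the credentials.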

Performance

Processing time depends on image size and API response time
Large images may take longer to process
Multiple files are processed sequentially
Progress tracking shows current file being processed
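
Sequential processing with a progress indicator can be sketched as follows. This is an illustration under assumptions: process_sequentially and the process_one callback are hypothetical names, and continuing after a failed file is a design choice made for the sketch, not documented behavior of the tool.

```python
import time


def process_sequentially(paths, process_one):
    """Process files one at a time, printing a progress line for each."""
    results = {}
    total = len(paths)
    for index, path in enumerate(paths, start=1):
        print(f"[{index}/{total}] Processing {path}")
        started = time.perf_counter()
        try:
            results[path] = process_one(path)
        except Exception as exc:
            # Record the failure and continue so one bad file does not stop the batch.
            results[path] = None
            print(f"    failed: {exc}")
        else:
            print(f"    done in {time.perf_counter() - started:.2f}s")
    return results


# Demo with a stand-in for the real OCR call.
process_sequentially(["page1.png", "page2.png"], lambda path: path.upper())
```

Because files are handled one after another, total run time grows roughly linearly with the number of images.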
