OCR-DLP Image Crawler & Dataset Generator

Create production-ready, GPT-4V-labeled datasets for OCR and Data-Loss-Prevention models in 3 minutes.

🚀 Quick Start

# 1 Install
git clone https://github.com/Jackmeson1/ocrdlp-lab && cd ocrdlp-lab
pip install -r requirements.txt          # Python 3.8+

# 2 Set API keys
export SERPER_API_KEY=...
export OPENAI_API_KEY=...

# 3 One-shot pipeline demo
python ocrdlp.py pipeline "invoice document" \
       --output-dir ./datasets/invoices --limit 50
open datasets/invoices/labels/invoice_dataset_labels.jsonl

Need details? Jump to Configuration • CLI Commands

🎯 Purpose

This is a CRAWLER APPLICATION that generates LABELED DATASETS for downstream model training. It crawls images from various sources and uses AI to generate comprehensive labels for training OCR, DLP, and document classification models.

Workflow: CRAWLER → LABELED DATASET → DOWNSTREAM MODEL TRAINING

Key Features

🔍 Multi-Engine Image Search: Serper, Google, Bing, DuckDuckGo integration
📥 Robust Image Download: Automated download with validation and error handling
🤖 AI-Powered Labeling: GPT-4V integration for intelligent document labeling
🏷️ Comprehensive Labels: Multi-purpose labels for OCR, DLP, and classification training
📁 Dataset Generation: Creates production-ready datasets with standard structure
⚡ CLI Interface: Professional command-line tool with subcommands
✅ Quality Validation: Built-in dataset quality checks
⏳ Rate-Limit Handling: Automatic retries with exponential backoff
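
This README does not show how the rate-limit handling works. As a rough illustration of retrying with exponential backoff, here is a minimal sketch; `retry_with_backoff`, `RateLimitError`, and the delay parameters are hypothetical names chosen for this example, not the project's actual API.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error standing in for an HTTP 429 response."""


def retry_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call func(), retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # The delay doubles on each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Up to 10% random jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` argument is injectable so the function can be exercised in tests without actually waiting.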

🚀 Quick Start

Prerequisites

Python 3.8+
OpenAI API key (for GPT-4V labeling)
Serper API key (for image search)

Installation

Clone the repository

git clone <repository-url>
cd ocrdlp-lab

Install dependencies
```
pip install -r requirements.txt
```

Set environment variables

# Windows
set SERPER_API_KEY=your_serper_api_key
set OPENAI_API_KEY=your_openai_api_key

# Linux/Mac
export SERPER_API_KEY=your_serper_api_key
export OPENAI_API_KEY=your_openai_api_key
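
Before running the CLI, it can help to confirm that both keys are visible to Python. The following is a small sketch, not part of the project; `missing_api_keys` is a hypothetical helper that only checks whether the two variables named above are set and non-empty.

```python
import os

# The two environment variables this README asks you to set.
REQUIRED_KEYS = ("SERPER_API_KEY", "OPENAI_API_KEY")


def missing_api_keys(environ=None):
    """Return the names of required API keys that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]
```

Calling `missing_api_keys()` with no arguments inspects the current process environment; an empty list means both keys are present. It does not check that the keys are valid.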

Verify Installation

python ocrdlp.py --help

📋 Dataset Generation Workflow

1. Search for Images

# Search for document images
python ocrdlp.py search "invoice documents" --engine serper --limit 100 --output urls.txt
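
Search results often contain duplicates or malformed entries. Assuming `urls.txt` holds one URL per line (the file format is not documented here), a cleanup pass could look like the sketch below; `clean_url_list` is a hypothetical helper, not part of the CLI.

```python
from urllib.parse import urlparse


def clean_url_list(lines):
    """Drop blanks, non-HTTP(S) entries, and duplicates while keeping order."""
    seen, cleaned = set(), []
    for line in lines:
        url = line.strip()
        parsed = urlparse(url)
        # Keep only absolute http/https URLs that include a host.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned
```

To apply it, read `urls.txt` with `open(...)`, pass the file object to `clean_url_list`, and write the result back one URL per line.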

2. Download Images

# Download images directly into the dataset
python ocrdlp.py download --urls-file urls.txt --output-dir ./datasets/invoice_dataset/images
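
Downloads from the web sometimes turn out to be HTML error pages saved with an image extension. How the tool validates files is not described in this README, so the sketch below shows one common approach as an assumption: checking the leading "magic" bytes. `sniff_image_type` and `find_invalid_images` are hypothetical helpers.

```python
from pathlib import Path

# Leading signature bytes for a few common image formats.
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(data: bytes):
    """Return a format name if data starts with a known signature, else None."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def find_invalid_images(image_dir):
    """List file names in image_dir whose first bytes match no known signature."""
    bad = []
    for path in sorted(Path(image_dir).iterdir()):
        if path.is_file():
            with open(path, "rb") as f:
                header = f.read(16)  # enough to cover every signature above
            if sniff_image_type(header) is None:
                bad.append(path.name)
    return bad
```

This only inspects file headers; a truncated or corrupted image with a valid header would still pass.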

3. Generate Dataset with Labels

# Create label directory
mkdir -p datasets/invoice_dataset/labels

# Generate comprehensive labels
python ocrdlp.py classify datasets/invoice_dataset/images --output datasets/invoice_dataset/labels/invoice_dataset_labels.jsonl

4. Validate Dataset Quality

# Validate generated dataset
python ocrdlp.py validate datasets/invoice_dataset/labels/invoice_dataset_labels.jsonl
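
The checks performed by `validate` are not listed in this README. For readers who want an independent sanity check on a labels file, here is a minimal sketch that verifies each line is a JSON object containing a set of expected fields. The field names come from the Label Schema section; treating them as required, and the helper name `check_label_lines`, are assumptions.

```python
import json

# Field names from the Label Schema section; requiring all of them
# is an assumption made for this sketch.
REQUIRED_FIELDS = ("document_category", "ocr_difficulty", "text_clarity",
                   "language_primary", "sensitive_data_types", "image_quality")


def check_label_lines(lines):
    """Return (valid_count, problems) for an iterable of JSONL lines."""
    valid, problems = 0, []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        missing = [key for key in REQUIRED_FIELDS if key not in record]
        if missing:
            problems.append(f"line {lineno}: missing {', '.join(missing)}")
        else:
            valid += 1
    return valid, problems
```

Passing an open file object works directly, since iterating over a text file yields its lines.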

🏗️ Generated Dataset Structure

datasets/
└── invoice_dataset/
    ├── images/              # Raw images for training
    │   ├── image_001.jpg
    │   ├── image_002.png
    │   └── ...
    ├── labels/              # AI-generated labels
    │   ├── <dataset>_labels.jsonl     # Comprehensive labels
    │   └── <dataset>_labels_summary.md       # Dataset statistics
    └── README.md            # Usage instructions for ML engineers

📊 Label Schema

Each image gets comprehensive labels for multiple downstream use cases:

Document Classification: document_category, document_subcategory
OCR Training: ocr_difficulty, text_clarity, language_primary
DLP Training: sensitive_data_types, testing_scenarios
Quality Assessment: image_quality, background_complexity
Processing Hints: recommended_preprocessing, challenge_factors
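
Since every record carries the fields listed above, a quick way to inspect a dataset is to tally the values of one field. This is a minimal sketch, assuming records are plain dicts loaded from the JSONL file; `field_distribution` is a hypothetical helper, and treating `sensitive_data_types` as a list of strings is an assumption about the schema.

```python
from collections import Counter


def field_distribution(labels, field):
    """Count how often each value of `field` appears across label records."""
    counts = Counter()
    for label in labels:
        value = label.get(field)
        if isinstance(value, list):
            # List-valued fields contribute one count per entry.
            counts.update(value)
        elif value is not None:
            counts[value] += 1
    return counts
```

For example, `field_distribution(labels, "document_category")` gives the class balance, which is worth checking before training a classifier.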

🎯 Downstream Model Usage

OCR Model Training

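import json

def load_ocr_dataset(dataset_path, dataset_name):
    # Labels live in <dataset_path>/labels/<dataset_name>_labels.jsonl, one JSON object per line.
    labels_path = f"{dataset_path}/labels/{dataset_name}_labels.jsonl"
    with open(labels_path, 'r', encoding='utf-8') as f:
        labels = [json.loads(line) for line in f if line.strip()]

    # Keep only the fields relevant to OCR training.
    return [{
        'image_path': label['_file_info']['file_path'],
        'difficulty': label['ocr_difficulty'],
        'text_clarity': label['text_clarity'],
        'language': label['language_primary']
    } for label in labels]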

DLP Model Training

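import json

def load_dlp_dataset(dataset_path, dataset_name):
    # Labels live in <dataset_path>/labels/<dataset_name>_labels.jsonl, one JSON object per line.
    labels_path = f"{dataset_path}/labels/{dataset_name}_labels.jsonl"
    with open(labels_path, 'r', encoding='utf-8') as f:
        labels = [json.loads(line) for line in f if line.strip()]

    # Keep only the fields relevant to DLP training.
    return [{
        'image_path': label['_file_info']['file_path'],
        'sensitive_data': label['sensitive_data_types'],
        'document_type': label['document_category']
    } for label in labels]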
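
Once examples are loaded, most training setups need a train/validation split. The README does not prescribe one, so the following is only a sketch of one option: hashing each `image_path` so the assignment stays stable when the dataset is regenerated or extended. `assign_split` and `split_examples` are hypothetical helpers.

```python
import hashlib


def assign_split(image_path, val_fraction=0.2):
    """Deterministically map an image path to 'train' or 'val'."""
    digest = hashlib.sha256(image_path.encode("utf-8")).digest()
    # Turn the first 4 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "val" if bucket < val_fraction else "train"


def split_examples(examples, val_fraction=0.2):
    """Partition dicts that have an 'image_path' key into train and val lists."""
    splits = {"train": [], "val": []}
    for example in examples:
        splits[assign_split(example["image_path"], val_fraction)].append(example)
    return splits
```

Because the split depends only on the path, the same image always lands in the same partition, though the actual validation share will only approximate `val_fraction` on small datasets.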

🔧 CLI Commands

Search Command

python ocrdlp.py search "document type" --engine serper --limit 50 --output urls.txt

Download Command

python ocrdlp.py download --urls-file urls.txt --output-dir ./images
# OR
python ocrdlp.py download --query "invoice" --output-dir ./images --limit 20

Classify Command

python ocrdlp.py classify ./images --output invoice_labels.jsonl --validate

Pipeline Command (Complete Workflow)

python ocrdlp.py pipeline "invoice documents" --output-dir ./invoice_dataset --limit 50

Validate Command

python ocrdlp.py validate invoice_labels.jsonl
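
The individual subcommands can also be chained from Python when more control is needed than the `pipeline` command offers. The sketch below uses only the flags shown in this README and assumes it is run from the repository root where `ocrdlp.py` lives; `build_workflow_commands` and `run_workflow` are hypothetical helpers, and whether `pipeline` performs exactly these steps internally is not confirmed here.

```python
import subprocess
import sys
from pathlib import Path


def build_workflow_commands(query, dataset_dir, limit=50, urls_file="urls.txt"):
    """Return the search, download, classify, and validate commands as argv lists."""
    dataset_dir = Path(dataset_dir)
    images_dir = dataset_dir / "images"
    # Follows the <dataset>_labels.jsonl naming used in the dataset structure.
    labels_file = dataset_dir / "labels" / f"{dataset_dir.name}_labels.jsonl"
    cli = [sys.executable, "ocrdlp.py"]
    return [
        cli + ["search", query, "--engine", "serper",
               "--limit", str(limit), "--output", urls_file],
        cli + ["download", "--urls-file", urls_file, "--output-dir", str(images_dir)],
        cli + ["classify", str(images_dir), "--output", str(labels_file)],
        cli + ["validate", str(labels_file)],
    ]


def run_workflow(query, dataset_dir, limit=50):
    """Run each step in order, stopping at the first failure."""
    (Path(dataset_dir) / "labels").mkdir(parents=True, exist_ok=True)
    for cmd in build_workflow_commands(query, dataset_dir, limit):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)
```

Building the commands separately from running them makes it easy to print or log each step before executing anything.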

🎉 Example: Creating Invoice Dataset

# Complete workflow to create invoice training dataset
python ocrdlp.py pipeline "invoice documents" --output-dir ./datasets/invoices --limit 100

# Dataset is now ready at ./datasets/invoices/
# - images/ contains downloaded invoice images
# - labels/invoice_dataset_labels.jsonl contains comprehensive labels

🔌 Offline Usage

The unit tests simulate the entire pipeline without making real network calls. This is useful when API access is unavailable.

pip install -r requirements.txt
pytest

To try the CLI without the search step, set dummy API keys and point the download command at a text file of image URLs:

export SERPER_API_KEY=dummy
export OPENAI_API_KEY=dummy
ocrdlp download --urls-file sample_urls.txt --output-dir ./offline_demo
ocrdlp classify ./offline_demo/images --output offline_labels.jsonl

🛠️ Development

Run Tests

python test_viability.py
python test_image_labeling.py

Direct Labeling (Alternative)

python gpt4v_image_labeler.py ./images invoice_labels.jsonl

📈 Key Benefits

Automated Dataset Creation - No manual labeling required
Multi-Purpose Labels - One dataset serves multiple model types
Production-Ready - Standard ML dataset format
Scalable - Can generate thousands of labeled images
Quality Assured - Built-in validation and quality checks

This is a CRAWLER for DATASET GENERATION, not a model evaluation tool.

Generated datasets are ready for use in training OCR, DLP, and document classification models.

⚠️ Image Rights Reminder

When collecting images from search engines or photo sites (Google Images, Unsplash, etc.), verify that you have permission to use and redistribute each image. Check the licensing terms and respect copyright restrictions before sharing any dataset.
