CLI toolkit to crawl images and build GPT-4V-labeled datasets for OCR & DLP model training

Jackmeson1/ocrdlp-lab

OCR-DLP Image Crawler & Dataset Generator

Create production-ready, GPT-4V-labeled datasets for OCR and Data-Loss-Prevention models in 3 minutes.

Workflow: crawl→label→train

Python License: MIT CI

🚀 Quick Start

# 1 Install
git clone https://github.com/Jackmeson1/ocrdlp-lab && cd ocrdlp-lab
pip install -r requirements.txt          # Python 3.8+

# 2 Set API keys
export SERPER_API_KEY=...
export OPENAI_API_KEY=...

# 3 One-shot pipeline demo
python ocrdlp.py pipeline "invoice document" \
       --output-dir ./datasets/invoices --limit 50
open datasets/invoices/labels/invoice_dataset_labels.jsonl   # macOS; use xdg-open on Linux

Need details? Jump to Configuration • CLI Commands

🎯 Purpose

This is a CRAWLER APPLICATION that generates LABELED DATASETS for downstream model training. It crawls images through multiple search engines and uses GPT-4V to generate comprehensive labels for training OCR, DLP, and document classification models.

Workflow: CRAWLER → LABELED DATASET → DOWNSTREAM MODEL TRAINING

Key Features

  • 🔍 Multi-Engine Image Search: Serper, Google, Bing, DuckDuckGo integration
  • 📥 Robust Image Download: Automated download with validation and error handling
  • 🤖 AI-Powered Labeling: GPT-4V integration for intelligent document labeling
  • 🏷️ Comprehensive Labels: Multi-purpose labels for OCR, DLP, and classification training
  • 📁 Dataset Generation: Creates production-ready datasets with standard structure
  • ⚡ CLI Interface: Professional command-line tool with subcommands
  • ✅ Quality Validation: Built-in dataset quality checks
  • ⏳ Rate-Limit Handling: Automatic retries with exponential backoff
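
To illustrate the rate-limit handling mentioned above, here is a minimal sketch of retrying with exponential backoff. It is not taken from the project's source: the function name `retry_with_backoff` and all of its parameters are assumptions made for this example.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**attempt, then retry.

    Hypothetical helper for illustration, not the toolkit's actual API.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt and is capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% jitter keeps concurrent workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Passing `sleep` as a parameter keeps the helper easy to test, since a test can record the delays instead of actually waiting.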

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • OpenAI API key (for GPT-4V labeling)
  • Serper API key (for image search)

Installation

  1. Clone the repository

    git clone https://github.com/Jackmeson1/ocrdlp-lab
    cd ocrdlp-lab
  2. Install dependencies

    pip install -r requirements.txt
  3. Set environment variables

    # Windows
    set SERPER_API_KEY=your_serper_api_key
    set OPENAI_API_KEY=your_openai_api_key
    
    # Linux/Mac
    export SERPER_API_KEY=your_serper_api_key
    export OPENAI_API_KEY=your_openai_api_key
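
Before running any command, it can help to confirm that both keys are visible to Python. The snippet below is a small sketch for that purpose; the helper name `missing_api_keys` is an assumption and is not part of the toolkit.

```python
import os

# The two environment variables this README asks you to set.
REQUIRED_KEYS = ("SERPER_API_KEY", "OPENAI_API_KEY")


def missing_api_keys(env=None):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All API keys are set.")
```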

Verify Installation

python ocrdlp.py --help

📋 Dataset Generation Workflow

1. Search for Images

# Search for document images
python ocrdlp.py search "invoice documents" --engine serper --limit 100 --output urls.txt
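
If you want to tidy the URL list before downloading, a short script can drop duplicates and malformed entries. This sketch assumes `urls.txt` holds one URL per line, which this README does not state explicitly; `clean_url_list` is a hypothetical helper, not a toolkit function.

```python
from urllib.parse import urlparse


def clean_url_list(lines):
    """Keep well-formed http(s) URLs, first occurrence only, in input order."""
    seen = set()
    cleaned = []
    for line in lines:
        url = line.strip()
        parsed = urlparse(url)
        # Skip blank lines and anything that is not an absolute http(s) URL.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned
```

A typical use would be to read `urls.txt`, pass the lines through `clean_url_list`, and write the result back before running the download step.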

2. Download Images

# Download images directly into the dataset
python ocrdlp.py download --urls-file urls.txt --output-dir ./datasets/invoice_dataset/images
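
The download step is described as validating images. The checks the tool actually performs are not documented here, so the following is only a sketch of one common approach: comparing the first bytes of a file against the JPEG and PNG signatures. The function names and the `min_bytes` threshold are assumptions.

```python
# File signatures ("magic bytes") for the two formats shown in this README.
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def sniff_image_type(data):
    """Return 'jpeg' or 'png' based on the leading bytes, else None."""
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    if data.startswith(PNG_MAGIC):
        return "png"
    return None


def is_valid_download(data, min_bytes=1024):
    """Reject tiny payloads and anything that is not a recognized image."""
    return len(data) >= min_bytes and sniff_image_type(data) is not None
```

A check like this catches the common case where a URL returns an HTML error page instead of an image.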

3. Generate Dataset with Labels

# Create label directory
mkdir -p datasets/invoice_dataset/labels

# Generate comprehensive labels
python ocrdlp.py classify datasets/invoice_dataset/images --output datasets/invoice_dataset/labels/invoice_dataset_labels.jsonl

4. Validate Dataset Quality

# Validate generated dataset
python ocrdlp.py validate datasets/invoice_dataset/labels/invoice_dataset_labels.jsonl
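
For a sense of what a quality check on the label file might involve, here is a minimal sketch that verifies each line is a JSON object containing a few of the fields listed under Label Schema. The real `validate` command may check different or additional things; `REQUIRED_FIELDS` and `check_label_lines` are assumptions for this example.

```python
import json

# A subset of the documented label fields, chosen for this illustration.
REQUIRED_FIELDS = (
    "document_category",
    "ocr_difficulty",
    "text_clarity",
    "language_primary",
    "sensitive_data_types",
    "image_quality",
)


def check_label_lines(lines):
    """Return a list of (line_number, problem) tuples; empty means all good."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            problems.append((number, "missing fields: " + ", ".join(missing)))
    return problems
```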

🏗️ Generated Dataset Structure

datasets/
└── invoice_dataset/
    ├── images/              # Raw images for training
    │   ├── image_001.jpg
    │   ├── image_002.png
    │   └── ...
    ├── labels/              # AI-generated labels
    │   ├── <dataset>_labels.jsonl     # Comprehensive labels
    │   └── <dataset>_labels_summary.md       # Dataset statistics
    └── README.md            # Usage instructions for ML engineers

📊 Label Schema

Each image gets comprehensive labels for multiple downstream use cases:

  • Document Classification: document_category, document_subcategory
  • OCR Training: ocr_difficulty, text_clarity, language_primary
  • DLP Training: sensitive_data_types, testing_scenarios
  • Quality Assessment: image_quality, background_complexity
  • Processing Hints: recommended_preprocessing, challenge_factors
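
The grouping above can be expressed as a small lookup table, which makes it easy to pull out only the fields a given model needs. This is a sketch: the field names come from the list above, while the group keys and the `fields_for` helper are assumptions.

```python
# Field names as listed in the schema; the group keys are chosen for this example.
SCHEMA_GROUPS = {
    "classification": ("document_category", "document_subcategory"),
    "ocr": ("ocr_difficulty", "text_clarity", "language_primary"),
    "dlp": ("sensitive_data_types", "testing_scenarios"),
    "quality": ("image_quality", "background_complexity"),
    "processing": ("recommended_preprocessing", "challenge_factors"),
}


def fields_for(record, group):
    """Project one label record onto the fields of a single use case.

    Fields absent from the record come back as None.
    """
    return {field: record.get(field) for field in SCHEMA_GROUPS[group]}
```

For example, `fields_for(record, "dlp")` returns only `sensitive_data_types` and `testing_scenarios` from a parsed label line.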

🎯 Downstream Model Usage

OCR Model Training

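import json
from pathlib import Path

def load_ocr_dataset(dataset_path):
    # The label file is named <dataset>_labels.jsonl; locate it by its suffix.
    labels_path = next(Path(dataset_path, "labels").glob("*_labels.jsonl"))
    with open(labels_path, "r", encoding="utf-8") as f:
        # One JSON record per line; skip blank lines.
        labels = [json.loads(line) for line in f if line.strip()]

    # Keep only the fields relevant to OCR training.
    return [{
        'image_path': label['_file_info']['file_path'],
        'difficulty': label['ocr_difficulty'],
        'text_clarity': label['text_clarity'],
        'language': label['language_primary']
    } for label in labels]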

DLP Model Training

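import json
from pathlib import Path

def load_dlp_dataset(dataset_path):
    # The label file is named <dataset>_labels.jsonl; locate it by its suffix.
    labels_path = next(Path(dataset_path, "labels").glob("*_labels.jsonl"))
    with open(labels_path, "r", encoding="utf-8") as f:
        # One JSON record per line; skip blank lines.
        labels = [json.loads(line) for line in f if line.strip()]

    # Keep only the fields relevant to DLP training.
    return [{
        'image_path': label['_file_info']['file_path'],
        'sensitive_data': label['sensitive_data_types'],
        'document_type': label['document_category']
    } for label in labels]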
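
Once samples are loaded, a training run usually needs a train/validation split. The sketch below shows one way to make that split reproducible by hashing each image path instead of shuffling. It assumes each sample is a dict with an `image_path` key, as returned by the loaders above; the function names and the 20% default are assumptions.

```python
import hashlib


def assign_split(image_path, val_fraction=0.2):
    """Map a path to 'train' or 'val' deterministically via its SHA-256 hash."""
    digest = hashlib.sha256(image_path.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return "val" if bucket < val_fraction * 2 ** 64 else "train"


def split_dataset(samples, val_fraction=0.2):
    """Group samples into train and val lists, preserving input order."""
    splits = {"train": [], "val": []}
    for sample in samples:
        splits[assign_split(sample["image_path"], val_fraction)].append(sample)
    return splits
```

Because the assignment depends only on the path, an image stays in the same split when more images are added to the dataset later.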

🔧 CLI Commands

Search Command

python ocrdlp.py search "document type" --engine serper --limit 50 --output urls.txt

Download Command

python ocrdlp.py download --urls-file urls.txt --output-dir ./images
# OR
python ocrdlp.py download --query "invoice" --output-dir ./images --limit 20

Classify Command

python ocrdlp.py classify ./images --output invoice_labels.jsonl --validate

Pipeline Command (Complete Workflow)

python ocrdlp.py pipeline "invoice documents" --output-dir ./invoice_dataset --limit 50

Validate Command

python ocrdlp.py validate invoice_labels.jsonl

🎉 Example: Creating Invoice Dataset

# Complete workflow to create invoice training dataset
python ocrdlp.py pipeline "invoice documents" --output-dir ./datasets/invoices --limit 100

# Dataset is now ready at ./datasets/invoices/
# - images/ contains downloaded invoice images
# - labels/invoice_dataset_labels.jsonl contains comprehensive labels
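
After the pipeline finishes, a quick tally of document categories gives a first impression of what was collected. This is a sketch that reads the JSONL lines directly; `category_counts` is a hypothetical helper, and the `"unknown"` fallback is an assumption for records without a `document_category` field.

```python
import json
from collections import Counter


def category_counts(lines):
    """Count label records per document_category, skipping blank lines."""
    counts = Counter()
    for line in lines:
        if line.strip():
            record = json.loads(line)
            counts[record.get("document_category", "unknown")] += 1
    return counts


if __name__ == "__main__":
    path = "datasets/invoices/labels/invoice_dataset_labels.jsonl"
    with open(path, "r", encoding="utf-8") as f:
        for category, count in category_counts(f).most_common():
            print(f"{category}: {count}")
```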

🔌 Offline Usage

The unit tests simulate the entire pipeline without making real network calls. This is useful when API access is unavailable.

pip install -r requirements.txt
pytest

To try the CLI without a search API, set dummy API keys and point the download command at a text file of image URLs:

export SERPER_API_KEY=dummy
export OPENAI_API_KEY=dummy
python ocrdlp.py download --urls-file sample_urls.txt --output-dir ./offline_demo/images
python ocrdlp.py classify ./offline_demo/images --output offline_labels.jsonl

🛠️ Development

Run Tests

python test_viability.py
python test_image_labeling.py

Direct Labeling (Alternative)

python gpt4v_image_labeler.py ./images invoice_labels.jsonl

📈 Key Benefits

  1. Automated Dataset Creation - No manual labeling required
  2. Multi-Purpose Labels - One dataset serves multiple model types
  3. Production-Ready - Standard ML dataset format
  4. Scalable - Can generate thousands of labeled images
  5. Quality Assured - Built-in validation and quality checks

This is a CRAWLER for DATASET GENERATION, not a model evaluation tool.

Generated datasets are ready for use in training OCR, DLP, and document classification models.

⚠️ Image Rights Reminder

When collecting images from search engines or photo sites (Google Images, Unsplash, etc.), verify that you have permission to use and redistribute each image. Check the licensing terms and respect copyright restrictions before sharing any dataset.
