Create production-ready, GPT-4V-labeled datasets for OCR and Data-Loss-Prevention models in 3 minutes.
# 1 Install
git clone https://github.com/Jackmeson1/ocrdlp-lab && cd ocrdlp-lab
pip install -r requirements.txt # Python 3.8+
# 2 Set API keys
export SERPER_API_KEY=...
export OPENAI_API_KEY=...
# 3 One-shot pipeline demo
python ocrdlp.py pipeline "invoice document" \
--output-dir ./datasets/invoices --limit 50
open datasets/invoices/labels/invoice_dataset_labels.jsonl

Need details? Jump to Configuration • CLI Commands
This is a CRAWLER APPLICATION that generates LABELED DATASETS for downstream model training. It crawls images from various sources and uses AI to generate comprehensive labels for training OCR, DLP, and document classification models.
Workflow: CRAWLER → LABELED DATASET → DOWNSTREAM MODEL TRAINING
- 🔍 Multi-Engine Image Search: Serper, Google, Bing, DuckDuckGo integration
- 📥 Robust Image Download: Automated download with validation and error handling
- 🤖 AI-Powered Labeling: GPT-4V integration for intelligent document labeling
- 🏷️ Comprehensive Labels: Multi-purpose labels for OCR, DLP, and classification training
- 📊 Dataset Generation: Creates production-ready datasets with standard structure
- ⚡ CLI Interface: Professional command-line tool with subcommands
- ✅ Quality Validation: Built-in dataset quality checks
- ⏳ Rate-Limit Handling: Automatic retries with exponential backoff
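The retry logic is internal to the tool, but the general pattern behind "retries with exponential backoff" can be sketched like this (an illustrative sketch, not the project's actual code; `with_backoff` and `call` are hypothetical names):

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call`, sleeping base_delay * 2**attempt plus jitter between tries."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            # Exponential backoff (1s, 2s, 4s, ...) with jitter scaled to base_delay.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

In practice a real implementation would retry only on rate-limit responses (e.g. HTTP 429) rather than every exception.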
- Python 3.8+
- OpenAI API key (for GPT-4V labeling)
- Serper API key (for image search)
- Clone the repository

  git clone <repository-url>
  cd ocrdlp-lab

- Install dependencies

  pip install -r requirements.txt

- Set environment variables

  # Windows
  set SERPER_API_KEY=your_serper_api_key
  set OPENAI_API_KEY=your_openai_api_key

  # Linux/Mac
  export SERPER_API_KEY=your_serper_api_key
  export OPENAI_API_KEY=your_openai_api_key
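To fail fast before a long run, you can sanity-check that both keys are set. This is a small convenience sketch, not part of the tool (`check_api_keys` is a hypothetical helper):

```python
import os

def check_api_keys():
    """Exit with a clear message if a required API key is missing."""
    missing = [k for k in ("SERPER_API_KEY", "OPENAI_API_KEY")
               if not os.environ.get(k)]
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```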
python ocrdlp.py --help

# Search for document images
python ocrdlp.py search "invoice documents" --engine serper --limit 100 --output urls.txt

# Download images directly into the dataset
python ocrdlp.py download --urls-file urls.txt --output-dir ./datasets/invoice_dataset/images

# Create label directory
mkdir -p datasets/invoice_dataset/labels

# Generate comprehensive labels
python ocrdlp.py classify datasets/invoice_dataset/images --output datasets/invoice_dataset/labels/invoice_dataset_labels.jsonl

# Validate generated dataset
python ocrdlp.py validate datasets/invoice_dataset/labels/invoice_dataset_labels.jsonl

datasets/
└── invoice_dataset/
    ├── images/                         # Raw images for training
    │   ├── image_001.jpg
    │   ├── image_002.png
    │   └── ...
    ├── labels/                         # AI-generated labels
    │   ├── <dataset>_labels.jsonl      # Comprehensive labels
    │   └── <dataset>_labels_summary.md # Dataset statistics
    └── README.md                       # Usage instructions for ML engineers
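If you assemble a dataset by hand rather than through the `pipeline` command, the same layout can be scaffolded in a few lines (a convenience sketch, not part of the tool; `scaffold_dataset` is a hypothetical helper):

```python
from pathlib import Path

def scaffold_dataset(root, name):
    """Create the images/ and labels/ directories the tools expect."""
    base = Path(root) / name
    (base / "images").mkdir(parents=True, exist_ok=True)
    (base / "labels").mkdir(parents=True, exist_ok=True)
    return base

# Example: scaffold_dataset("datasets", "invoice_dataset")
```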
Each image gets comprehensive labels for multiple downstream use cases:
- Document Classification: document_category, document_subcategory
- OCR Training: ocr_difficulty, text_clarity, language_primary
- DLP Training: sensitive_data_types, testing_scenarios
- Quality Assessment: image_quality, background_complexity
- Processing Hints: recommended_preprocessing, challenge_factors
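Concretely, each line of the JSONL file is one label record. A record might look like the following (the field values and the exact field set here are illustrative, not a schema guarantee):

```python
import json

# One illustrative JSONL record; real label files contain one object per line.
record = json.loads("""
{
  "_file_info": {"file_path": "images/image_001.jpg"},
  "document_category": "invoice",
  "document_subcategory": "utility_bill",
  "ocr_difficulty": "medium",
  "text_clarity": "high",
  "language_primary": "en",
  "sensitive_data_types": ["account_number", "address"],
  "image_quality": "good",
  "background_complexity": "low"
}
""")
print(record["document_category"])  # invoice
```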
import json

def load_ocr_dataset(dataset_path):
    labels_path = f"{dataset_path}/labels/<dataset>_labels.jsonl"
    with open(labels_path, 'r') as f:
        labels = [json.loads(line) for line in f]
    return [{
        'image_path': label['_file_info']['file_path'],
        'difficulty': label['ocr_difficulty'],
        'text_clarity': label['text_clarity'],
        'language': label['language_primary']
    } for label in labels]

def load_dlp_dataset(dataset_path):
    labels_path = f"{dataset_path}/labels/<dataset>_labels.jsonl"
    with open(labels_path, 'r') as f:
        labels = [json.loads(line) for line in f]
    return [{
        'image_path': label['_file_info']['file_path'],
        'sensitive_data': label['sensitive_data_types'],
        'document_type': label['document_category']
    } for label in labels]

python ocrdlp.py search "document type" --engine serper --limit 50 --output urls.txt

python ocrdlp.py download --urls-file urls.txt --output-dir ./images
# OR
python ocrdlp.py download --query "invoice" --output-dir ./images --limit 20

python ocrdlp.py classify ./images --output invoice_labels.jsonl --validate

python ocrdlp.py pipeline "invoice documents" --output-dir ./invoice_dataset --limit 50

python ocrdlp.py validate invoice_labels.jsonl

# Complete workflow to create invoice training dataset
python ocrdlp.py pipeline "invoice documents" --output-dir ./datasets/invoices --limit 100
# Dataset is now ready at ./datasets/invoices/
# - images/ contains downloaded invoice images
# - labels/invoice_dataset_labels.jsonl contains comprehensive labels

The unit tests simulate the entire pipeline without making real network calls. This is useful when API access is unavailable.
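Conceptually, simulating the pipeline means replacing the network-facing stages with stubs. A minimal sketch of that idea (all names here are hypothetical, not the project's actual test code):

```python
def run_pipeline(search, download, label, query, limit):
    """Toy pipeline with injectable stages: search -> download -> label."""
    urls = search(query, limit)
    image_paths = [download(url) for url in urls]
    return [label(path) for path in image_paths]

# Stub each stage so no real search API, HTTP download, or GPT-4V call happens.
def fake_search(query, limit):
    return [f"http://example.com/{i}.jpg" for i in range(limit)]

def fake_download(url):
    return url.rsplit("/", 1)[-1]  # pretend the file was saved locally

def fake_label(path):
    return {"_file_info": {"file_path": path}, "document_category": "invoice"}

labels = run_pipeline(fake_search, fake_download, fake_label, "invoice", 3)
assert len(labels) == 3
```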
pip install -r requirements.txt
pytest

To try the CLI with predownloaded images, set dummy API keys and point the
download command at a text file of image URLs:
export SERPER_API_KEY=dummy
export OPENAI_API_KEY=dummy
ocrdlp download --urls-file sample_urls.txt --output-dir ./offline_demo
ocrdlp classify ./offline_demo/images --output offline_labels.jsonl

python test_viability.py
python test_image_labeling.py
python gpt4v_image_labeler.py ./images invoice_labels.jsonl

- Automated Dataset Creation - No manual labeling required
- Multi-Purpose Labels - One dataset serves multiple model types
- Production-Ready - Standard ML dataset format
- Scalable - Can generate thousands of labeled images
- Quality Assured - Built-in validation and quality checks
This is a CRAWLER for DATASET GENERATION, not a model evaluation tool.
Generated datasets are ready for use in training OCR, DLP, and document classification models.
When collecting images from search engines or photo sites (Google Images, Unsplash, etc.), verify that you have permission to use and redistribute each image. Check the licensing terms and respect copyright restrictions before sharing any dataset.
