A web-based tool that classifies images using Optical Character Recognition (OCR) with Tesseract.js. It extracts the text from each image and assigns a category based on configurable keyword rules.
- Multi-format Support: Upload images in various formats (PNG, JPG, JPEG, etc.)
- Batch Processing: Process multiple images simultaneously
- High-Quality OCR: Powered by Tesseract.js for accurate text extraction
- Multi-language Support: Choose the OCR language from the following options:
  - English + Traditional Chinese (default)
  - English only
  - Traditional Chinese only
  - Japanese
  - Korean
  - French
  - German
  - Spanish
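
The language options above correspond to standard Tesseract language codes (`eng`, `chi_tra`, `jpn`, and so on), and Tesseract combines several languages by joining their codes with `+`. A minimal sketch of such a mapping follows; the option keys (`en+zh-hant`, `ja`, ...) and the function name are hypothetical and not taken from this tool's source:

```javascript
// Hypothetical option keys mapped to Tesseract language codes.
// Only the codes (eng, chi_tra, jpn, ...) are standard Tesseract names.
const LANGUAGE_OPTIONS = {
  'en+zh-hant': ['eng', 'chi_tra'], // English + Traditional Chinese (default)
  en: ['eng'],
  'zh-hant': ['chi_tra'],
  ja: ['jpn'],
  ko: ['kor'],
  fr: ['fra'],
  de: ['deu'],
  es: ['spa'],
};

const DEFAULT_OPTION = 'en+zh-hant';

// Returns the language string to hand to the OCR engine, e.g. "eng+chi_tra".
// Unknown options fall back to the default instead of throwing.
function toTesseractLangs(option) {
  const known = Object.prototype.hasOwnProperty.call(LANGUAGE_OPTIONS, option);
  const codes = known ? LANGUAGE_OPTIONS[option] : LANGUAGE_OPTIONS[DEFAULT_OPTION];
  return codes.join('+');
}
```

The resulting string is the kind of value Tesseract.js accepts when a worker is created; the exact call depends on the Tesseract.js version in use.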
- Custom Rules: Create classification rules with keywords and colors
- Automatic Categorization: Images are automatically classified based on extracted text
- Visual Indicators: Color-coded classifications for easy identification
- Real-time Updates: Classification updates as you modify rules
- Word Highlighting: Click on extracted words to highlight them on the image
- Zoom & Pan: Navigate through images with mouse wheel zoom and drag pan
- Bounding Boxes: Visual representation of detected text regions
- Accordion View: Expandable list of all extracted words
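
To draw a highlight over a word while the image is zoomed or panned, the word's bounding box has to be converted from image pixels to screen coordinates. The sketch below assumes the `{ x0, y0, x1, y1 }` box layout that Tesseract.js reports for words and a hypothetical `view` object (`scale`, `offsetX`, `offsetY`) holding the current zoom and pan; it is an illustration, not the tool's actual rendering code:

```javascript
// bbox: { x0, y0, x1, y1 } in image pixels.
// view: { scale, offsetX, offsetY } is an assumed representation of zoom and pan.
function bboxToScreenRect(bbox, view) {
  return {
    left: bbox.x0 * view.scale + view.offsetX,
    top: bbox.y0 * view.scale + view.offsetY,
    width: (bbox.x1 - bbox.x0) * view.scale,
    height: (bbox.y1 - bbox.y0) * view.scale,
  };
}
```

The returned rectangle can be applied to an absolutely positioned overlay element or drawn on a canvas.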
- CSV Export: Download classification results in CSV format
- Batch Results: Process and export results for multiple images
- Status Tracking: Monitor processing status for each image
- Click "Upload Images" to select one or more image files
- Images will be automatically queued for OCR processing
- Processing status is displayed on each thumbnail
- Add New Rule: Click "Add New Rule" to create a new classification
- Keywords: Enter comma-separated keywords that trigger classification
- Colors: Choose a color for each classification category
- Priority: Rules are processed in order (first match wins)
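
The "first match wins" behavior can be sketched as follows, using the two example rules this README lists as defaults. The function names and the case-insensitive substring matching are assumptions made for illustration; the tool's real matching logic may differ:

```javascript
// Rule order defines priority: earlier rules win.
const exampleRules = [
  { name: 'Invoice', keywords: ['invoice', 'bill', 'receipt'], color: 'red' },
  { name: 'Contract', keywords: ['agreement', 'contract', 'terms'], color: 'blue' },
];

// Turns the comma-separated keyword input into a clean, lowercase list.
function parseKeywords(input) {
  return input
    .split(',')
    .map((keyword) => keyword.trim().toLowerCase())
    .filter((keyword) => keyword.length > 0);
}

// Returns the first rule whose keywords appear in the text, or null if none match.
// Matching is an assumed case-insensitive substring test.
function classify(text, rules) {
  const haystack = text.toLowerCase();
  for (const rule of rules) {
    if (rule.keywords.some((keyword) => haystack.includes(keyword))) {
      return { name: rule.name, color: rule.color };
    }
  }
  return null;
}
```

Because matching is by substring in this sketch, a short keyword such as `bill` also matches `billing`; a word-boundary check would be stricter at the cost of missing some OCR variations.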
- Watch real-time progress as images are processed
- Status indicators show: "queued", "processing...", "classifying...", "done", or "error"
- Classified images display their category and color
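
The status values above form a small state machine. The sketch below models it with a transition table; which transitions are allowed is an assumption (for example, that a finished image can return to "classifying..." when rules change), since the README only lists the status labels:

```javascript
// Assumed transitions between the status labels shown on each thumbnail.
const STATUS_TRANSITIONS = {
  'queued': ['processing...', 'error'],
  'processing...': ['classifying...', 'error'],
  'classifying...': ['done', 'error'],
  'done': ['classifying...'], // assumed: rules were edited, so reclassify
  'error': [],
};

// Returns the new status, or throws if the move is not in the table.
function nextStatus(current, next) {
  const allowed = STATUS_TRANSITIONS[current];
  if (!allowed || !allowed.includes(next)) {
    throw new Error(`Invalid status transition: ${current} -> ${next}`);
  }
  return next;
}
```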
- Select Images: Click thumbnails to view full-size images
- Explore Text: Use the accordion to view all extracted words
- Highlight Words: Click on words to highlight their bounding boxes
- Navigate: Use mouse wheel to zoom, drag to pan
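
Mouse-wheel zoom usually feels natural only when the point under the cursor stays fixed while the scale changes. A minimal sketch of that arithmetic, together with drag panning, is shown below; the `view` shape (`scale`, `offsetX`, `offsetY`), the function names, and the scale limits are hypothetical:

```javascript
// Zooms by `factor` while keeping the image point under the cursor in place.
// Scale limits are illustrative defaults, not values taken from the tool.
function zoomAt(view, cursorX, cursorY, factor, minScale = 0.1, maxScale = 10) {
  const scale = Math.min(maxScale, Math.max(minScale, view.scale * factor));
  const ratio = scale / view.scale;
  return {
    scale,
    offsetX: cursorX - (cursorX - view.offsetX) * ratio,
    offsetY: cursorY - (cursorY - view.offsetY) * ratio,
  };
}

// Moves the image by the drag distance in screen pixels.
function panBy(view, dx, dy) {
  return { scale: view.scale, offsetX: view.offsetX + dx, offsetY: view.offsetY + dy };
}
```

In a browser, `zoomAt` would typically be called from a `wheel` event handler and `panBy` from pointer-move events while a button is held.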
- Click "Download CSV" to export all classifications
- CSV includes: Image Name, Classification, Extracted Text
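
Extracted text often contains commas, quotes, and line breaks, so each field needs CSV escaping before export. The sketch below builds a file with the three columns listed above; the result object's property names (`imageName`, `classification`, `text`) are assumptions for illustration:

```javascript
// Quotes a field when it contains a comma, quote, or line break,
// doubling any embedded quotes (the usual CSV convention).
function csvField(value) {
  const s = String(value ?? '');
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// results: array of { imageName, classification, text } (assumed shape).
function toCsv(results) {
  const header = ['Image Name', 'Classification', 'Extracted Text'];
  const rows = results.map((r) => [r.imageName, r.classification, r.text]);
  return [header, ...rows].map((row) => row.map(csvField).join(',')).join('\r\n');
}
```

In the browser, the returned string could be wrapped in a `Blob` and offered as a download.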
The tool comes with two example rules:
- Invoice: Keywords: `invoice,bill,receipt` (Red)
- Contract: Keywords: `agreement,contract,terms` (Blue)
- Click "Add New Rule"
- Enter a descriptive name (e.g., "Receipt", "Form", "Letter")
- Add comma-separated keywords that appear in that document type
- Choose a distinctive color
- Rules are automatically applied to all images
- Use specific, unique terms for better accuracy
- Include variations and synonyms
- Avoid overly common words
- Test with sample images to refine rules
- Tesseract.js: WebAssembly port of Google's Tesseract OCR engine
- Worker-based: Non-blocking OCR processing
- Language Models: Pre-trained models for multiple languages
- Parallel Processing: Multiple images processed simultaneously
- Memory Efficient: Images processed in chunks
- Progress Tracking: Real-time status updates
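
One common way to process several images at once without overloading the browser is a small pool that caps how many OCR tasks run concurrently. The sketch below is generic: `runPool` and its `task` callback are hypothetical names, and in practice `task` would wrap a call to a Tesseract.js worker. It is not a description of how this tool schedules its work:

```javascript
// Runs `task` over `items` with at most `limit` tasks in flight.
// Each result records success or failure so one bad image does not stop the batch.
async function runPool(items, limit, task) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to claim

  async function worker() {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { status: 'done', value: await task(items[index], index) };
      } catch (error) {
        results[index] = { status: 'error', error };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Results keep the order of the input items, which makes it straightforward to update the matching thumbnail's status as each task settles.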
- Modern browsers with WebAssembly support
- Chrome, Firefox, Safari, Edge (latest versions)
- Mobile browsers may have performance limitations
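
Since WebAssembly is required, a page can check for it up front and show a clear message instead of failing during OCR. A minimal sketch using the standard `WebAssembly` global (the function name is hypothetical):

```javascript
// Returns true when the runtime exposes a working WebAssembly implementation.
function supportsWebAssembly() {
  try {
    if (typeof WebAssembly !== 'object' || typeof WebAssembly.instantiate !== 'function') {
      return false;
    }
    // Smallest valid module: the "\0asm" magic number followed by version 1.
    const bytes = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);
    return WebAssembly.validate(bytes);
  } catch (e) {
    return false;
  }
}
```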
- Automatically categorize invoices, receipts, contracts
- Organize scanned documents by type
- Batch process large document collections
- Extract text from forms and surveys
- Process handwritten notes and documents (expect lower accuracy: Tesseract is designed primarily for printed text)
- Analyze image-based content
- Verify document types match expected content
- Identify misclassified documents
- Monitor document processing workflows
- Use high-resolution images (300+ DPI)
- Ensure good contrast between text and background
- Avoid blurry or skewed images
- Use consistent lighting
- Start with specific, unique terms
- Test rules with sample images
- Refine keywords based on results
- Consider document context
- Choose the primary language of your documents
- Use multi-language option for mixed content
- Consider regional variations in text
OCR Not Working
- Check browser console for errors
- Ensure images are valid and not corrupted
- Try refreshing the page and re-uploading
Poor Classification Accuracy
- Review and refine keyword rules
- Check image quality and text clarity
- Verify language selection matches document language
Slow Processing
- Reduce image resolution if possible
- Process fewer images simultaneously
- Check browser performance settings
- Close other browser tabs
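- Use a modern, up-to-date browser
- Ensure a stable internet connection when the page first loads (Tesseract.js typically downloads its language models before OCR can start)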
- Process images in smaller batches
- WebAssembly Support: Required for Tesseract.js
- Modern JavaScript: ES6+ features
- Memory: Sufficient RAM for image processing
- Storage: Local storage for user preferences
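
Preferences such as the selected language and the rule list can be persisted with the browser's `localStorage`. The sketch below takes the storage object as a parameter so it also works with a stand-in during testing; the key name and the shape of the preferences object are assumptions, not the tool's actual format:

```javascript
const PREFS_KEY = 'ocr-classifier:prefs'; // hypothetical key name

// storage: any object with getItem/setItem, e.g. window.localStorage.
function savePrefs(storage, prefs) {
  storage.setItem(PREFS_KEY, JSON.stringify(prefs));
}

// Returns stored preferences merged over the defaults.
// Falls back to the defaults if nothing is stored or the data is unreadable.
function loadPrefs(storage, defaults) {
  try {
    const raw = storage.getItem(PREFS_KEY);
    return raw === null ? defaults : { ...defaults, ...JSON.parse(raw) };
  } catch (e) {
    return defaults;
  }
}
```

In the page itself, `window.localStorage` would be passed as `storage`.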
- Local Processing: All OCR processing happens in your browser
- No Upload: Images are not sent to external servers
- Temporary Storage: Images are stored only during the session
- Data Export: Only processed results are exported
- Machine Learning: Improved classification accuracy
- Template Matching: Visual pattern recognition
- Cloud Processing: Option for server-side processing
- API Integration: Connect with external document systems
- Advanced Export: PDF reports and detailed analytics
Version: 1.0
Last Updated: 2025
License: Open Source
Contributors: PDF Toolbox Team