OCR Text Extraction

The ocr_screenshot tool extracts all visible text from a screenshot with tap-ready coordinates. This is useful when accessibility labels are missing or when you need to find text that isn't exposed in the accessibility tree.

Note: Many iOS interaction tools (swipe, text input, accessibility queries) require IDB. See the Platform Setup section for installation instructions.

Why OCR?

Approach	Pros	Cons
Accessibility tree (`find_element`)	Fast, reliable, low token usage	Only finds elements with accessibility labels
Screenshot + Vision	Visual layout understanding	High token usage, slow
OCR	Works on ANY visible text, returns tap coordinates	Requires text to be visible, may miss small text

Usage

ocr_screenshot with platform="ios"

Returns all visible text with tap-ready coordinates:

{
  "platform": "ios",
  "engine": "cloud",
  "processingTimeMs": 550,
  "elementCount": 24,
  "elements": [
    { "text": "Settings", "confidence": 95, "tapX": 195, "tapY": 52 },
    { "text": "Login", "confidence": 95, "tapX": 187, "tapY": 420 }
  ]
}

Then tap the element:

tap with x=187 y=420
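
To go from the response to a tap, a caller needs to pick the element whose text matches and read its tapX and tapY. The sketch below shows one way to do that in Python. The helper name `find_tap_target` and the `min_confidence` threshold are illustrative assumptions, not part of the tool. The 0-100 confidence scale is inferred from the sample response above.

```python
import json
from typing import Optional, Tuple

# Trimmed copy of the sample response shown above.
SAMPLE = json.loads("""
{
  "platform": "ios",
  "elements": [
    { "text": "Settings", "confidence": 95, "tapX": 195, "tapY": 52 },
    { "text": "Login", "confidence": 95, "tapX": 187, "tapY": 420 }
  ]
}
""")


def find_tap_target(
    ocr_result: dict, text: str, min_confidence: float = 80.0
) -> Optional[Tuple[int, int]]:
    """Return (tapX, tapY) of the most confident exact match, or None.

    min_confidence is an assumed cutoff; tune it for your screens.
    """
    wanted = text.strip().lower()
    best = None
    for element in ocr_result.get("elements", []):
        # Case-insensitive exact match on the recognized text.
        if element.get("text", "").strip().lower() != wanted:
            continue
        if element.get("confidence", 0) < min_confidence:
            continue
        if best is None or element["confidence"] > best["confidence"]:
            best = element
    if best is None:
        return None
    return best["tapX"], best["tapY"]


print(find_tap_target(SAMPLE, "Login"))  # (187, 420)
```

Exact matching keeps the example predictable. Substring or fuzzy matching may suit screens where OCR splits or merges words.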

OCR Engine

OCR uses the Google Cloud Vision API via a cloud proxy for fast, accurate text recognition (roughly 97% accuracy or better, about 0.5 s processing time). This works out of the box with no local dependencies.

Screenshots are sent over HTTPS to our cloud endpoint for processing and immediately deleted after recognition — no images are stored.

Offline Fallback (EasyOCR)

If the cloud endpoint is unreachable (no internet, timeout), OCR falls back to local EasyOCR (Python-based). The fallback requires Python 3.6 or later:

# macOS
brew install python@3.11

# Ubuntu/Debian
sudo apt install python3

EasyOCR and its Python dependencies are installed automatically by node-easyocr. The local fallback is slower (~2-3s) and less accurate (~85-90%) but works offline.
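
The cloud-first, local-second behavior described above follows a common fallback pattern. The sketch below illustrates that pattern only and is not the tool's actual implementation. The `OcrEngine` signature, the stub engines, and the "easyocr" engine label are all assumptions made for the example.

```python
from typing import Callable, Dict

# Hypothetical engine signature: takes image bytes, returns an OCR result dict.
OcrEngine = Callable[[bytes], Dict]


def recognize_with_fallback(image: bytes, cloud: OcrEngine, local: OcrEngine) -> Dict:
    """Try the cloud engine first; use the local engine if it is unreachable."""
    try:
        result = dict(cloud(image))
        result["engine"] = "cloud"
    except (ConnectionError, TimeoutError):
        # No internet or a timeout: fall back to the slower local engine.
        result = dict(local(image))
        result["engine"] = "easyocr"  # assumed label, not confirmed by the docs
    return result


def unreachable_cloud(image: bytes) -> Dict:
    """Stub that simulates the cloud endpoint timing out."""
    raise TimeoutError("cloud endpoint timed out")


def stub_local(image: bytes) -> Dict:
    """Stub standing in for the local EasyOCR engine."""
    return {"elements": [{"text": "Login", "confidence": 88, "tapX": 187, "tapY": 420}]}


result = recognize_with_fallback(b"", unreachable_cloud, stub_local)
print(result["engine"])  # easyocr
```

Checking the engine field in a response is a quick way to tell whether a slow or less accurate result came from the fallback path.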

OCR Language Configuration

Google Cloud Vision automatically detects and recognizes text in most languages without configuration.

For the offline EasyOCR fallback, set EASYOCR_LANGUAGES to a comma-separated list of language codes to add language support:

EASYOCR_LANGUAGES=es,fr
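
As a rough illustration of the value format, the sketch below splits a comma-separated EASYOCR_LANGUAGES value into individual codes. It assumes the setting is read as an environment variable. The helper name `parse_easyocr_languages` is hypothetical, and how the tool itself parses the value, or what it does when the variable is unset, is not specified here.

```python
import os
from typing import List, Mapping


def parse_easyocr_languages(env: Mapping[str, str] = os.environ) -> List[str]:
    """Split EASYOCR_LANGUAGES into a list of lowercase language codes.

    Returns an empty list when the variable is unset or blank.
    """
    raw = env.get("EASYOCR_LANGUAGES", "")
    # Tolerate stray spaces and empty entries such as "es, fr,".
    return [code.strip().lower() for code in raw.split(",") if code.strip()]


print(parse_easyocr_languages({"EASYOCR_LANGUAGES": "es,fr"}))  # ['es', 'fr']
```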

Recommended Workflow

1. Use unified tap - Handles the fallback chain automatically
2. Fall back to OCR - When tap suggests using coordinates
3. Use screenshot - For visual debugging or layout verification

# Simplest approach — tap handles everything
tap with text="Submit"

# If tap fails, use OCR to find coordinates
ocr_screenshot with platform="android"

# Then tap using coordinates from OCR result
tap with x=540 y=1200
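
The same order of operations can be sketched as a small Python function. The three callables (`tap_by_text`, `run_ocr`, `tap_at`) are hypothetical wrappers around the tap and ocr_screenshot tools. How those tools are actually invoked from code depends on your client and is not shown here.

```python
from typing import Callable, Dict, List, Tuple


def tap_with_ocr_fallback(
    text: str,
    tap_by_text: Callable[[str], bool],
    run_ocr: Callable[[], Dict],
    tap_at: Callable[[int, int], None],
) -> bool:
    """Tap by text first; if that fails, locate the text via OCR and tap by coordinates."""
    # Step 1: the unified tap handles its own fallback chain.
    if tap_by_text(text):
        return True
    # Step 2: run OCR and look for the same text among the recognized elements.
    wanted = text.strip().lower()
    for element in run_ocr().get("elements", []):
        if element.get("text", "").strip().lower() == wanted:
            # Step 3: tap using the coordinates from the OCR result.
            tap_at(element["tapX"], element["tapY"])
            return True
    return False


# Demo with stubs: the text tap fails, OCR finds "Submit" at (540, 1200).
taps: List[Tuple[int, int]] = []
ok = tap_with_ocr_fallback(
    "Submit",
    tap_by_text=lambda text: False,
    run_ocr=lambda: {
        "elements": [{"text": "Submit", "confidence": 95, "tapX": 540, "tapY": 1200}]
    },
    tap_at=lambda x, y: taps.append((x, y)),
)
print(ok, taps)  # True [(540, 1200)]
```

If the function returns False, taking a screenshot for visual debugging is the remaining step in the workflow.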
