OCR Text Extraction

The ocr_screenshot tool extracts all visible text from a screenshot and returns tap-ready coordinates for each piece of text. Use it when accessibility labels are missing or when the text you need isn't exposed in the accessibility tree.

Note: Many iOS interaction tools (swipe, text input, accessibility queries) require IDB. See the Platform Setup section for installation instructions.

Why OCR?

| Approach | Pros | Cons |
|---|---|---|
| Accessibility tree (find_element) | Fast, reliable, low token usage | Only finds elements with accessibility labels |
| Screenshot + Vision | Visual layout understanding | High token usage, slow |
| OCR | Works on ANY visible text, returns tap coordinates | Requires text to be visible, may miss small text |

Usage

ocr_screenshot with platform="ios"

Returns all visible text with tap-ready coordinates:

{
  "platform": "ios",
  "engine": "cloud",
  "processingTimeMs": 550,
  "elementCount": 24,
  "elements": [
    { "text": "Settings", "confidence": 95, "tapX": 195, "tapY": 52 },
    { "text": "Login", "confidence": 95, "tapX": 187, "tapY": 420 }
  ]
}

Then tap the element:

tap with x=187 y=420
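
As a minimal sketch of the step between those two calls, the Python helper below looks up an entry in the `elements` array by its text and returns its tap coordinates. The field names come from the sample response above; the `find_tap_target` helper and its `min_confidence` cutoff are hypothetical and not part of the tool.

```python
import json
from typing import Optional, Tuple

# Sample response, abridged from the ocr_screenshot output shown above.
SAMPLE_RESPONSE = """
{
  "platform": "ios",
  "engine": "cloud",
  "elements": [
    { "text": "Settings", "confidence": 95, "tapX": 195, "tapY": 52 },
    { "text": "Login", "confidence": 95, "tapX": 187, "tapY": 420 }
  ]
}
"""


def find_tap_target(
    response: dict, text: str, min_confidence: float = 80
) -> Optional[Tuple[int, int]]:
    """Return (tapX, tapY) of the first element whose text matches, else None.

    Hypothetical helper: matching is case-insensitive and ignores surrounding
    whitespace; min_confidence is an assumed cutoff, not a documented default.
    """
    wanted = text.strip().lower()
    for element in response.get("elements", []):
        if element["confidence"] < min_confidence:
            continue  # skip low-confidence recognitions
        if element["text"].strip().lower() == wanted:
            return element["tapX"], element["tapY"]
    return None


response = json.loads(SAMPLE_RESPONSE)
target = find_tap_target(response, "login")
```

With the sample data, `target` is `(187, 420)`, the same coordinates passed to tap above. Returning `None` for a missing label lets the caller decide whether to retry, scroll, or take a screenshot.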

OCR Engine

OCR uses the Google Cloud Vision API via a cloud proxy for fast, accurate text recognition (roughly 97% accuracy or better, about 0.5s processing time). This works out of the box with no local dependencies.

Screenshots are sent over HTTPS to our cloud endpoint for processing and are deleted immediately after recognition; no images are stored.

Offline Fallback (EasyOCR)

If the cloud endpoint is unreachable (no internet, timeout), OCR falls back to local EasyOCR (Python-based). This requires Python 3.6+:

# macOS
brew install python@3.11

# Ubuntu/Debian
sudo apt install python3

EasyOCR and its Python dependencies are installed automatically by node-easyocr. The local fallback is slower (~2-3s) and less accurate (~85-90%) but works offline.
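
The fallback behavior described above can be pictured with the following sketch. It is not the tool's actual implementation: `recognize_with_fallback` and both engine callables are hypothetical stand-ins, and it assumes that an unreachable endpoint surfaces as `ConnectionError` or `TimeoutError`.

```python
from typing import Callable


def recognize_with_fallback(
    image: bytes,
    cloud_ocr: Callable[[bytes], dict],
    local_ocr: Callable[[bytes], dict],
) -> dict:
    """Try the cloud engine first; use the local engine if it is unreachable.

    Both callables are hypothetical. Only connectivity failures trigger the
    fallback; any other exception propagates to the caller.
    """
    try:
        return cloud_ocr(image)
    except (ConnectionError, TimeoutError):
        return local_ocr(image)


# Stand-in engines for demonstration only.
def offline_cloud(image: bytes) -> dict:
    raise ConnectionError("cloud endpoint unreachable")


def working_cloud(image: bytes) -> dict:
    return {"engine": "cloud", "elements": []}


def fake_local(image: bytes) -> dict:
    # The "easyocr" engine label is an assumption; the docs only show "cloud".
    return {"engine": "easyocr", "elements": []}


result = recognize_with_fallback(b"png-bytes", offline_cloud, fake_local)
```

Catching only connectivity errors keeps genuine bugs (for example, a malformed response) visible instead of silently routing them to the slower local engine.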

OCR Language Configuration

Google Cloud Vision automatically detects and recognizes text in most languages without configuration.

For the offline EasyOCR fallback, set EASYOCR_LANGUAGES to add language support:

EASYOCR_LANGUAGES=es,fr
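
For illustration, a comma-separated value like this could be turned into a language list as sketched below. The `parse_easyocr_languages` helper is hypothetical, and it assumes that English is always enabled and that the variable only adds to it; the tool's real parsing may differ.

```python
import os
from typing import List, Mapping


def parse_easyocr_languages(env: Mapping[str, str] = os.environ) -> List[str]:
    """Split EASYOCR_LANGUAGES into a de-duplicated list of language codes.

    Assumption: "en" is always present and the variable adds extra languages.
    """
    raw = env.get("EASYOCR_LANGUAGES", "")
    languages = ["en"]
    for code in raw.split(","):
        code = code.strip().lower()
        if code and code not in languages:  # ignore blanks and duplicates
            languages.append(code)
    return languages


languages = parse_easyocr_languages({"EASYOCR_LANGUAGES": "es,fr"})
```

Under these assumptions, `es,fr` yields `["en", "es", "fr"]`, which is the shape of list that EasyOCR's `Reader` accepts as its language argument.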

Recommended Workflow

  1. Use unified tap - Handles fallback chain automatically
  2. Fall back to OCR - When tap suggests using coordinates
  3. Use screenshot - For visual debugging or layout verification

# Simplest approach: tap handles everything
tap with text="Submit"

# If tap fails, use OCR to find coordinates
ocr_screenshot with platform="android"

# Then tap using coordinates from OCR result
tap with x=540 y=1200
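
The same workflow can be written as a small driver, sketched here in Python. The three callables (`tap_by_text`, `run_ocr`, `tap_at`) are hypothetical wrappers around the tap and ocr_screenshot tools, not functions the tools provide.

```python
from typing import Callable, List, Tuple


def tap_with_ocr_fallback(
    text: str,
    tap_by_text: Callable[[str], bool],
    run_ocr: Callable[[], List[dict]],
    tap_at: Callable[[int, int], None],
) -> str:
    """Tap by text; if that fails, find the text via OCR and tap its coordinates.

    Returns "tap", "ocr", or "not_found" to report which path was taken.
    """
    if tap_by_text(text):
        return "tap"
    for element in run_ocr():
        if element["text"] == text:
            tap_at(element["tapX"], element["tapY"])
            return "ocr"
    return "not_found"


# Stand-ins that simulate a failed text tap followed by a successful OCR lookup.
taps: List[Tuple[int, int]] = []
outcome = tap_with_ocr_fallback(
    "Submit",
    tap_by_text=lambda _text: False,
    run_ocr=lambda: [{"text": "Submit", "tapX": 540, "tapY": 1200}],
    tap_at=lambda x, y: taps.append((x, y)),
)
```

A `"not_found"` result is the point at which step 3, taking a screenshot for visual debugging, becomes useful.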