The ocr_screenshot tool extracts all visible text from a screenshot with tap-ready coordinates. This is useful when accessibility labels are missing or when you need to find text that isn't exposed in the accessibility tree.
Note: Many iOS interaction tools (swipe, text input, accessibility queries) require IDB. See the Platform Setup section for installation instructions.
| Approach | Pros | Cons |
|---|---|---|
| Accessibility tree (find_element) | Fast, reliable, low token usage | Only finds elements with accessibility labels |
| Screenshot + Vision | Visual layout understanding | High token usage, slow |
| OCR | Works on ANY visible text, returns tap coordinates | Requires text to be visible, may miss small text |
ocr_screenshot with platform="ios"
Returns all visible text with tap-ready coordinates:
{
"platform": "ios",
"engine": "cloud",
"processingTimeMs": 550,
"elementCount": 24,
"elements": [
{ "text": "Settings", "confidence": 95, "tapX": 195, "tapY": 52 },
{ "text": "Login", "confidence": 95, "tapX": 187, "tapY": 420 }
]
}

Then tap the element:
tap with x=187 y=420
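
The step between the OCR result and the tap call is a lookup by text. The following is a minimal Python sketch of that lookup, assuming the result has the JSON shape shown above. `find_tap_point` and its `min_confidence` threshold are hypothetical names used for illustration; they are not part of the tool.

```python
import json
from typing import Optional, Tuple


def find_tap_point(ocr_result: dict, text: str,
                   min_confidence: float = 80.0) -> Optional[Tuple[int, int]]:
    """Return (tapX, tapY) for the best match of `text`, or None if absent.

    Hypothetical helper: assumes the result shape shown above.
    """
    wanted = text.strip().lower()
    # Keep elements whose text matches (case-insensitive) and that meet the threshold.
    matches = [
        el for el in ocr_result.get("elements", [])
        if el.get("text", "").strip().lower() == wanted
        and el.get("confidence", 0) >= min_confidence
    ]
    if not matches:
        return None
    # Prefer the highest-confidence match when the text appears more than once.
    best = max(matches, key=lambda el: el.get("confidence", 0))
    return best["tapX"], best["tapY"]


sample = json.loads("""
{
  "platform": "ios",
  "elements": [
    {"text": "Settings", "confidence": 95, "tapX": 195, "tapY": 52},
    {"text": "Login", "confidence": 95, "tapX": 187, "tapY": 420}
  ]
}
""")

print(find_tap_point(sample, "login"))  # (187, 420)
```

The returned pair maps directly onto the `x` and `y` arguments of `tap`. A `None` result means the text was not recognized at the requested confidence.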
OCR uses Google Cloud Vision API via a cloud proxy for fast, accurate text recognition (~97%+ accuracy, ~0.5s processing time). This works out of the box with no local dependencies.
Screenshots are sent over HTTPS to our cloud endpoint for processing and immediately deleted after recognition — no images are stored.
If the cloud endpoint is unreachable (no internet connection or a timeout), OCR falls back to local EasyOCR (Python-based). The fallback requires Python 3.6+:
# macOS
brew install python@3.11
# Ubuntu/Debian
sudo apt install python3

EasyOCR and its Python dependencies are installed automatically by node-easyocr. The local fallback is slower (~2-3s) and less accurate (~85-90%) but works offline.
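
The cloud-then-local behavior described above can be illustrated with a short Python sketch. It is not the tool's actual implementation: `recognize_with_fallback`, `cloud_ocr`, and `local_ocr` are hypothetical stand-ins, and the sketch assumes an unreachable endpoint surfaces as a `ConnectionError` or `TimeoutError`.

```python
from typing import Callable, Tuple


def recognize_with_fallback(
    image: bytes,
    cloud_ocr: Callable[[bytes], dict],
    local_ocr: Callable[[bytes], dict],
) -> Tuple[str, dict]:
    """Try the cloud engine first; fall back to the local one on network errors.

    Both callables are hypothetical stand-ins, not the tool's real internals.
    """
    try:
        return "cloud", cloud_ocr(image)
    except (ConnectionError, TimeoutError):
        # Unreachable endpoint or timeout: use the slower offline engine.
        return "local", local_ocr(image)


def unreachable_cloud(image: bytes) -> dict:
    # Simulates a machine with no internet access.
    raise ConnectionError("no internet")


def fake_local(image: bytes) -> dict:
    # Simulates a local recognizer returning a lower-confidence result.
    return {"elements": [{"text": "Login", "confidence": 88}]}


engine, result = recognize_with_fallback(b"png-bytes", unreachable_cloud, fake_local)
print(engine)  # local
```

Returning the engine name alongside the result lets a caller see which path was taken, in the same spirit as the `engine` field in the response shown earlier.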
Google Cloud Vision automatically detects and recognizes text in most languages without configuration.
For the offline EasyOCR fallback, set EASYOCR_LANGUAGES to add language support:
EASYOCR_LANGUAGES=es,fr

- Use unified `tap` - Handles fallback chain automatically
- Fall back to OCR - When `tap` suggests using coordinates
- Use screenshot - For visual debugging or layout verification
# Simplest approach — tap handles everything
tap with text="Submit"
# If tap fails, use OCR to find coordinates
ocr_screenshot with platform="android"
# Then tap using coordinates from OCR result
tap with x=540 y=1200
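
The workflow above (tap by text first, then OCR, then tap by coordinates) can be sketched as a single Python function. The three callables are hypothetical placeholders for however a client invokes `tap` and `ocr_screenshot`; their names and signatures are assumptions, not the tool's API.

```python
from typing import Callable, Optional


def tap_by_text(
    text: str,
    tap_text: Callable[[str], bool],
    run_ocr: Callable[[], dict],
    tap_xy: Callable[[int, int], None],
) -> Optional[str]:
    """Tap `text` and return the strategy that worked, or None if none did."""
    # Step 1: the simplest approach, a text-based tap.
    if tap_text(text):
        return "tap"
    # Step 2: if that fails, look the text up in the OCR result.
    for el in run_ocr().get("elements", []):
        if el.get("text") == text:
            # Step 3: tap using the coordinates from the OCR result.
            tap_xy(el["tapX"], el["tapY"])
            return "ocr"
    return None


# Fake callables standing in for real tool calls.
taps = []
strategy = tap_by_text(
    "Submit",
    tap_text=lambda t: False,  # pretend the text-based tap failed
    run_ocr=lambda: {"elements": [{"text": "Submit", "tapX": 540, "tapY": 1200}]},
    tap_xy=lambda x, y: taps.append((x, y)),
)
print(strategy, taps)  # ocr [(540, 1200)]
```

Passing the tool calls in as functions keeps the sketch independent of any particular client, and makes the fallback order easy to exercise with fakes.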