Multi-backend web scraping orchestrator for CortexPrism. Agents can scrape URLs, crawl sites, search the web, extract structured data with AI-powered schemas, monitor pages for changes, and export datasets. Supports Firecrawl, Tavily, Apify, Bright Data, Oxylabs, Jina AI, Brave, and Exa with automatic failover across configured backends.
# From marketplace
cortex plugin install marketplace:cortex-plugin-web-scraper
# From GitHub (for development)
cortex plugin install github:CortexPrism/cortex-plugin-web-scraper
# Local installation (for development)
cortex plugin install ./manifest.json

This plugin supports multiple scraping and search backends. Configure API keys in plugin settings:

| Backend | Key Setting | Get Key At | Purpose |
|---|---|---|---|
| Firecrawl | firecrawlApiKey | https://firecrawl.dev | Scraping, crawling, searching |
| Tavily | tavilyApiKey | https://tavily.com | Web search |
| Apify | apifyApiToken | https://apify.com | Scraping, crawling |
| Bright Data | brightdataApiKey | https://brightdata.com | Proxy scraping |
| Oxylabs | oxylabsApiKey | https://oxylabs.io | Proxy scraping |
| Jina AI | jinaApiKey | https://jina.ai | AI reader, extraction |
| Brave Search | braveApiKey | https://brave.com/search/api/ | Web search |
| Exa | exaApiKey | https://exa.ai | Web search |

Without API keys, the scrape_url and scrape_crawl tools still work using basic HTTP fetching.
Search tools require an API key for the chosen engine.
cortex plugin config set cortex-plugin-web-scraper firecrawlApiKey "fc-xxxxx"
cortex plugin config set cortex-plugin-web-scraper tavilyApiKey "tvly-xxxxx"

Open Cortex settings, navigate to the cortex-plugin-web-scraper section, and fill in the API Keys and General settings fields.

# List available tools
cortex tools list --plugin cortex-plugin-web-scraper
# Use in an agent session
cortex chat --plugin cortex-plugin-web-scraper

| Setting | Type | Default | Description |
|---|---|---|---|
| firecrawlApiKey | secret | (none) | Firecrawl API key |
| tavilyApiKey | secret | (none) | Tavily API key |
| apifyApiToken | secret | (none) | Apify API token |
| brightdataApiKey | secret | (none) | Bright Data API key |
| oxylabsApiKey | secret | (none) | Oxylabs API key |
| jinaApiKey | secret | (none) | Jina AI Reader API key |
| braveApiKey | secret | (none) | Brave Search API key |
| exaApiKey | secret | (none) | Exa search API key |
| defaultMaxPages | number | 10 | Default max pages for crawl operations |
| userAgent | text | CortexPrism-WebScraper/1.1.0 | User-Agent header for requests |
| requestDelayMs | number | 1000 | Delay between requests (ms) |

Scrape a single URL and extract content.
Parameters:
- url (string, required): The URL to scrape (HTTP or HTTPS)
- format (string, optional, default: "text"): Output format, one of "text", "markdown", "html", or "screenshot"
- selector (string, optional): CSS selector to extract specific content
- include_metadata (boolean, optional, default: true): Include page metadata
- backend (string, optional, default: "auto"): Preferred backend, one of "auto", "firecrawl", "apify", "brightdata", "oxylabs", "jina"
- timeout_seconds (number, optional, default: 30): Request timeout

Example:
cortex tool call scrape_url --url https://example.com --format markdown

Example with selector:
cortex tool call scrape_url --url https://example.com --selector ".main-content"

Response includes: URL, format, content, backend_used, metadata (title, description, OG tags, headings), extracted tables, and lists.

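
The "auto" backend setting and the backend_used response field suggest a try-in-order loop across configured backends. The following is a minimal sketch of that idea, not the plugin's actual implementation: the names `backendOrder`, `scrapeWithFailover`, and `Scraper`, and the "http" label for basic HTTP fetching, are assumptions made for illustration.

```typescript
// Backend names follow the scrape_url parameter list; "http" is a hypothetical
// label for the keyless basic-HTTP fallback.
type Backend = "firecrawl" | "apify" | "brightdata" | "oxylabs" | "jina" | "http";
type Scraper = (url: string) => Promise<string>;

// Preferred backend first (if configured), then the other configured
// backends, then plain HTTP as the last resort.
function backendOrder(preferred: Backend | "auto", configured: Backend[]): Backend[] {
  const order: Backend[] = [];
  if (preferred !== "auto" && configured.includes(preferred)) order.push(preferred);
  for (const backend of configured) {
    if (!order.includes(backend)) order.push(backend);
  }
  if (!order.includes("http")) order.push("http");
  return order;
}

// Try each backend in turn and report which one produced the content.
async function scrapeWithFailover(
  url: string,
  order: Backend[],
  scrapers: Partial<Record<Backend, Scraper>>,
): Promise<{ content: string; backend_used: Backend }> {
  const errors: string[] = [];
  for (const backend of order) {
    const scraper = scrapers[backend];
    if (!scraper) continue; // no implementation registered for this backend
    try {
      return { content: await scraper(url), backend_used: backend };
    } catch (err) {
      errors.push(`${backend}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All backends failed: ${errors.join("; ")}`);
}
```

Collecting the per-backend error messages keeps the final failure explainable when every backend is exhausted.
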
Crawl a website starting from a URL, following links up to a configurable depth and page limit.
Parameters:
- start_url (string, required): Starting URL for the crawl
- max_pages (number, optional, default: 10): Maximum pages to crawl
- max_depth (number, optional, default: 2): Maximum crawl depth
- same_domain (boolean, optional, default: true): Only follow same-domain links
- path_pattern (string, optional): URL path regex pattern to filter followed links
- selector (string, optional): CSS selector to extract from each page
- backend (string, optional, default: "auto"): Preferred backend, one of "auto", "firecrawl", "apify"

Example:
cortex tool call scrape_crawl --start_url https://docs.example.com --max_pages 20 --max_depth 3

Example with path filtering:
cortex tool call scrape_crawl --start_url https://example.com/blog --path_pattern "^/blog/" --max_pages 50

Response includes: Array of crawled pages with URL, depth, title, content, and content length.

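
How max_pages, max_depth, same_domain, and path_pattern interact can be illustrated with a breadth-first traversal. This is a sketch under assumptions, not the plugin's code: `shouldFollow`, `planCrawl`, and the injected `getLinks` callback are hypothetical names, and link discovery is stubbed out so the example runs without network access.

```typescript
interface CrawlOptions {
  maxPages: number;
  maxDepth: number;
  sameDomain: boolean;
  pathPattern?: string; // regex applied to the URL path of followed links
}

// Decide whether an absolute link should be queued.
function shouldFollow(link: string, startUrl: string, opts: CrawlOptions): boolean {
  const target = new URL(link);
  if (target.protocol !== "http:" && target.protocol !== "https:") return false;
  if (opts.sameDomain && target.hostname !== new URL(startUrl).hostname) return false;
  if (opts.pathPattern && !new RegExp(opts.pathPattern).test(target.pathname)) return false;
  return true;
}

// Breadth-first crawl plan. getLinks stands in for "fetch the page and
// collect its links" (an assumption made to keep the sketch offline).
function planCrawl(
  startUrl: string,
  opts: CrawlOptions,
  getLinks: (url: string) => string[],
): { url: string; depth: number }[] {
  const visited = new Set<string>([startUrl]);
  const queue = [{ url: startUrl, depth: 0 }];
  const pages: { url: string; depth: number }[] = [];
  while (queue.length > 0 && pages.length < opts.maxPages) {
    const page = queue.shift()!;
    pages.push(page);
    if (page.depth >= opts.maxDepth) continue; // do not expand past the depth limit
    for (const link of getLinks(page.url)) {
      let absolute: string;
      try {
        absolute = new URL(link, page.url).href; // resolve relative links
      } catch {
        continue; // skip malformed links
      }
      if (visited.has(absolute) || !shouldFollow(absolute, startUrl, opts)) continue;
      visited.add(absolute);
      queue.push({ url: absolute, depth: page.depth + 1 });
    }
  }
  return pages;
}
```

In this sketch the pattern filters followed links only, so a start URL such as https://example.com/blog is crawled even though its path does not match "^/blog/".
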
Search the web and extract structured results.
Parameters:
- query (string, required): Search query
- max_results (number, optional, default: 10): Max results to return
- engine (string, optional, default: "tavily"): Search engine, one of "tavily", "firecrawl", "brave", or "exa"
- scrape_results (boolean, optional, default: false): Also scrape each result page for full content

Example:
cortex tool call scrape_search --query "latest AI research papers 2026" --engine tavily --max_results 5

Note: Requires an API key configured for the chosen engine. Without an API key, a warning is returned.

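
The key requirement above amounts to a lookup from engine name to its key setting. A minimal sketch follows, assuming a hypothetical `checkSearchKey` helper and an illustrative warning message; only the engine names and setting names come from this document.

```typescript
type SearchEngine = "tavily" | "firecrawl" | "brave" | "exa";

// Engine-to-setting mapping, taken from the settings table in this README.
const ENGINE_KEY_SETTING: Record<SearchEngine, string> = {
  tavily: "tavilyApiKey",
  firecrawl: "firecrawlApiKey",
  brave: "braveApiKey",
  exa: "exaApiKey",
};

type KeyCheck =
  | { ok: true; apiKey: string }
  | { ok: false; warning: string };

// Return the key for the chosen engine, or a warning instead of throwing.
// The warning wording is illustrative, not the plugin's actual message.
function checkSearchKey(
  engine: SearchEngine,
  config: Record<string, string | undefined>,
): KeyCheck {
  const setting = ENGINE_KEY_SETTING[engine];
  const apiKey = config[setting]?.trim();
  if (!apiKey) {
    return {
      ok: false,
      warning: `No API key configured for "${engine}". Set ${setting} in plugin settings.`,
    };
  }
  return { ok: true, apiKey };
}
```
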
Extract structured data from a URL following a JSON schema.
Parameters:
- url (string, required): URL to extract from
- schema (string, required): JSON schema string defining the extraction structure
- multiple (boolean, optional, default: false): Extract multiple matching items
- backend (string, optional, default: "auto"): Extraction backend, one of "auto", "firecrawl", "jina", "apify"

Example:
cortex tool call scrape_extract_schema \
--url https://example.com/products \
--schema '{"properties":{"name":{"type":"string"},"price":{"type":"string"},"description":{"type":"string"}}}' \
--multiple true

Extraction strategies (tried in order):
- <meta> tags matching schema keys
- og: meta tags
- itemprop attributes
- JSON-LD structured data
- Heading-based section extraction
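
The "tried in order" behavior can be pictured as a chain of strategies in which the first one to return a value wins. The sketch below covers only the first two strategies and uses regular expressions instead of a real HTML parser, which is a deliberate simplification; `metaContent`, `strategies`, and `extractBySchema` are hypothetical names, not the plugin's API.

```typescript
type Strategy = (html: string, key: string) => string | undefined;

function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Read content="..." from a <meta> tag whose attribute matches.
// Simplification: assumes the attribute appears before content.
function metaContent(html: string, attr: "name" | "property", value: string): string | undefined {
  const pattern = new RegExp(
    `<meta[^>]*\\b${attr}=["']${escapeRegExp(value)}["'][^>]*\\bcontent=["']([^"']*)["']`,
    "i",
  );
  return html.match(pattern)?.[1];
}

// Ordered strategy chain; the remaining strategies from the list above
// (itemprop, JSON-LD, headings) would be appended here.
const strategies: Strategy[] = [
  (html, key) => metaContent(html, "name", key),
  (html, key) => metaContent(html, "property", `og:${key}`),
];

// For each schema key, use the first strategy that yields a value.
function extractBySchema(
  html: string,
  schema: { properties: Record<string, unknown> },
): Record<string, string | null> {
  const result: Record<string, string | null> = {};
  for (const key of Object.keys(schema.properties)) {
    let value: string | undefined;
    for (const strategy of strategies) {
      value = strategy(html, key);
      if (value !== undefined) break;
    }
    result[key] = value ?? null;
  }
  return result;
}
```
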
Check the health, rate-limit status, and remaining quota of all configured backends.
Parameters: None
Example:
cortex tool call scrape_status

Response includes: Per-backend status (configured, reachable), summary of configured/reachable counts, and a recommended backend.

Monitor a URL for changes by recording a content hash baseline.
Parameters:
- url (string, required): URL to monitor
- interval_hours (number, optional, default: 24): Suggested check interval
- selector (string, optional): Monitor only a specific part of the page

Example:
cortex tool call scrape_monitor --url https://example.com/status --selector ".status-banner"

Response includes: Changed flag, previous/current content hashes, last check timestamps.

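
Change detection by content hash means storing one hash per URL and comparing it on the next check. The sketch below assumes an in-memory baseline store and an FNV-1a-style hash over UTF-16 code units, chosen only because it is short and synchronous. This README does not state which hash algorithm or storage the plugin uses, and the result field names are illustrative.

```typescript
interface MonitorResult {
  changed: boolean;
  previous_hash: string | null;
  current_hash: string;
  checked_at: string;
}

// FNV-1a-style 32-bit hash (assumption: stands in for the plugin's hash).
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Hypothetical in-memory baseline store keyed by URL.
const baselines = new Map<string, string>();

// The first check records a baseline; later checks compare against it.
function checkForChange(url: string, content: string): MonitorResult {
  const current = fnv1a(content);
  const previous = baselines.get(url) ?? null;
  baselines.set(url, current);
  return {
    changed: previous !== null && previous !== current,
    previous_hash: previous,
    current_hash: current,
    checked_at: new Date().toISOString(),
  };
}
```

With a selector, the same comparison would run on the extracted fragment rather than the whole page, so unrelated page changes do not trigger a report.
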
Export scraped data to a format.
Parameters:
- format (string, optional, default: "json"): "json", "csv", or "markdown"
- data (string, required): JSON array string of scraped items

Example:
cortex tool call scrape_export --format csv --data '[{"name":"Item 1","price":"$10"},{"name":"Item 2","price":"$20"}]'

Output formats:
- json: Pretty-printed JSON array
- csv: RFC 4180-compatible CSV with headers
- markdown: GitHub-flavored markdown table
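
For the csv format, RFC 4180 compatibility comes down to quoting fields that contain commas, double quotes, or line breaks, doubling embedded quotes, and separating records with CRLF. A minimal sketch follows, assuming hypothetical `csvField` and `toCsv` helpers; the plugin's own serializer may differ in details such as column ordering.

```typescript
// Quote a field only when RFC 4180 requires it; embedded quotes are doubled.
function csvField(value: unknown): string {
  const text = value === null || value === undefined ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Build a CSV document whose header row is the union of all item keys,
// in first-seen order (an assumption about column ordering).
function toCsv(items: Record<string, unknown>[]): string {
  const headers: string[] = [];
  for (const item of items) {
    for (const key of Object.keys(item)) {
      if (!headers.includes(key)) headers.push(key);
    }
  }
  const rows = items.map((item) => headers.map((h) => csvField(item[h])).join(","));
  return [headers.map((h) => csvField(h)).join(","), ...rows].join("\r\n");
}
```

Items that lack a key produce an empty field, so rows stay aligned with the header.
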
This plugin declares:
- tools (core plugin capability)
- network:fetch (makes HTTP/HTTPS requests to scrape web pages and call APIs)

# Install dependencies
deno cache mod.ts
# Run tests
deno task test
# Format code
deno fmt
# Lint
deno lint

# Run all tests
deno task test
# Run specific test
deno test --allow-all test/unit/mod.test.ts --filter "scrape_url"
# Run with coverage
deno test --coverage=.coverage --allow-all test/

deno task validate

Do:
- Validate all tool parameters before use
- Handle errors gracefully with try-catch
- Return ToolCallResult with success, output/error, and durationMs
- Respect requestDelayMs between crawl requests
- Use AbortSignal.timeout for all HTTP requests
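
Several of these guidelines can be combined in one small handler. The sketch below is an illustration under assumptions: the `ToolCallResult` field types are inferred from the names listed above, and `fetchPage` with its injectable `fetchFn` parameter is a hypothetical helper, not part of the plugin.

```typescript
// Field names come from the guideline above; the types are an assumption.
interface ToolCallResult {
  success: boolean;
  output?: string;
  error?: string;
  durationMs: number;
}

// Validates the URL, fetches with a timeout, and reports failures in the
// result instead of throwing.
async function fetchPage(
  url: string,
  timeoutSeconds = 30,
  fetchFn: typeof fetch = fetch, // injectable for testing
): Promise<ToolCallResult> {
  const started = Date.now();
  try {
    const parsed = new URL(url); // throws on malformed input
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
      throw new Error(`Unsupported protocol: ${parsed.protocol}`);
    }
    const response = await fetchFn(parsed.href, {
      signal: AbortSignal.timeout(timeoutSeconds * 1000),
    });
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    const output = await response.text();
    return { success: true, output, durationMs: Date.now() - started };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    return { success: false, error, durationMs: Date.now() - started };
  }
}
```

Because every failure path returns a result object, callers can branch on success without wrapping each call in their own try-catch.
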
Don't:
- Hardcode API keys or secrets (use plugin config)
- Request overly broad permissions
- Ignore errors or timeouts
- Crawl without respecting robots.txt conventions
Ensure the URL is accessible and returns HTML content. Some sites block programmatic access.
Configure the appropriate API key (tavilyApiKey, firecrawlApiKey, braveApiKey, or exaApiKey)
in plugin settings to enable search engine API calls.
Reduce max_pages, decrease max_depth, or lower requestDelayMs in plugin settings.
MIT — See LICENSE file.
See CONTRIBUTING.md for development standards.