[Performance] Hybrid Content extraction with Jina AI & Direct Scraper #8

Label: performance, backend

Description: Content extraction (fetching full page text) is a bottleneck. Using r.jina.ai is highly effective but can be slow or encounter rate limits. Our direct scraper (extractTextFromHtml) is faster but often blocked.

Suggested Approach:

Implement a "Hybrid Race" strategy: start the Jina AI request and the direct fetch simultaneously, and take the first one that returns 1000+ characters of clean text.
Implement a simple LRU cache for extracted content to avoid re-fetching the same URLs in follow-up queries.
Optimize the boilerplateSignals list to better clean up modern cookie banners and "Subscribe" overlays.
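
A minimal sketch of the first two points, for discussion only. All names here (Extractor, raceExtractors, LruCache, extractContent, MIN_CHARS, the capacity of 200) are hypothetical and not taken from the codebase. It assumes each extraction path can be wrapped as an async function that accepts an AbortSignal, and it uses plain string length as a stand-in for "clean text".

```typescript
// Hypothetical shape: one extractor per path (Jina AI, direct scraper).
type Extractor = (url: string, signal: AbortSignal) => Promise<string>;

const MIN_CHARS = 1000; // threshold from the proposal above

// Start every extractor at once and resolve with the first result that has
// at least minChars characters. Rejects only when all of them fail or
// return too little text.
function raceExtractors(
  url: string,
  extractors: Extractor[],
  minChars: number = MIN_CHARS,
): Promise<string> {
  const controller = new AbortController();
  return new Promise<string>((resolve, reject) => {
    let pending = extractors.length;
    if (pending === 0) {
      reject(new Error("no extractors configured"));
      return;
    }
    for (const extract of extractors) {
      Promise.resolve()
        .then(() => extract(url, controller.signal))
        .then((text) => {
          if (text.length < minChars) {
            throw new Error(`only ${text.length} characters extracted`);
          }
          controller.abort(); // signal the slower request to stop
          resolve(text);
        })
        .catch(() => {
          pending -= 1;
          if (pending === 0) {
            reject(new Error(`all extractors failed for ${url}`));
          }
        });
    }
  });
}

// Simple LRU cache built on Map, which iterates in insertion order.
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    this.entries.delete(key);
    this.entries.set(key, value); // re-insert to mark as most recently used
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }
}

const contentCache = new LruCache<string>(200); // capacity is an arbitrary guess

// Check the cache first, then fall back to the race and store the winner.
async function extractContent(url: string, extractors: Extractor[]): Promise<string> {
  const cached = contentCache.get(url);
  if (cached !== undefined) return cached;
  const text = await raceExtractors(url, extractors);
  contentCache.set(url, text);
  return text;
}
```

In practice one extractor would wrap the r.jina.ai request and the other the direct fetch followed by extractTextFromHtml, each passing the signal through to its HTTP call so the slower path is cancelled once a winner is found. Failed extractions are deliberately not cached, so a blocked URL can be retried in a follow-up query.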
