Label: performance, backend
Description: Content extraction (fetching full page text) is a bottleneck. Using r.jina.ai is highly effective but can be slow or encounter rate limits. Our direct scraper (extractTextFromHtml) is faster but often blocked.
Suggested Approach:
Implement a "Hybrid Race" strategy: Trigger Jina AI and Direct Fetching simultaneously, and take the first one that returns 1000+ characters of clean text.
Implement a simple LRU cache for extracted content to avoid re-fetching the same URLs in follow-up queries.
Optimize the boilerplateSignals list to better clean up modern cookie banners and "Subscribe" overlays.
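A minimal sketch of the first two points, to make the proposal concrete. Everything named here (Fetcher, raceExtract, LruCache, extractCached) is hypothetical and not existing code in this repo; each fetcher is assumed to wrap either the r.jina.ai call or the direct fetch plus extractTextFromHtml and to resolve to clean text. The 1000-character threshold comes from the description above, and the slower request is simply ignored rather than cancelled.

```typescript
// Hypothetical fetcher: resolves to the clean text extracted for a URL.
type Fetcher = (url: string) => Promise<string>;

const MIN_CHARS = 1000; // threshold taken from the issue text

// Start every fetcher at once and resolve with the first result that has at
// least `minChars` characters. Resolves to null if all of them fail or
// return too little text.
function raceExtract(
  url: string,
  fetchers: Fetcher[],
  minChars: number = MIN_CHARS,
): Promise<string | null> {
  return new Promise<string | null>((resolve) => {
    let pending = fetchers.length;
    if (pending === 0) {
      resolve(null);
      return;
    }
    const giveUpIfLast = (): void => {
      pending -= 1;
      if (pending === 0) resolve(null);
    };
    fetchers.forEach((fetcher) => {
      // Promise.resolve().then(...) also turns synchronous throws into rejections.
      Promise.resolve()
        .then(() => fetcher(url))
        .then((text) => {
          if (text.length >= minChars) resolve(text); // first good result wins
          else giveUpIfLast();
        }, giveUpIfLast);
    });
  });
}

// Small LRU cache built on Map, which iterates keys in insertion order.
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert so the key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}

// Check the cache first, then fall back to the race. Only successful
// extractions are cached, so a failed URL is retried on the next query.
async function extractCached(
  url: string,
  cache: LruCache<string>,
  fetchers: Fetcher[],
): Promise<string | null> {
  const hit = cache.get(url);
  if (hit !== undefined) return hit;
  const text = await raceExtract(url, fetchers);
  if (text !== null) cache.set(url, text);
  return text;
}
```

Keying the cache on the raw URL is the simplest choice; normalizing URLs (for example stripping tracking parameters) would raise the hit rate but is left out of this sketch.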