A FastAPI-based service that provides endpoints for web scraping (using Selenium with stealth capabilities), generating text embeddings, and reranking documents based on relevance.
- Web Scraper:
  - Headless Chrome scraping using Selenium.
  - Stealth mode to avoid bot detection (`selenium-stealth`).
  - Proxy support with random rotation from `proxies.txt`.
  - Human-like interaction (scrolling, random delays).
  - Extract page title, description, content, links, and optionally download/base64-encode images.
- Embeddings:
  - Generate high-quality text embeddings using `jinaai/jina-embeddings-v3`.
  - Supports GPU acceleration if CUDA is available.
- Reranker:
  - Rerank documents for better retrieval relevance using `jinaai/jina-reranker-v2-base-multilingual`.
  - Provides relevance scores for query-document pairs.
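The proxy rotation mentioned in the feature list can be sketched as follows. This is a minimal illustration, not the code in `scraper.py`: the helper names `load_proxies` and `pick_proxy` are hypothetical, and skipping `#` comment lines is an assumption about the file format.

```python
import random
from pathlib import Path
from typing import List, Optional


def load_proxies(path: str = "proxies.txt") -> List[str]:
    """Read one proxy per line, skipping blank lines and '#' comments (assumed format)."""
    file = Path(path)
    if not file.exists():
        return []
    lines = file.read_text(encoding="utf-8").splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.strip().startswith("#")]


def pick_proxy(proxies: List[str], rng: Optional[random.Random] = None) -> Optional[str]:
    """Return a randomly chosen proxy, or None when the list is empty."""
    if not proxies:
        return None
    return (rng or random).choice(proxies)
```

A proxy picked this way would typically be handed to Chrome through its `--proxy-server=` argument before the browser starts; returning `None` leaves room for a direct connection when no proxies are configured.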
- Python 3.8+
- Chrome Browser installed.
- ChromeDriver (ensure the path in `scraper.py` matches your local installation).
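Before starting the service, it may help to confirm that these prerequisites are in place. The sketch below uses only the standard library; `resolve_chromedriver` is a hypothetical helper and is not part of the repository. It first tries an explicitly configured path, then falls back to searching `PATH`.

```python
import os
import shutil
import sys
from typing import Optional


def python_version_ok() -> bool:
    """Check the documented minimum of Python 3.8."""
    return sys.version_info >= (3, 8)


def resolve_chromedriver(configured_path: Optional[str] = None) -> Optional[str]:
    """Return a usable ChromeDriver path, or None if none can be found."""
    # Prefer the explicitly configured location when it points at a real file.
    if configured_path and os.path.isfile(configured_path):
        return configured_path
    # Otherwise look for an executable named "chromedriver" on PATH.
    return shutil.which("chromedriver")
```

If `resolve_chromedriver` returns `None`, the path configured in `scraper.py` most likely needs to be corrected.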
- Clone the repository:

      git clone <repository-url>
      cd scraper-embedding-reranker
- Install dependencies:

      pip install -r requirements.txt
- Configure Proxies:
  - Add your proxies to `proxies.txt` (one per line).
- Update ChromeDriver Path:
  - Edit `scraper.py` and set `self.path_to_chromedriver` to your `chromedriver.exe` location.
Start the server:

    python app.py

The server will run on http://0.0.0.0:1000.
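Because `0.0.0.0` means the server listens on all interfaces, a local client would normally connect through `127.0.0.1` or `localhost`. A quick way to check that the port is accepting connections is sketched below; `is_port_open` is a hypothetical helper, and the defaults assume the server runs on the same machine.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 1000, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

This only confirms that something is listening on the port; it does not verify that the models have finished loading.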
- URL: `/scrape`
- Method: `POST`
- Payload:

      {
        "url": "https://example.com",
        "download_images": false,
        "get_html": false
      }
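A client call to `/scrape` might look like the sketch below, written with the standard library only. The base URL assumes the server is running locally on the documented port, and `build_scrape_request` and `send_json` are hypothetical helpers. The response is assumed to be JSON, which is FastAPI's default; its fields are not documented here, so the sketch does not interpret them.

```python
import json
import urllib.request

# Assumption: the service is running locally on the documented port.
BASE_URL = "http://localhost:1000"


def build_scrape_request(url, download_images=False, get_html=False, base_url=BASE_URL):
    """Build a POST request for /scrape with the documented payload fields."""
    payload = {"url": url, "download_images": download_images, "get_html": get_html}
    return urllib.request.Request(
        f"{base_url}/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_json(request, timeout=60.0):
    """Send the request and decode the body as JSON (assumed response format)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `send_json(build_scrape_request("https://example.com"))` would perform the request. A generous timeout is used because a headless browser with random delays can take a while to return.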
- URL: `/v1/embeddings`
- Method: `POST`
- Payload:

      {
        "model": "jina-embeddings-v3",
        "input": ["text to embed", "another text"],
        "dimensions": 1024
      }
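Since `input` takes a list, a large set of texts can be split into several smaller requests. The sketch below builds one payload per batch using the documented fields; `embedding_payloads` is a hypothetical helper, and the default batch size of 16 is an arbitrary assumption rather than a limit imposed by the service.

```python
from typing import Any, Dict, List, Sequence


def embedding_payloads(
    texts: Sequence[str],
    batch_size: int = 16,  # assumption: arbitrary default, tune for your hardware
    model: str = "jina-embeddings-v3",
    dimensions: int = 1024,
) -> List[Dict[str, Any]]:
    """Split texts into batches and build one /v1/embeddings payload per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        {"model": model, "input": list(texts[i:i + batch_size]), "dimensions": dimensions}
        for i in range(0, len(texts), batch_size)
    ]
```

Each payload can then be sent as the JSON body of a `POST` to `/v1/embeddings`. Smaller batches may help keep memory use predictable when the model runs on a GPU.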
- URL: `/v1/rerank`
- Method: `POST`
- Payload:

      {
        "model": "jina-reranker-v2-base-multilingual",
        "query": "What is the capital of France?",
        "documents": ["Paris is the capital of France.", "Berlin is the capital of Germany."],
        "top_n": 1
      }
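The sketch below builds a rerank payload and illustrates what `top_n` is assumed to mean: keep the `top_n` documents with the highest relevance scores, as is common in rerank APIs. Both helpers are hypothetical, and `top_n_by_score` only demonstrates the selection logic locally; it is not the service's implementation and does not compute scores.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple


def build_rerank_payload(
    query: str,
    documents: Sequence[str],
    top_n: Optional[int] = None,
    model: str = "jina-reranker-v2-base-multilingual",
) -> Dict[str, Any]:
    """Build a /v1/rerank payload, checking that top_n fits the document count."""
    if not documents:
        raise ValueError("documents must not be empty")
    if top_n is None:
        top_n = len(documents)
    if not 1 <= top_n <= len(documents):
        raise ValueError("top_n must be between 1 and len(documents)")
    return {"model": model, "query": query, "documents": list(documents), "top_n": top_n}


def top_n_by_score(
    documents: Sequence[str], scores: Sequence[float], top_n: int
) -> List[Tuple[str, float]]:
    """Pair documents with scores and keep the top_n highest-scoring pairs."""
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]
```

Validating `top_n` on the client side is a design choice made here to fail early; the service may handle out-of-range values differently.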