Scraper, Embedding, and Reranker Service

A FastAPI-based service that provides endpoints for web scraping (using Selenium with stealth capabilities), generating text embeddings, and reranking documents based on relevance.

Features

Web Scraper:
- Headless Chrome scraping using Selenium.
- Stealth mode to avoid bot detection (selenium-stealth).
- Proxy support with random rotation from proxies.txt.
- Human-like interaction (scrolling, random delays).
- Extract page title, description, content, links, and optionally download/base64-encode images.
Embeddings:
- Generate high-quality text embeddings using jinaai/jina-embeddings-v3.
- Supports GPU acceleration if CUDA is available.
Reranker:
- Rerank documents for better retrieval relevance using jinaai/jina-reranker-v2-base-multilingual.
- Provides relevance scores for query-document pairs.

Prerequisites

Python 3.8+
Chrome Browser installed.
ChromeDriver (ensure the path in scraper.py matches your local installation).

Installation

Clone the repository:

git clone <repository-url>
cd scraper-embedding-reranker

Install dependencies:
```
pip install -r requirements.txt
```
Configure Proxies:
- Add your proxies to proxies.txt (one per line).
Update ChromeDriver Path:
- Edit scraper.py and set self.path_to_chromedriver to your chromedriver.exe location.

Usage

Start the server:

python app.py

The server will run on http://0.0.0.0:1000.

API Endpoints

1. Scrape Website

URL: /scrape
Method: POST

Payload:

{
  "url": "https://example.com",
  "download_images": false,
  "get_html": false
}

2. Generate Embeddings

URL: /v1/embeddings
Method: POST

Payload:

{
  "model": "jina-embeddings-v3",
  "input": ["text to embed", "another text"],
  "dimensions": 1024
}

3. Rerank Documents

URL: /v1/rerank
Method: POST

Payload:

{
  "model": "jina-reranker-v2-base-multilingual",
  "query": "What is the capital of France?",
  "documents": ["Paris is the capital of France.", "Berlin is the capital of Germany."],
  "top_n": 1
}

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.gitignore		.gitignore
README.md		README.md
app.py		app.py
embedding.py		embedding.py
models.py		models.py
proxies.txt		proxies.txt
requirements.txt		requirements.txt
reranker.py		reranker.py
scraper.py		scraper.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Scraper, Embedding, and Reranker Service

Features

Prerequisites

Installation

Usage

API Endpoints

1. Scrape Website

2. Generate Embeddings

3. Rerank Documents

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Scraper, Embedding, and Reranker Service

Features

Prerequisites

Installation

Usage

API Endpoints

1. Scrape Website

2. Generate Embeddings

3. Rerank Documents

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages