A production-grade proof-of-concept (PoC) for an intelligent search and content extraction pipeline that leverages Large Language Models (LLMs) to gather, analyze, and rank web content based on user queries.
The Aggregator is designed to enhance information retrieval by:
- Semantic Query Expansion: Generating multiple variations of user queries to capture broader search intent
- Concurrent Web Scraping: Efficiently fetching and processing multiple web pages simultaneously
- LLM-Powered Content Extraction: Using AI to extract only the most relevant textual content from HTML
- Intelligent Relevance Grading: Scoring extracted content on a 1-10 scale with a detailed rubric
- Structured Output: Providing clean, organized JSON results for downstream processing
This PoC demonstrates the potential of AI-driven web aggregation for applications like research assistants, content curation, and automated knowledge discovery.
graph TD
A[User Query] --> B[Query Variation Generation]
B --> C[Concurrent Search via SearXNG]
C --> D[Async Web Scraping]
D --> E[LLM Content Extraction]
E --> F[Relevance Grading]
F --> G[Structured Results]
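The flow in the diagram can be read as a chain of stages. The following is a minimal, synchronous sketch of that chain, not the project's actual orchestrator: `run_pipeline` and the five stage callables (`expand`, `search`, `fetch`, `extract`, `grade`) are hypothetical names, and skipping URLs that were already seen is an assumption added for illustration.

```python
from typing import Callable, Dict, List


def run_pipeline(
    question: str,
    expand: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
    fetch: Callable[[str], str],
    extract: Callable[[str, str], str],
    grade: Callable[[str, str], int],
) -> List[Dict]:
    """Chain the diagram's stages; every callable is a hypothetical stand-in."""
    results = []
    seen = set()
    for variation in expand(question):       # Query Variation Generation
        for url in search(variation):        # Search
            if url in seen:                  # assumption: skip pages already handled
                continue
            seen.add(url)
            html = fetch(url)                # Web Scraping
            text = extract(question, html)   # LLM Content Extraction
            results.append({
                "url": url,
                "variation": variation,
                "grade": grade(question, text),  # Relevance Grading
                "text": text,
            })
    # Structured Results: highest grade first
    return sorted(results, key=lambda r: r["grade"], reverse=True)
```

Passing the stages in as callables keeps the sketch testable with simple fakes; the real pipeline runs the scraping stage asynchronously.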
- Pipeline Orchestrator: Coordinates the entire search and extraction workflow
- LLM Client: Handles all interactions with Ollama for query expansion, content extraction, and grading
- Search Service: Interfaces with SearXNG for privacy-focused web search
- Async Scraping Service: Concurrently fetches and processes web content
- HTML Processor: Cleans and optimizes HTML for LLM processing
- HTTP Clients: Robust synchronous and asynchronous HTTP handling with retry logic
- Multi-Variation Search: Generates semantic variations of queries for comprehensive coverage
- High Performance: Asynchronous processing with configurable concurrency limits
- AI-Powered Analysis: LLM-driven content extraction and relevance assessment
- Privacy-Focused: Uses SearXNG for decentralized search without tracking
- Robust Error Handling: Comprehensive exception handling and logging
- Structured Output: Clean JSON results with grading and metadata
- Highly Configurable: Environment-based configuration with sensible defaults
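To illustrate what the HTML Processor step might involve, here is a rough sketch that drops script and style content and caps the input size before text reaches the LLM. It uses only the standard library's `html.parser` rather than BeautifulSoup (which the project lists as a dependency), and `clean_html` and `_TextExtractor` are hypothetical names, not the project's API.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything inside script/style/noscript."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def clean_html(html: str, max_chars: int = 120000) -> str:
    """Truncate raw HTML to max_chars, then return its visible text."""
    parser = _TextExtractor()
    parser.feed(html[:max_chars])
    parser.close()
    return " ".join(parser.parts)
```

The default of 120000 mirrors the documented `MAX_HTML_CHARS` default; truncating before parsing is one possible ordering, chosen here for simplicity.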
This application talks to the LLM server through the OpenAI-compatible API, so it works with any provider or service that implements that interface. Popular options include:
- Ollama for local LLM hosting
- OpenAI official API
- PromptShield - a great alternative that provides access to potentially 500 models across 14 providers
To use a different provider, simply set the appropriate API endpoint and key in your configuration.
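As an illustration of switching providers, the sketch below builds the endpoint, key, and model from the documented `OLLAMA_*` variables and hands them to the `openai` Python package (version 1.0 or later is assumed). `llm_settings` and `make_client` are hypothetical helpers, not functions from this project.

```python
import os


def llm_settings(env=None):
    """Read endpoint, key, and model from the environment, using the documented defaults."""
    env = os.environ if env is None else env
    return {
        "base_url": env.get("OLLAMA_URL", "http://127.0.0.1:11434/v1"),
        "api_key": env.get("OLLAMA_API_KEY", "ollama"),
        "model": env.get("OLLAMA_MODEL", "a_local_model"),
    }


def make_client(settings):
    """Create an OpenAI-compatible client (assumes the openai>=1.0 package is installed)."""
    from openai import OpenAI  # imported lazily so llm_settings works without the package

    return OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])
```

Pointing `OLLAMA_URL` at another OpenAI-compatible endpoint and setting `OLLAMA_API_KEY` accordingly is then enough to change providers without touching code.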
- Python 3.8+
- An LLM server compatible with the OpenAI API (such as Ollama or any OpenAI-compatible provider)
- SearXNG instance (or compatible search API)
pip install requests openai beautifulsoup4 tenacity aiohttp

- Clone and navigate to the project:

    git clone https://github.com/PromptShieldLabs/aggregator.git
    cd aggregator

- Install dependencies:

    pip install -r requirements.txt

- Configure environment variables:

    export SEARXNG_URL="http://your-searxng-instance:port"
    export OLLAMA_URL="http://localhost:11434/v1"
    export OLLAMA_MODEL="your-preferred-model"
    export OLLAMA_API_KEY="your-api-key"
python aggregator.py "What are the key features of FastAPI?"
# Output to file
python aggregator.py "How does Python asyncio work?" --output results.json
# Configuration validation
python aggregator.py --config-check

from aggregator import Application, Config
config = Config.from_env()
app = Application(config)
question = "What are the benefits of microservices architecture?"
results = app.run(question)
for result in results:
    print(f"Title: {result['title']}")
    print(f"Grade: {result['grade']}/10")
    print(f"URL: {result['url']}")
    print("---")

The application uses environment variables for configuration:
| Variable | Description | Default |
|---|---|---|
| `SEARXNG_URL` | SearXNG instance URL | `http://127.0.0.1:8888` |
| `OLLAMA_URL` | Ollama API endpoint | `http://127.0.0.1:11434/v1` |
| `OLLAMA_MODEL` | LLM model name | `a_local_model` |
| `OLLAMA_API_KEY` | API key for Ollama | `ollama` |
| `HTTP_TIMEOUT` | Request timeout in seconds | `15` |
| `USER_AGENT` | HTTP User-Agent string | Chrome-like string |
| `MAX_HTML_CHARS` | Maximum HTML characters to process | `120000` |
| `LOG_LEVEL` | Logging verbosity | `INFO` |
| `NUM_VARIATIONS` | Number of query variations | `3` |
| `URLS_PER_VARIATION` | URLs to fetch per variation | `3` |
| `MAX_RETRIES` | Maximum retry attempts | `3` |
| `MAX_CONCURRENT_REQUESTS` | Concurrent request limit | `5` |
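The numeric settings could be loaded and validated along the following lines. This is a sketch under stated assumptions, not the project's `Config` class: `Settings` is a hypothetical name, only the integer variables are covered, and rejecting non-positive values is an assumed validation rule.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    # Defaults copied from the documented configuration values
    http_timeout: int = 15
    max_html_chars: int = 120000
    num_variations: int = 3
    urls_per_variation: int = 3
    max_retries: int = 3
    max_concurrent_requests: int = 5

    @classmethod
    def from_env(cls, env=None):
        """Build settings from UPPER_CASE environment variables, falling back to defaults."""
        env = os.environ if env is None else env
        values = {}
        for field in fields(cls):
            raw = env.get(field.name.upper(), field.default)
            value = int(raw)  # raises ValueError on non-numeric input
            if value <= 0:
                raise ValueError(f"{field.name.upper()} must be a positive integer, got {raw!r}")
            values[field.name] = value
        return cls(**values)

    @property
    def max_pages(self):
        # Upper bound on fetched pages, assuming only the variations are searched
        return self.num_variations * self.urls_per_variation
```

With the defaults, `Settings.from_env({}).max_pages` is 9, which gives a rough sense of how many LLM extraction and grading calls a single question can trigger.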
Results are returned as a JSON array of objects:
[
  {
    "url": "https://example.com/article",
    "title": "Comprehensive Guide to FastAPI",
    "variation": "What are the main features of FastAPI framework?",
    "grade": 9,
    "text": "FastAPI is a modern web framework for building APIs with Python..."
  }
]

- 10: Official documentation that directly and completely answers the question
- 9: Comprehensive tutorial or guide that thoroughly addresses the question
- 8: Detailed article with substantial relevant information
- 7: Good explanation with most key points covered
- 6: Partial answer with some relevant details
- 5: Mentions the topic but lacks depth
- 4: Tangentially related content
- 3: Minimal relevance to the question
- 2: Barely related or off-topic
- 1: Completely irrelevant
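Because the grade comes back from an LLM, the reply may need defensive parsing before it can be used as a number. The sketch below shows one possible approach; `parse_grade` and `top_results` are hypothetical helpers, and both the fallback grade of 1 and the threshold of 7 are assumptions chosen for illustration.

```python
import re


def parse_grade(reply: str, default: int = 1) -> int:
    """Return the first standalone integer from 1 to 10 in the reply, else the default."""
    match = re.search(r"\b(10|[1-9])\b", reply)
    return int(match.group(1)) if match else default


def top_results(results, min_grade=7):
    """Keep results graded at or above min_grade, best first."""
    kept = [r for r in results if r["grade"] >= min_grade]
    return sorted(kept, key=lambda r: r["grade"], reverse=True)
```

Falling back to the lowest grade when nothing parses keeps unreadable replies from being ranked above genuinely relevant pages.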
python aggregator.py "Explain quantum computing"
python aggregator.py "What are the current challenges in renewable energy storage?"
python aggregator.py "How to implement JWT authentication in Node.js"

- Concurrency: Adjust MAX_CONCURRENT_REQUESTS based on your system's capabilities and API rate limits
- HTML Processing: The MAX_HTML_CHARS limit prevents excessive processing of large pages
- LLM Efficiency: Content extraction focuses on main text to minimize token usage
- Caching: Consider implementing result caching for repeated queries
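A concurrency limit of this kind is commonly enforced with an `asyncio.Semaphore`. The sketch below shows the general pattern rather than the project's scraping service: `fetch_all` is a hypothetical name, and `fetch_one` stands for whatever coroutine actually downloads a page.

```python
import asyncio


async def fetch_all(urls, fetch_one, limit=5):
    """Run fetch_one(url) for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url):
        async with semaphore:  # waits here once `limit` fetches are active
            return await fetch_one(url)

    # return_exceptions=True keeps one failed page from cancelling the others;
    # failures come back as exception objects in the result list
    return await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)
```

Results come back in the same order as `urls`, so callers can pair each response (or exception) with its source URL.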
The pipeline includes comprehensive error handling for:
- Network timeouts and connection failures
- Invalid URLs and malformed responses
- LLM API errors and timeouts
- HTML parsing failures
- Configuration validation errors
All errors are logged with appropriate severity levels.
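Transient network failures are typically handled by retrying with a growing delay and logging each attempt at a suitable level. The project lists `tenacity` among its dependencies; the sketch below deliberately uses only the standard library to show the idea. `with_retries` is a hypothetical helper, and treating `max_retries` as the total number of attempts is an assumption.

```python
import logging
import time

logger = logging.getLogger("aggregator.sketch")  # hypothetical logger name


def with_retries(func, max_retries=3, base_delay=0.5,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except retry_on as exc:
            if attempt == max_retries:
                logger.error("giving up after %d attempts: %s", attempt, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)  # 0.5s, 1.0s, 2.0s, ...
            logger.warning("attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            sleep(delay)
```

The `sleep` parameter exists so the delay can be replaced in tests; exceptions outside `retry_on`, such as parsing errors, propagate immediately instead of being retried.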
- Input Validation: All inputs are validated before processing
- Rate Limiting: Configurable concurrency prevents overwhelming external services
- Privacy: Uses SearXNG for privacy-preserving search
- API Keys: Keep API credentials out of source code and supply them through the OLLAMA_API_KEY environment variable
This is a proof-of-concept implementation. Contributions are welcome for:
- Performance optimizations
- Additional LLM integrations
- Enhanced content extraction algorithms
- New search backend support
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with Ollama for LLM capabilities
- Powered by SearXNG for privacy-focused search
- Inspired by modern AI-driven information retrieval research
PromptShield Labs - Exploring the frontiers of AI-powered security and information processing.