Aggregator - PromptShield Labs PoC

A production-grade proof-of-concept (PoC) for an intelligent search and content extraction pipeline that leverages Large Language Models (LLMs) to gather, analyze, and rank web content based on user queries.

Overview

The Aggregator is designed to enhance information retrieval by:

Semantic Query Expansion: Generating multiple variations of user queries to capture broader search intent
Concurrent Web Scraping: Efficiently fetching and processing multiple web pages simultaneously
LLM-Powered Content Extraction: Using AI to extract only the most relevant textual content from HTML
Intelligent Relevance Grading: Scoring extracted content on a 1-10 scale with a detailed rubric
Structured Output: Providing clean, organized JSON results for downstream processing

This PoC demonstrates the potential of AI-driven web aggregation for applications like research assistants, content curation, and automated knowledge discovery.

Architecture

graph TD
    A[User Query] --> B[Query Variation Generation]
    B --> C[Concurrent Search via SearXNG]
    C --> D[Async Web Scraping]
    D --> E[LLM Content Extraction]
    E --> F[Relevance Grading]
    F --> G[Structured Results]

Key Components

Pipeline Orchestrator: Coordinates the entire search and extraction workflow
LLM Client: Handles all interactions with Ollama for query expansion, content extraction, and grading
Search Service: Interfaces with SearXNG for privacy-focused web search
Async Scraping Service: Concurrently fetches and processes web content
HTML Processor: Cleans and optimizes HTML for LLM processing
HTTP Clients: Robust synchronous and asynchronous HTTP handling with retry logic

Features

🔍 Multi-Variation Search: Generates semantic variations of queries for comprehensive coverage
⚡ High Performance: Asynchronous processing with configurable concurrency limits
🧠 AI-Powered Analysis: LLM-driven content extraction and relevance assessment
🔒 Privacy-Focused: Uses SearXNG for decentralized search without tracking
🛡️ Robust Error Handling: Comprehensive exception handling and logging
📊 Structured Output: Clean JSON results with grading and metadata
⚙️ Highly Configurable: Environment-based configuration with sensible defaults

LLM Compatibility

This application uses the OpenAI API to connect to the LLM server, making it compatible with any OpenAI-compatible API provider or service. This includes popular options like:

Ollama for local LLM hosting
OpenAI official API
PromptShield - a great alternative that provides access to potentially 500 models across 14 providers

To use a different provider, simply set the appropriate API endpoint and key in your configuration.

Installation

Prerequisites

Python 3.8+
An LLM server compatible with the OpenAI API (such as Ollama or any OpenAI-compatible provider)
SearXNG instance (or compatible search API)

Dependencies

pip install requests openai beautifulsoup4 tenacity aiohttp

Setup

Clone and navigate to the project:

git clone https://github.com/PromptShieldLabs/aggregator.git
cd aggregator

Install dependencies:
```
pip install -r requirements.txt
```

Configure environment variables:

export SEARXNG_URL="http://your-searxng-instance:port"
export OLLAMA_URL="http://localhost:11434/v1"
export OLLAMA_MODEL="your-preferred-model"
export OLLAMA_API_KEY="your-api-key"

Usage

Command Line

python aggregator.py "What are the key features of FastAPI?"

Advanced Usage

# Output to file
python aggregator.py "How does Python asyncio work?" --output results.json

# Configuration validation
python aggregator.py --config-check

Programmatic Usage

from aggregator import Application, Config

config = Config.from_env()
app = Application(config)

question = "What are the benefits of microservices architecture?"
results = app.run(question)

for result in results:
    print(f"Title: {result['title']}")
    print(f"Grade: {result['grade']}/10")
    print(f"URL: {result['url']}")
    print("---")

Configuration

The application uses environment variables for configuration:

Variable	Description	Default
`SEARXNG_URL`	SearXNG instance URL	`http://127.0.0.1:8888`
`OLLAMA_URL`	Ollama API endpoint	`http://127.0.0.1:11434/v1`
`OLLAMA_MODEL`	LLM model name	`a_local_model`
`OLLAMA_API_KEY`	API key for Ollama	`ollama`
`HTTP_TIMEOUT`	Request timeout in seconds	`15`
`USER_AGENT`	HTTP User-Agent string	Chrome-like string
`MAX_HTML_CHARS`	Maximum HTML characters to process	`120000`
`LOG_LEVEL`	Logging verbosity	`INFO`
`NUM_VARIATIONS`	Number of query variations	`3`
`URLS_PER_VARIATION`	URLs to fetch per variation	`3`
`MAX_RETRIES`	Maximum retry attempts	`3`
`MAX_CONCURRENT_REQUESTS`	Concurrent request limit	`5`

Output Format

Results are returned as a JSON array of objects:

[
  {
    "url": "https://example.com/article",
    "title": "Comprehensive Guide to FastAPI",
    "variation": "What are the main features of FastAPI framework?",
    "grade": 9,
    "text": "FastAPI is a modern web framework for building APIs with Python..."
  }
]

Grading Rubric

10: Official documentation that directly and completely answers the question
9: Comprehensive tutorial or guide that thoroughly addresses the question
8: Detailed article with substantial relevant information
7: Good explanation with most key points covered
6: Partial answer with some relevant details
5: Mentions the topic but lacks depth
4: Tangentially related content
3: Minimal relevance to the question
2: Barely related or off-topic
1: Completely irrelevant

Examples

Simple Query

python aggregator.py "Explain quantum computing"

Complex Research Question

python aggregator.py "What are the current challenges in renewable energy storage?"

Technical Documentation Search

python aggregator.py "How to implement JWT authentication in Node.js"

Performance Considerations

Concurrency: Adjust MAX_CONCURRENT_REQUESTS based on your system's capabilities and API rate limits
HTML Processing: The MAX_HTML_CHARS limit prevents excessive processing of large pages
LLM Efficiency: Content extraction focuses on main text to minimize token usage
Caching: Consider implementing result caching for repeated queries

Error Handling

The pipeline includes comprehensive error handling for:

Network timeouts and connection failures
Invalid URLs and malformed responses
LLM API errors and timeouts
HTML parsing failures
Configuration validation errors

All errors are logged with appropriate severity levels.

Security Considerations

Input Validation: All inputs are validated before processing
Rate Limiting: Configurable concurrency prevents overwhelming external services
Privacy: Uses SearXNG for privacy-preserving search
API Keys: Securely manage Ollama API credentials

Contributing

This is a proof-of-concept implementation. Contributions are welcome for:

Performance optimizations
Additional LLM integrations
Enhanced content extraction algorithms
New search backend support

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Built with Ollama for LLM capabilities
Powered by SearXNG for privacy-focused search
Inspired by modern AI-driven information retrieval research

PromptShield Labs - Exploring the frontiers of AI-powered security and information processing.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
LICENSE		LICENSE
README.md		README.md
aggregator.py		aggregator.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Aggregator - PromptShield Labs PoC

Overview

Architecture

Key Components

Features

LLM Compatibility

Installation

Prerequisites

Dependencies

Setup

Usage

Command Line

Advanced Usage

Programmatic Usage

Configuration

Output Format

Grading Rubric

Examples

Simple Query

Complex Research Question

Technical Documentation Search

Performance Considerations

Error Handling

Security Considerations

Contributing

License

Acknowledgments

About

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Aggregator - PromptShield Labs PoC

Overview

Architecture

Key Components

Features

LLM Compatibility

Installation

Prerequisites

Dependencies

Setup

Usage

Command Line

Advanced Usage

Programmatic Usage

Configuration

Output Format

Grading Rubric

Examples

Simple Query

Complex Research Question

Technical Documentation Search

Performance Considerations

Error Handling

Security Considerations

Contributing

License

Acknowledgments

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages