ZaguanLabs/aggregator

Aggregator - PromptShield Labs PoC

License: MIT

A production-grade proof-of-concept (PoC) for an intelligent search and content extraction pipeline that leverages Large Language Models (LLMs) to gather, analyze, and rank web content based on user queries.

Overview

The Aggregator is designed to enhance information retrieval by:

  • Semantic Query Expansion: Generating multiple variations of user queries to capture broader search intent
  • Concurrent Web Scraping: Efficiently fetching and processing multiple web pages simultaneously
  • LLM-Powered Content Extraction: Using AI to extract only the most relevant textual content from HTML
  • Intelligent Relevance Grading: Scoring extracted content on a 1-10 scale with a detailed rubric
  • Structured Output: Providing clean, organized JSON results for downstream processing

This PoC demonstrates the potential of AI-driven web aggregation for applications like research assistants, content curation, and automated knowledge discovery.

Architecture

graph TD
    A[User Query] --> B[Query Variation Generation]
    B --> C[Concurrent Search via SearXNG]
    C --> D[Async Web Scraping]
    D --> E[LLM Content Extraction]
    E --> F[Relevance Grading]
    F --> G[Structured Results]
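
The stages in the diagram can be sketched as one function that chains them. This is a minimal illustration, not the project's orchestrator: every stage is an injected callable with a hypothetical signature, the loop runs sequentially rather than concurrently, and the URL de-duplication and sort-by-grade steps are assumptions.

```python
from typing import Callable, Dict, List


def run_pipeline(
    question: str,
    expand: Callable[[str], List[str]],    # query -> query variations
    search: Callable[[str], List[str]],    # variation -> result URLs
    scrape: Callable[[str], str],          # URL -> raw HTML
    extract: Callable[[str, str], str],    # (question, HTML) -> relevant text
    grade: Callable[[str, str], int],      # (question, text) -> 1..10
) -> List[Dict]:
    """Chain the diagram's stages; all stage functions are supplied by the caller."""
    results: List[Dict] = []
    seen = set()
    for variation in expand(question):
        for url in search(variation):
            if url in seen:  # assumption: skip URLs already found by another variation
                continue
            seen.add(url)
            text = extract(question, scrape(url))
            results.append({
                "url": url,
                "variation": variation,
                "grade": grade(question, text),
                "text": text,
            })
    # Assumption: present the highest-graded content first.
    return sorted(results, key=lambda r: r["grade"], reverse=True)
```

Passing the stages in as callables keeps the sketch testable without a search backend or an LLM server.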

Key Components

  • Pipeline Orchestrator: Coordinates the entire search and extraction workflow
  • LLM Client: Handles all interactions with Ollama for query expansion, content extraction, and grading
  • Search Service: Interfaces with SearXNG for privacy-focused web search
  • Async Scraping Service: Concurrently fetches and processes web content
  • HTML Processor: Cleans and optimizes HTML for LLM processing
  • HTTP Clients: Robust synchronous and asynchronous HTTP handling with retry logic
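
To illustrate what the HTML Processor's job might look like, here is a rough sketch that drops script and style content and caps the input size. It uses only the standard library (the project itself lists beautifulsoup4 as a dependency), and the names `clean_html` and `_TextCollector`, as well as truncating before parsing, are assumptions made for the example.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text, ignoring anything inside script/style/noscript."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def clean_html(html: str, max_chars: int = 120000) -> str:
    """Cap the input at max_chars (mirrors MAX_HTML_CHARS), then return visible text."""
    parser = _TextCollector()
    parser.feed(html[:max_chars])
    parser.close()
    return " ".join(parser.parts)
```

Shrinking the page before it reaches the LLM is what keeps token usage down; how the real processor decides what to keep may differ.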

Features

  • 🔍 Multi-Variation Search: Generates semantic variations of queries for comprehensive coverage
  • ⚡ High Performance: Asynchronous processing with configurable concurrency limits
  • 🧠 AI-Powered Analysis: LLM-driven content extraction and relevance assessment
  • 🔒 Privacy-Focused: Uses SearXNG for decentralized search without tracking
  • 🛡️ Robust Error Handling: Comprehensive exception handling and logging
  • 📊 Structured Output: Clean JSON results with grading and metadata
  • ⚙️ Highly Configurable: Environment-based configuration with sensible defaults

LLM Compatibility

This application talks to the LLM server through the OpenAI API (the openai package is one of its dependencies), so it works with any OpenAI-compatible provider or service. Popular options include:

  • Ollama for local LLM hosting
  • OpenAI official API
  • PromptShield - a great alternative that provides access to potentially 500 models across 14 providers

To use a different provider, point OLLAMA_URL at its API endpoint and set OLLAMA_API_KEY and OLLAMA_MODEL to match. Despite their names, these variables are not specific to Ollama.
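
As a hedged sketch of how the provider switch could work, the helper below resolves the endpoint, key, and model from the documented environment variables and hands them to the OpenAI client. The function names `llm_settings` and `make_client` are hypothetical, and the client construction assumes openai version 1.0 or later.

```python
import os
from typing import Mapping, Optional


def llm_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve provider settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "base_url": env.get("OLLAMA_URL", "http://127.0.0.1:11434/v1"),
        "api_key": env.get("OLLAMA_API_KEY", "ollama"),
        "model": env.get("OLLAMA_MODEL", "a_local_model"),
    }


def make_client(settings: dict):
    """Build an OpenAI-compatible client (assumes openai>=1.0 is installed)."""
    # Imported lazily so llm_settings() works even without the package.
    from openai import OpenAI

    return OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])
```

With a client in hand, a request would go through `client.chat.completions.create(model=settings["model"], messages=[...])`; only the base URL and key change between providers.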

Installation

Prerequisites

  • Python 3.8+
  • An LLM server compatible with the OpenAI API (such as Ollama or any OpenAI-compatible provider)
  • SearXNG instance (or compatible search API)

Dependencies

pip install requests openai beautifulsoup4 tenacity aiohttp

Setup

  1. Clone and navigate to the project:

    git clone https://github.com/PromptShieldLabs/aggregator.git
    cd aggregator
  2. Install dependencies:

    pip install -r requirements.txt
  3. Configure environment variables:

    export SEARXNG_URL="http://your-searxng-instance:port"
    export OLLAMA_URL="http://localhost:11434/v1"
    export OLLAMA_MODEL="your-preferred-model"
    export OLLAMA_API_KEY="your-api-key"

Usage

Command Line

python aggregator.py "What are the key features of FastAPI?"

Advanced Usage

# Output to file
python aggregator.py "How does Python asyncio work?" --output results.json

# Configuration validation
python aggregator.py --config-check
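
For readers wiring up a similar command line, the flags shown above could be declared with argparse roughly as follows. This is an illustrative sketch only; `build_parser` is a hypothetical name and the real script may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the positional question plus the two flags used in the examples."""
    parser = argparse.ArgumentParser(prog="aggregator.py")
    # The question is optional so that --config-check can be run on its own.
    parser.add_argument("question", nargs="?", help="question to research")
    parser.add_argument("--output", metavar="FILE", help="write JSON results to FILE")
    parser.add_argument(
        "--config-check",
        action="store_true",
        help="validate the configuration and exit",
    )
    return parser
```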

Programmatic Usage

from aggregator import Application, Config

config = Config.from_env()
app = Application(config)

question = "What are the benefits of microservices architecture?"
results = app.run(question)

for result in results:
    print(f"Title: {result['title']}")
    print(f"Grade: {result['grade']}/10")
    print(f"URL: {result['url']}")
    print("---")

Configuration

The application uses environment variables for configuration:

| Variable | Description | Default |
| --- | --- | --- |
| SEARXNG_URL | SearXNG instance URL | http://127.0.0.1:8888 |
| OLLAMA_URL | Ollama API endpoint | http://127.0.0.1:11434/v1 |
| OLLAMA_MODEL | LLM model name | a_local_model |
| OLLAMA_API_KEY | API key for Ollama | ollama |
| HTTP_TIMEOUT | Request timeout in seconds | 15 |
| USER_AGENT | HTTP User-Agent string | Chrome-like string |
| MAX_HTML_CHARS | Maximum HTML characters to process | 120000 |
| LOG_LEVEL | Logging verbosity | INFO |
| NUM_VARIATIONS | Number of query variations | 3 |
| URLS_PER_VARIATION | URLs to fetch per variation | 3 |
| MAX_RETRIES | Maximum retry attempts | 3 |
| MAX_CONCURRENT_REQUESTS | Concurrent request limit | 5 |
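
Environment-based configuration with defaults can be sketched with a dataclass. `SketchConfig` below is a hypothetical stand-in, not the project's `Config` class; its defaults are copied from the list above, and USER_AGENT is left out because its exact default string is not given.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical config holder; field names are the lowercased variable names."""

    searxng_url: str = "http://127.0.0.1:8888"
    ollama_url: str = "http://127.0.0.1:11434/v1"
    ollama_model: str = "a_local_model"
    ollama_api_key: str = "ollama"
    http_timeout: int = 15
    max_html_chars: int = 120000
    log_level: str = "INFO"
    num_variations: int = 3
    urls_per_variation: int = 3
    max_retries: int = 3
    max_concurrent_requests: int = 5

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "SketchConfig":
        """Override defaults with any matching variables found in env."""
        env = os.environ if env is None else env
        overrides = {}
        for field in fields(cls):
            raw = env.get(field.name.upper())
            if raw is not None:
                # Coerce to the default's type (int or str); a bad int raises ValueError.
                overrides[field.name] = type(field.default)(raw)
        return cls(**overrides)
```

Letting a malformed integer raise `ValueError` at startup is one simple way to surface configuration mistakes early.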

Output Format

Results are returned as a JSON array of objects:

[
  {
    "url": "https://example.com/article",
    "title": "Comprehensive Guide to FastAPI",
    "variation": "What are the main features of FastAPI framework?",
    "grade": 9,
    "text": "FastAPI is a modern web framework for building APIs with Python..."
  }
]
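
Downstream code can consume this array with the standard json module. The sketch below keeps only results at or above a grade threshold and orders them best first; `top_results` is a hypothetical helper and the default threshold of 7 is an arbitrary choice for the example.

```python
import json
from typing import List


def top_results(raw_json: str, min_grade: int = 7) -> List[dict]:
    """Parse the aggregator's JSON array and keep the highest-graded entries."""
    results = json.loads(raw_json)
    # Entries without a grade are treated as grade 0 and dropped.
    kept = [r for r in results if r.get("grade", 0) >= min_grade]
    return sorted(kept, key=lambda r: r["grade"], reverse=True)
```

For a file written with `--output results.json`, the same function applies to the file's contents, for example `top_results(open("results.json").read())`.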

Grading Rubric

  • 10: Official documentation that directly and completely answers the question
  • 9: Comprehensive tutorial or guide that thoroughly addresses the question
  • 8: Detailed article with substantial relevant information
  • 7: Good explanation with most key points covered
  • 6: Partial answer with some relevant details
  • 5: Mentions the topic but lacks depth
  • 4: Tangentially related content
  • 3: Minimal relevance to the question
  • 2: Barely related or off-topic
  • 1: Completely irrelevant
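
Because the grade comes back from an LLM as free text, some defensive parsing is usually needed before it can be stored as an integer. The helper below is one possible approach, not the project's parser: `parse_grade` is a hypothetical name, and it simply takes the first whole number between 1 and 10 that appears in the reply.

```python
import re
from typing import Optional


def parse_grade(reply: str) -> Optional[int]:
    """Return the first standalone integer in 1..10 found in reply, else None."""
    for match in re.finditer(r"\b\d+\b", reply):
        value = int(match.group())
        if 1 <= value <= 10:
            return value
    return None
```

This heuristic can be fooled by other small numbers in the reply, so it works best when the prompt asks the model to answer with the grade alone.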

Examples

Simple Query

python aggregator.py "Explain quantum computing"

Complex Research Question

python aggregator.py "What are the current challenges in renewable energy storage?"

Technical Documentation Search

python aggregator.py "How to implement JWT authentication in Node.js"

Performance Considerations

  • Concurrency: Adjust MAX_CONCURRENT_REQUESTS based on your system's capabilities and API rate limits
  • HTML Processing: The MAX_HTML_CHARS limit prevents excessive processing of large pages
  • LLM Efficiency: Content extraction focuses on main text to minimize token usage
  • Caching: Consider implementing result caching for repeated queries
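
The effect of a concurrency cap such as MAX_CONCURRENT_REQUESTS can be illustrated with an asyncio semaphore. In this sketch `fetch_all` is a hypothetical helper and the fetch coroutine is injected by the caller, so no particular HTTP library is assumed (the project lists aiohttp among its dependencies).

```python
import asyncio
from typing import Awaitable, Callable, List


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrent: int = 5,
) -> list:
    """Run fetch(url) for every URL with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str):
        async with semaphore:  # wait here while max_concurrent fetches are running
            return await fetch(url)

    # return_exceptions=True keeps one failed page from cancelling the rest;
    # failures appear in the result list as exception objects.
    return await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)
```

Results come back in the same order as the input URLs, which makes it easy to pair each page with the query variation that produced it.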

Error Handling

The pipeline includes comprehensive error handling for:

  • Network timeouts and connection failures
  • Invalid URLs and malformed responses
  • LLM API errors and timeouts
  • HTML parsing failures
  • Configuration validation errors

All errors are logged with appropriate severity levels.
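
Retrying transient network failures is one common way to implement this kind of handling. The sketch below uses only the standard library to stay self-contained (the project lists tenacity as a dependency and may rely on it instead); `call_with_retries` is a hypothetical helper, and it interprets a limit like MAX_RETRIES as the number of retries after the first attempt.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with exponential backoff."""
    attempt = 0
    while True:
        try:
            return func()
        except retry_on:
            attempt += 1
            if attempt > max_retries:
                raise  # retries exhausted: let the caller log and handle it
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Exceptions outside `retry_on`, such as a parsing error, propagate immediately, since retrying them would not help.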

Security Considerations

  • Input Validation: All inputs are validated before processing
  • Rate Limiting: Configurable concurrency prevents overwhelming external services
  • Privacy: Uses SearXNG for privacy-preserving search
  • API Keys: Supply OLLAMA_API_KEY through the environment and keep it out of source control and logs
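
As one concrete example of input validation, a scraper can refuse anything that is not a plain web URL before fetching it. The check below is an illustrative sketch, not the project's validator; `is_fetchable_url` is a hypothetical name and the http/https-only policy is an assumption.

```python
from urllib.parse import urlsplit


def is_fetchable_url(url: str) -> bool:
    """Accept only http(s) URLs that include a host name."""
    try:
        parts = urlsplit(url.strip())
        host = parts.hostname
    except ValueError:  # raised for malformed input such as a broken IPv6 literal
        return False
    return parts.scheme in ("http", "https") and bool(host)
```

Rejecting schemes such as `file:` or `javascript:` keeps search results from steering the scraper toward local files or non-web resources.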

Contributing

This is a proof-of-concept implementation. Contributions are welcome for:

  • Performance optimizations
  • Additional LLM integrations
  • Enhanced content extraction algorithms
  • New search backend support

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built with Ollama for LLM capabilities
  • Powered by SearXNG for privacy-focused search
  • Inspired by modern AI-driven information retrieval research

PromptShield Labs - Exploring the frontiers of AI-powered security and information processing.

About

Aggregator is a PoC that uses local search (SearXNG) and AI to produce a structured JSON response