
Design Document: RAG System

Overview

The RAG (Retrieval-Augmented Generation) system is a production-ready implementation that combines document retrieval with language model generation using LMStudio models and LangGraph orchestration. The system features a modular architecture with comprehensive error handling, runtime configuration management, and full observability.

The system uses Qdrant as the vector database for efficient similarity search, LMStudio (with unsloth/gemma-3-4b-it-GGUF model) for local language model inference, and LangGraph for workflow orchestration with conditional routing. This design ensures privacy (local models), performance (optimized vector search), maintainability (graph-based workflows), and production reliability (comprehensive monitoring and error handling).

Architecture

The system follows a layered architecture with the following components:

graph TB
    subgraph "User Interface Layer"
        CLI[Rich-based CLI Interface]
        API[REST API Interface]
        BATCH[Batch Processing]
    end
    
    subgraph "Orchestration Layer"
        LG[LangGraph Orchestrator]
        WM[Workflow Metrics]
        WL[Workflow Logger]
        RC[Runtime Config Manager]
    end
    
    subgraph "Processing Layer"
        EMB[Embedder]
        RET[Retriever] 
        GEN[Generator]
        DOC[Document Processor]
        PT[Prompt Templates]
    end
    
    subgraph "Storage Layer"
        QD[Qdrant Vector DB]
        FS[File System]
        CM[Config Management]
    end
    
    subgraph "External Services"
        LMS[LMStudio Server<br/>gemma-3-4b-it]
    end
    
    subgraph "Error Handling & Monitoring"
        EH[Error Handler]
        RF[Response Formatter]
        LOG[Structured Logging]
    end
    
    CLI --> LG
    API --> LG
    BATCH --> LG
    LG --> EMB
    LG --> RET
    LG --> GEN
    LG --> DOC
    LG --> WM
    LG --> WL
    EMB --> QD
    RET --> QD
    DOC --> FS
    GEN --> LMS
    GEN --> PT
    RC --> CM
    EH --> LOG
    RF --> CLI
    EH --> CLI

Components and Interfaces

Document Processor

Purpose: Handles document ingestion, parsing, and chunking with metadata extraction.

Key Methods:

  • process_document(file_path: str) -> List[DocumentChunk]
  • process_batch(file_paths: List[str]) -> List[DocumentChunk]
  • extract_text(file_path: str) -> str
  • chunk_text(text: str, config: ChunkingConfig) -> List[str]

Supported Formats: PDF, TXT, Markdown

Chunking Strategy: Configurable via ChunkingConfig with recursive text splitting, overlap control, and metadata preservation
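The overlap behavior of the chunking strategy can be sketched in plain Python. This is a simplified stand-in: it uses fixed-size character windows rather than the recursive, separator-aware splitting that ChunkingConfig describes, and the function name is illustrative.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into overlapping windows; a simplified stand-in for
    recursive, separator-aware splitting."""
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must exceed chunk_overlap")
    chunks = []
    step = chunk_size - chunk_overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk shares its first chunk_overlap characters with the tail of the previous chunk, which is what preserves context across chunk boundaries at retrieval time.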

Embedder

Purpose: Converts text into vector embeddings using sentence-transformers.

Key Methods:

  • embed_text(text: str) -> List[float]
  • embed_batch(texts: List[str]) -> List[List[float]]
  • get_embedding_dimension() -> int
  • get_model_info() -> Dict[str, Any]

Model Configuration:

  • Default: second-state/All-MiniLM-L6-v2-Embedding-GGUF (384 dimensions)
  • Device auto-detection (CUDA/CPU)
  • Configurable batch size and normalization
  • EmbeddingConfig dataclass for flexible configuration
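The batching and normalization behavior described above can be sketched as follows. The embed_fn callable here is a hypothetical stand-in for the actual model call; only the batch iteration and L2 normalization (what normalize_embeddings=True implies) are shown.

```python
import math

def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length, as when normalize_embeddings=True."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else vector

def embed_batch(texts: list[str], embed_fn, batch_size: int = 32,
                normalize: bool = True) -> list[list[float]]:
    """Embed texts in fixed-size batches; embed_fn stands in for the model."""
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        vectors = [embed_fn(t) for t in batch]
        if normalize:
            vectors = [l2_normalize(v) for v in vectors]
        results.extend(vectors)
    return results
```

With unit-length vectors, cosine similarity reduces to a dot product, which is why normalization pairs well with the cosine distance metric used in Qdrant.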

Qdrant Document Store

Purpose: Stores and indexes document embeddings with duplicate detection.

Key Methods:

  • add_documents(documents: List[DocumentChunk]) -> bool
  • search_similar(query_embedding: List[float], top_k: int) -> List[DocumentChunk]
  • find_similar_to_document(document_id: str, top_k: int) -> List[DocumentChunk]
  • delete_collection(collection_name: str) -> bool
  • get_collection_info() -> Dict[str, Any]

Configuration:

  • Collection name: configurable per knowledge base
  • Vector dimension: 384 (matches All-MiniLM-L6-v2-Embedding-GGUF)
  • Distance metric: Cosine similarity
  • Index type: HNSW for performance
  • Duplicate detection: 0.95 similarity threshold
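The duplicate-detection rule at the 0.95 threshold amounts to a cosine-similarity check against stored vectors. A minimal sketch (function names are illustrative; the real store would query the HNSW index rather than scan a list):

```python
def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def is_duplicate(new_embedding: list[float],
                 existing_embeddings: list[list[float]],
                 threshold: float = 0.95) -> bool:
    """Flag a chunk as duplicate if it is near-identical to any stored vector."""
    return any(cosine_similarity(new_embedding, e) >= threshold
               for e in existing_embeddings)
```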

Retriever

Purpose: Orchestrates query embedding and similarity search with validation.

Key Methods:

  • retrieve(query: str, top_k: int = 5) -> List[DocumentChunk]
  • retrieve_batch(queries: List[str], top_k: int = 5) -> List[List[DocumentChunk]]
  • test_connection() -> bool
  • get_retrieval_stats() -> Dict[str, Any]

Features:

  • Query preprocessing and normalization
  • Similarity threshold filtering
  • Metadata preservation (source, chunk position)
  • Batch retrieval for multiple queries
  • Connection testing and validation
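The interaction between threshold filtering and top-k selection can be sketched like this (a simplified, list-based stand-in for what the Retriever does after the vector search returns scored hits):

```python
def filter_and_rank(scored_docs: list[tuple[str, float]],
                    top_k: int = 5,
                    similarity_threshold: float = 0.7) -> list[tuple[str, float]]:
    """Drop results below the threshold, then return the top_k best matches,
    highest score first."""
    kept = [(doc, score) for doc, score in scored_docs
            if score >= similarity_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]
```

Note the ordering: filtering happens before truncation, so a query whose every hit falls below the threshold yields an empty result set rather than low-quality context.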

LMStudio Client

Purpose: Interfaces with the local LMStudio server, with fallback endpoints.

Key Methods:

  • generate(prompt: str, **kwargs) -> str
  • generate_async(prompt: str, **kwargs) -> str
  • check_connection() -> bool
  • get_available_models() -> List[str]

Configuration:

  • Server endpoint (default: http://localhost:1234)
  • Model: unsloth/gemma-3-4b-it-GGUF
  • Parameters: temperature (0.7), max_tokens (1000), top_p (0.9)
  • Multiple endpoint fallback (v1/completions, v1/chat/completions)
  • Retry logic with exponential backoff
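The retry logic with exponential backoff can be sketched generically; request_fn stands in for the actual HTTP call, and the injectable sleep parameter is an illustrative testability device, not part of the real client's API:

```python
import time

def call_with_retry(request_fn, max_attempts: int = 3,
                    base_delay: float = 0.5, sleep=time.sleep):
    """Retry a flaky call, doubling the delay after each failed attempt."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the error to the caller
            sleep(base_delay * (2 ** attempt))
```

With the defaults, failures are retried after 0.5s and 1.0s before the third and final attempt.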

Generator

Purpose: Combines retrieved context with queries using advanced prompt templates.

Key Methods:

  • generate_response(query: str, context: List[DocumentChunk]) -> Response
  • generate_no_context_response(query: str) -> Response
  • format_prompt(query: str, context: List[DocumentChunk]) -> str
  • calculate_confidence_score(response: str, context: List[DocumentChunk]) -> float

Prompt Templates:

  • PromptTemplate: Basic template with configurable sections
  • AdvancedPromptTemplate: Enhanced with source grouping and citation
  • Factory function for template creation
  • No-context fallback templates
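A minimal sketch of context-grounded prompt assembly with numbered source citations and the no-context fallback. The chunk dict shape ('content', 'source' keys) and the wording are illustrative, not the actual templates:

```python
def format_prompt(query: str, context_chunks: list[dict]) -> str:
    """Assemble a grounded prompt; chunks carry 'content' and 'source' keys."""
    if not context_chunks:
        # No-context fallback template
        return (f"Answer the question below. If you do not know the answer, "
                f"say so.\n\nQuestion: {query}")
    sources = "\n\n".join(
        f"[{i + 1}] (source: {c['source']})\n{c['content']}"
        for i, c in enumerate(context_chunks))
    return (f"Answer the question using only the context below, citing "
            f"sources by number.\n\nContext:\n{sources}\n\nQuestion: {query}")
```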

LangGraph Orchestrator

Purpose: Manages the RAG workflow using graph-based orchestration with observability.

Workflow Nodes:

  1. Query Processing: Validate and preprocess user input
  2. Retrieval: Search for relevant documents with embedding generation
  3. Context Evaluation: Assess relevance of retrieved documents
  4. Generation: Generate response using LMStudio with context
  5. No Context Generation: Handle queries with no relevant context
  6. Search Expansion: Retry with relaxed parameters (max 2 attempts)
  7. Response Formatting: Structure final output with metadata and confidence

Key Features:

  • Async and sync processing modes
  • Thread-based conversation tracking
  • State persistence via memory checkpointer
  • Comprehensive workflow metrics collection
  • Error propagation and recovery

Workflow Graph:

graph TD
    A[Start] --> B[Process Query]
    B --> C[Retrieve Documents]
    C --> D{Documents Found?}
    D -->|Yes| E[Evaluate Relevance]
    D -->|No| F[No Context Response]
    E --> G{Relevant Context?}
    G -->|Yes| H[Generate Response]
    G -->|No| I{Expansion Attempts < 2?}
    I -->|Yes| J[Expand Search]
    I -->|No| F
    J --> C
    H --> K[Format Output]
    F --> K
    K --> L[End]
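The conditional edges of the workflow graph above can be sketched as a single routing function (node names are illustrative; LangGraph would wire these as conditional edges over the workflow state):

```python
def route_after_evaluation(documents_found: bool, relevant: bool,
                           expansion_attempts: int,
                           max_expansions: int = 2) -> str:
    """Mirror the conditional routing of the workflow graph."""
    if not documents_found:
        return "no_context_response"
    if relevant:
        return "generate_response"
    if expansion_attempts < max_expansions:
        return "expand_search"  # loops back to retrieval with relaxed params
    return "no_context_response"
```

Capping expansion at two attempts guarantees the loop through retrieval terminates, after which the no-context path always produces a response.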

Runtime Configuration Manager

Purpose: Manages dynamic configuration updates with validation and persistence.

Key Methods:

  • update_parameter(parameter: str, value: Any) -> bool
  • batch_update(updates: Dict[str, Any]) -> bool
  • get_change_history() -> List[ConfigChange]
  • reset_to_defaults() -> bool

Features:

  • Runtime-updatable vs restart-required parameter classification
  • Change history tracking with timestamps
  • Callback system for configuration changes
  • Atomic batch updates with rollback capability
  • Configuration persistence to file
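The atomic-batch-update-with-rollback behavior can be sketched over a plain dict; the per-key validator mapping is an assumed shape, not the actual RuntimeConfigManager API:

```python
def batch_update(config: dict, updates: dict, validators: dict) -> bool:
    """Apply all updates or none: validate each key, snapshot first,
    and roll back to the snapshot on any failure."""
    snapshot = dict(config)
    try:
        for key, value in updates.items():
            if key not in config:
                raise KeyError(f"unknown parameter: {key}")
            if key in validators and not validators[key](value):
                raise ValueError(f"invalid value for {key}: {value!r}")
            config[key] = value
        return True
    except (KeyError, ValueError):
        config.clear()
        config.update(snapshot)
        return False
```

Because validation and assignment happen inside one guarded loop, a batch that fails on its last key leaves no trace of its earlier keys.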

Data Models

DocumentChunk

@dataclass
class DocumentChunk:
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    content: str = ""
    embedding: Optional[List[float]] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    source: str = ""
    chunk_index: int = 0
    
    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary for storage"""
        
    def to_json(self) -> str:
        """Convert to JSON string"""
        
    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> 'DocumentChunk':
        """Create from dictionary"""
        
    @classmethod
    def from_json(cls, json_str: str) -> 'DocumentChunk':
        """Create from JSON string"""

Query

@dataclass
class Query:
    text: str
    embedding: Optional[List[float]] = None
    filters: Optional[Dict[str, Any]] = None
    top_k: int = 5
    similarity_threshold: float = 0.7
    timestamp: datetime = field(default_factory=datetime.now)
    
    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary"""
        
    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> 'Query':
        """Create from dictionary"""

Response

@dataclass
class Response:
    answer: str
    source_documents: List[DocumentChunk]
    confidence_score: float
    processing_time: float
    metadata: Dict[str, Any]
    timestamp: datetime = field(default_factory=datetime.now)
    
    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary"""
        
    def to_json(self) -> str:
        """Convert to JSON string"""
        
    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> 'Response':
        """Create from dictionary"""
        
    @classmethod
    def from_json(cls, json_str: str) -> 'Response':
        """Create from JSON string"""

ProcessingStats

@dataclass
class ProcessingStats:
    documents_processed: int = 0
    chunks_created: int = 0
    embeddings_generated: int = 0
    queries_processed: int = 0
    average_processing_time: float = 0.0
    total_processing_time: float = 0.0
    errors_encountered: int = 0
    last_updated: datetime = field(default_factory=datetime.now)
    
    def update_query_stats(self, processing_time: float) -> None:
        """Update statistics for query processing"""
        
    def update_document_stats(self, chunks_count: int, embeddings_count: int) -> None:
        """Update statistics for document processing"""
        
    def record_error(self) -> None:
        """Record an error occurrence"""
        
    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary"""

Configuration Models

RAGConfig

@dataclass
class RAGConfig:
    # Qdrant settings
    qdrant_host: str = "localhost"
    qdrant_port: int = 6333
    collection_name: str = "documents"
    
    # LMStudio settings
    lmstudio_endpoint: str = "http://localhost:1234"
    model_name: str = "unsloth/gemma-3-4b-it-GGUF"
    temperature: float = 0.7
    max_tokens: int = 1000
    top_p: float = 0.9
    
    # Embedding settings
    embedding_model: str = "second-state/All-MiniLM-L6-v2-Embedding-GGUF"
    embedding_dimension: int = 384
    
    # Retrieval settings
    default_top_k: int = 5
    similarity_threshold: float = 0.7
    chunk_size: int = 1000
    chunk_overlap: int = 200
    
    # Logging settings
    log_level: str = "INFO"
    log_file: Optional[str] = None
    
    # Performance settings
    batch_size: int = 32
    max_concurrent_requests: int = 10
    device: Optional[str] = None  # Auto-detect if None
    
    def validate(self) -> None:
        """Validate configuration parameters"""
        
    @classmethod
    def from_file(cls, file_path: str) -> 'RAGConfig':
        """Load configuration from file"""
        
    def to_file(self, file_path: str) -> None:
        """Save configuration to file"""

EmbeddingConfig

@dataclass
class EmbeddingConfig:
    model_name: str = "second-state/All-MiniLM-L6-v2-Embedding-GGUF"
    batch_size: int = 32
    max_seq_length: Optional[int] = None
    device: Optional[str] = None
    normalize_embeddings: bool = True
    show_progress_bar: bool = False

ChunkingConfig

@dataclass
class ChunkingConfig:
    chunk_size: int = 1000
    chunk_overlap: int = 200
    separators: List[str] = field(default_factory=lambda: ["\n\n", "\n", " ", ""])
    keep_separator: bool = True
    add_start_index: bool = True

Workflow Models

WorkflowState

class WorkflowState(TypedDict):
    query: Optional[Query]
    retrieved_documents: List[DocumentChunk]
    generated_response: Optional[str]
    final_response: Optional[Response]
    error: Optional[str]
    metadata: Dict[str, Any]
    processing_start_time: Optional[float]

ConfigChange

@dataclass
class ConfigChange:
    parameter: str
    old_value: Any
    new_value: Any
    timestamp: datetime = field(default_factory=datetime.now)
    source: str = "runtime"  # runtime, file, api, etc.
    
    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary"""

Correctness Properties

A property is a characteristic or behavior that should hold true across all valid executions of a system: essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.

Converting EARS to Properties

Based on the prework analysis, the testable acceptance criteria are converted into universally quantified properties:

Property 1: Document Ingestion Round Trip
For any valid document in a supported format (PDF, TXT, Markdown), ingesting it into the system should result in retrievable embeddings in Qdrant with preserved content and metadata.
Validates: Requirements 1.1, 1.2, 1.4

Property 2: Embedding Consistency
For any text content, the Embedder should produce consistent numerical vectors of the expected dimension across multiple calls.
Validates: Requirements 1.2

Property 3: Duplicate Document Handling
For any document that is ingested multiple times, the Document_Store should handle duplicates appropriately without corrupting the collection.
Validates: Requirements 1.5

Property 4: Query Retrieval Completeness
For any user query, the Retriever should convert it to embeddings, search Qdrant, and return results with complete metadata (source, chunk position).
Validates: Requirements 2.1, 2.5

Property 5: Similarity Ranking Correctness
For any query with known similar documents in the collection, the Retriever should return documents ordered by relevance, with the most similar documents first.
Validates: Requirements 2.2

Property 6: Top-K Parameter Compliance
For any valid top-k value, the Retriever should return exactly that number of documents (or fewer if insufficient documents exist).
Validates: Requirements 2.3

Property 7: Threshold Filtering
For any query with no documents above the similarity threshold, the Retriever should return an empty result set.
Validates: Requirements 2.4

Property 8: Generation Parameter Application
For any valid model parameters (temperature, max_tokens, top_p), the LMStudio_Client should apply them correctly when generating responses.
Validates: Requirements 3.3

Property 9: Context Integration
For any user query and retrieved document context, the Generator should produce responses that incorporate both the query and the provided context.
Validates: Requirements 4.1, 4.2

Property 10: Source Citation
For any generated response with source documents, the response should include references or citations to the source documents used.
Validates: Requirements 4.4

Property 11: Response Completeness
For any processed query, the RAG_System should return a structured response containing generated text, source documents, and metadata.
Validates: Requirements 4.5, 7.3

Property 12: Workflow Execution Order
For any query processed through the LangGraph_Orchestrator, the workflow steps should execute in the correct sequence: query processing → retrieval → generation → response formatting.
Validates: Requirements 5.2

Property 13: Conditional Workflow Handling
For any query that results in no retrieved documents, the LangGraph_Orchestrator should execute the appropriate conditional logic and generate a "no relevant information" response.
Validates: Requirements 5.3

Property 14: Configuration Validation
For any configuration file with invalid settings, the RAG_System should detect the invalid values and provide clear error messages.
Validates: Requirements 6.2

Property 15: Runtime Parameter Updates
For any valid runtime parameter change (retrieval count, generation settings), the system should apply the new parameters to subsequent operations.
Validates: Requirements 6.3

Property 16: Batch Processing Consistency
For any set of queries processed in batch mode, each query should receive the same quality of processing as if processed individually.
Validates: Requirements 7.4

Property 17: Error Message Clarity
For any error condition during query processing, the RAG_System should provide clear error messages that help users understand and resolve the issue.
Validates: Requirements 7.5

Property 18: Incremental Document Addition
For any existing document collection, adding new documents should not require reprocessing of existing documents.
Validates: Requirements 8.3

Property 19: Batch Processing Efficiency
For any large set of documents, the Embedder should process them in batches rather than individually to optimize memory usage.
Validates: Requirements 8.4
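As an illustration of how such a property can be exercised, here is a stdlib-only randomized check of Property 6 (top-k compliance); the retriever stub and function names are hypothetical stand-ins, and the actual suite uses Hypothesis as described in the testing strategy:

```python
import random

def retrieve_top_k(scores: list[float], top_k: int) -> list[float]:
    """Stand-in retriever: return up to top_k scores, best first."""
    return sorted(scores, reverse=True)[:top_k]

def check_top_k_property(trials: int = 100) -> bool:
    """Property 6: result count equals top_k, or the collection size
    when fewer documents exist, and results are ranked best-first."""
    rng = random.Random(42)
    for _ in range(trials):
        scores = [rng.random() for _ in range(rng.randint(0, 20))]
        top_k = rng.randint(1, 10)
        result = retrieve_top_k(scores, top_k)
        if len(result) != min(top_k, len(scores)):
            return False
        if result != sorted(result, reverse=True):
            return False
    return True
```

Hypothesis adds shrinking and smarter input generation on top of this pattern, but the quantified statement being checked is the same.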

Error Handling

The system implements comprehensive error handling with structured logging, user-friendly messages, and actionable suggestions:

Error Handling Architecture

Error Categories and Severity

  • Categories: Configuration, Connection, Authentication, Validation, Processing, Resource, System, User Input
  • Severity Levels: Low, Medium, High, Critical
  • Structured Context: ErrorContext dataclass with operation, component, user action, system state, and timestamp

Error Handler Components

  • RAGSystemError: Base exception class with category, severity, and suggestions
  • ErrorSuggestion: Actionable suggestions with priority and technical level
  • ErrorFormatter: Rich console formatting for user-friendly error display
  • Structured Logging: Using structlog for machine-readable error logs

Component-Specific Error Handling

Document Processing Errors

  • Unsupported file formats: Clear error messages with supported format list
  • Corrupted files: Graceful handling with specific error details and recovery suggestions
  • Large file handling: Memory management with progress indicators and chunking
  • Permission errors: Clear file access error messages with permission fix suggestions

Vector Database Errors

  • Qdrant connection failures: Retry logic with exponential backoff (3 attempts)
  • Collection creation errors: Validation and recovery procedures with detailed diagnostics
  • Index corruption: Detection and rebuilding capabilities with backup recommendations
  • Storage space issues: Monitoring and alerts with cleanup suggestions
  • Duplicate detection: 0.95 similarity threshold with conflict resolution

LMStudio Integration Errors

  • Server unavailability: Connection testing with multiple endpoint fallback
  • Model loading failures: Model validation with alternative model suggestions
  • Generation timeouts: Configurable timeout (30s default) with retry logic
  • Invalid parameters: Parameter validation with range suggestions and examples
  • Rate limiting: Backoff strategy with queue management

Workflow Orchestration Errors

  • Node execution failures: Error propagation with state preservation and recovery options
  • State management errors: State validation with rollback capabilities and checkpoint recovery
  • Resource exhaustion: Memory and processing limits with graceful degradation
  • Configuration errors: Comprehensive validation with specific parameter fix suggestions
  • Thread management: Conversation tracking with cleanup and recovery

Error Recovery Strategies

Automatic Recovery

  • Connection retry: Exponential backoff for Qdrant and LMStudio connections
  • Search expansion: Automatic retry with relaxed similarity thresholds (max 2 attempts)
  • Fallback responses: No-context response generation when retrieval fails
  • Graceful degradation: Reduced functionality when components are unavailable

User-Guided Recovery

  • Configuration validation: Step-by-step configuration fix suggestions
  • Service health checks: Automated diagnostics with specific fix instructions
  • Error context: Detailed error information with troubleshooting steps
  • Recovery commands: CLI commands for common recovery scenarios

Monitoring and Observability

Workflow Metrics

  • WorkflowMetrics: Comprehensive metrics collection for performance monitoring
  • WorkflowLogger: Structured logging with configurable output formats
  • ObservabilityMixin: Metrics collection integration for all components
  • Performance tracking: Response times, success rates, error frequencies

Error Tracking

  • Error history: Persistent error log with categorization and trends
  • Error aggregation: Common error pattern detection and reporting
  • Alert thresholds: Configurable error rate alerts for proactive monitoring
  • Diagnostic exports: Error context export for debugging and support

Testing Strategy

The testing strategy employs a comprehensive approach combining unit tests, integration tests, and property-based tests with full observability:

Testing Architecture

Unit Testing Approach

  • Component isolation: Test individual components (Embedder, Retriever, Generator, RuntimeConfigManager) in isolation
  • Interface testing: Test all public methods and error conditions for each component
  • Mock dependencies: Use mocks for LMStudio and Qdrant during unit testing with realistic response simulation
  • Configuration validation: Test various configuration combinations and edge cases
  • Error condition testing: Comprehensive error scenario testing with recovery validation

Integration Testing

  • End-to-end workflows: Test complete RAG pipeline from document ingestion to response generation
  • LangGraph workflow testing: Test all workflow nodes and conditional routing paths
  • External service integration: Test actual integration with Qdrant and LMStudio servers
  • Concurrent access: Test system behavior with multiple simultaneous queries and document processing
  • Data persistence: Verify data integrity across system restarts and configuration changes
  • Performance benchmarks: Validate response times and memory usage under various load conditions

Property-Based Testing Configuration

  • Testing framework: Use Hypothesis for Python property-based testing
  • Test iterations: Minimum 100 iterations per property test for statistical confidence
  • Smart data generation: Realistic generators for documents, queries, configurations, and embeddings
  • Shrinking optimization: Leverage Hypothesis shrinking to find minimal failing examples
  • Test categorization: Each property test tagged with format: Feature: rag-system, Property {number}: {property_text}

Test Data Generation Strategies

Document Generation

  • Format variety: Generate PDF, TXT, and Markdown documents with varying structures
  • Content diversity: Create documents with different lengths (100-10000 words), languages, and technical content
  • Metadata variation: Generate realistic metadata including timestamps, authors, and categories
  • Edge cases: Empty documents, very large documents, special characters, and encoding issues

Query Generation

  • Complexity levels: Simple keyword queries to complex multi-part questions
  • Similarity patterns: Queries with known similar documents for retrieval validation
  • Language variety: Multi-language queries for internationalization testing
  • Edge cases: Empty queries, very long queries, special characters, and malformed input

Configuration Generation

  • Valid combinations: Generate realistic configuration combinations for different deployment scenarios
  • Invalid scenarios: Create invalid configurations to test validation and error handling
  • Boundary testing: Test parameter limits and edge values
  • Runtime updates: Generate configuration change sequences for runtime update testing

Embedding Generation

  • Realistic vectors: Generate embeddings that match expected dimensionality and distribution
  • Similarity testing: Create embedding pairs with known similarity relationships
  • Performance testing: Generate large embedding sets for batch processing validation
  • Error simulation: Create malformed embeddings for error handling testing

Testing Environment Setup

Containerized Testing

  • Docker containers: Consistent test environments with Qdrant and LMStudio services
  • Test isolation: Separate containers for parallel test execution
  • Service orchestration: Docker Compose for complete testing stack setup
  • Environment cleanup: Automated container cleanup and resource management

Test Data Management

  • Automated setup: Scripted test collection creation and population
  • Data fixtures: Reusable test datasets for consistent testing
  • Cleanup procedures: Automated teardown of test collections and temporary files
  • Backup and restore: Test data versioning and restoration capabilities

CI/CD Integration

  • Automated testing: Full test suite execution on code changes
  • Parallel execution: Concurrent test running for faster feedback
  • Coverage reporting: Comprehensive code coverage tracking with minimum thresholds
  • Performance regression: Automated performance benchmark comparison
  • Test result reporting: Detailed test reports with failure analysis and trends

Observability Testing

Metrics Validation

  • Workflow metrics: Validate WorkflowMetrics collection and accuracy
  • Performance tracking: Test response time measurement and aggregation
  • Error rate monitoring: Validate error frequency tracking and alerting
  • Resource usage: Test memory and CPU usage monitoring

Logging Testing

  • Structured logging: Validate structlog output format and content
  • Log level filtering: Test log level configuration and filtering
  • Error context: Validate ErrorContext information capture and formatting
  • Log aggregation: Test log collection and analysis capabilities

Production Testing

Load Testing

  • Concurrent queries: Test system behavior under high query load
  • Document ingestion: Test batch document processing performance
  • Memory usage: Validate memory consumption under sustained load
  • Error recovery: Test system recovery under failure conditions

Chaos Testing

  • Service failures: Test behavior when Qdrant or LMStudio becomes unavailable
  • Network issues: Test timeout and retry behavior under network problems
  • Resource constraints: Test behavior under memory and disk space limitations
  • Configuration corruption: Test recovery from invalid configuration states

Test Coverage Requirements

Code Coverage

  • Minimum threshold: 85% code coverage across all modules
  • Critical path coverage: 100% coverage for error handling and recovery paths
  • Integration coverage: Full coverage of component interaction points
  • Configuration coverage: Complete coverage of all configuration parameters and validation

Functional Coverage

  • Feature completeness: All documented features must have corresponding tests
  • Error scenarios: All identified error conditions must be tested
  • Edge cases: Boundary conditions and edge cases must be covered
  • Performance requirements: All performance requirements must be validated