Refactor: Apply Single Responsibility Principle throughout codebase #4
Merged
martgra merged 1 commit on Nov 19, 2025
This comprehensive refactoring decouples components and ensures each class/module has a single, well-defined responsibility.

## Key Changes

### 1. Protocol Abstractions (Decoupling)

- **EmbeddingProvider protocol**: Abstracts embedding operations
  - Decouples from OpenAI-specific implementation
  - Enables easy swapping of embedding providers (Cohere, local models, etc.)
- **VectorStoreRepository protocol**: Abstracts vector storage
  - Decouples from ChromaDB-specific implementation
  - Enables easy swapping of vector databases (Pinecone, Weaviate, etc.)

### 2. Infrastructure Implementations

- **OpenAIEmbeddingProvider**: OpenAI implementation of EmbeddingProvider
- **ChromaVectorStoreRepository**: ChromaDB implementation of VectorStoreRepository

### 3. Domain Services (Single Responsibility)

- **XMLParsingService**: Parse XML files and extract articles
  - Responsibility: XML parsing only
- **ChunkingService**: Split articles into token-sized chunks
  - Responsibility: Article chunking coordination
- **EmbeddingService**: Generate embeddings using provider
  - Responsibility: Embedding coordination
  - Decoupled from progress tracking (uses optional callbacks)
  - Decoupled from specific providers (uses EmbeddingProvider)
- **FileProcessingService**: Orchestrate parse → chunk → embed → index
  - Responsibility: Single file processing pipeline
  - Coordinates the domain services

### 4. Pipeline Orchestrator

- **PipelineOrchestrator**: Coordinate high-level pipeline stages
  - Responsibility: Sync → Identify → Process → Cleanup
  - Uses dependency injection for all services
  - Clean separation from infrastructure concerns

### 5. Simplified pipeline.py

- Reduced from 362 lines to 126 lines
- Changed from 7+ responsibilities to 1 responsibility
- **New responsibility**: Dependency injection and wiring
- Provides backward-compatible `run_pipeline()` function
- All business logic moved to services and orchestrator

## Benefits

### Testability

- Each service can be tested in isolation
- Easy to mock dependencies using protocols
- No need for complex integration test setup

### Maintainability

- Clear separation of concerns
- Each class/module has single, obvious responsibility
- Easier to understand and modify

### Flexibility

- Easy to swap implementations (OpenAI → Cohere, ChromaDB → Pinecone)
- Can reuse services in different contexts
- Progress tracking decoupled via callbacks

### Code Quality

- Reduced coupling between components
- Better adherence to SOLID principles
- More focused, cohesive modules

## Architecture

```
CLI (cli.py)
    ↓
Pipeline Factory (pipeline.py)
    ↓
PipelineOrchestrator
    ├── FileProcessingService
    │   ├── XMLParsingService
    │   ├── ChunkingService
    │   ├── EmbeddingService (uses EmbeddingProvider)
    │   └── VectorStoreRepository
    └── VectorStoreRepository

Protocols (Abstractions):
    - EmbeddingProvider
    - VectorStoreRepository

Implementations:
    - OpenAIEmbeddingProvider
    - ChromaVectorStoreRepository
```

## Backward Compatibility

- Existing CLI unchanged
- run_pipeline() function signature unchanged
- All existing functionality preserved
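
For readers who want a concrete picture of the protocol-based decoupling and the optional progress callback described under Key Changes, here is a minimal sketch. Only the class names `EmbeddingProvider`, `VectorStoreRepository`, and `EmbeddingService` come from this PR. Every method name and signature (`embed`, `upsert`, `count`, `embed_chunks`, `on_progress`, `batch_size`) and both in-memory test doubles are hypothetical and may not match the merged code.

```python
from typing import Callable, Optional, Protocol, Sequence


class EmbeddingProvider(Protocol):
    """Assumed shape of the embedding abstraction (method name is hypothetical)."""

    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class VectorStoreRepository(Protocol):
    """Assumed shape of the vector storage abstraction (methods are hypothetical)."""

    def upsert(self, ids: Sequence[str], vectors: Sequence[Sequence[float]]) -> None: ...

    def count(self) -> int: ...


class FakeEmbeddingProvider:
    """Test double: maps each text to a deterministic 2-dimensional vector."""

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        return [[float(len(t)), 1.0] for t in texts]


class InMemoryVectorStore:
    """Test double: keeps vectors in a dict instead of a real database."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def upsert(self, ids: Sequence[str], vectors: Sequence[Sequence[float]]) -> None:
        for item_id, vector in zip(ids, vectors):
            self._vectors[item_id] = list(vector)

    def count(self) -> int:
        return len(self._vectors)


class EmbeddingService:
    """Coordinates embedding; depends only on the protocol, not on a vendor SDK."""

    def __init__(self, provider: EmbeddingProvider, batch_size: int = 2) -> None:
        self._provider = provider
        self._batch_size = batch_size

    def embed_chunks(
        self,
        chunks: Sequence[str],
        on_progress: Optional[Callable[[int, int], None]] = None,
    ) -> list[list[float]]:
        vectors: list[list[float]] = []
        for start in range(0, len(chunks), self._batch_size):
            batch = chunks[start:start + self._batch_size]
            vectors.extend(self._provider.embed(batch))
            # Progress reporting is optional, so the service has no UI dependency.
            if on_progress is not None:
                on_progress(len(vectors), len(chunks))
        return vectors


# Wiring by constructor injection, with test doubles instead of OpenAI/ChromaDB.
progress_events: list[tuple[int, int]] = []
service = EmbeddingService(FakeEmbeddingProvider(), batch_size=2)
store = InMemoryVectorStore()
chunks = ["a", "bb", "ccc"]
vectors = service.embed_chunks(
    chunks, on_progress=lambda done, total: progress_events.append((done, total))
)
store.upsert(["c0", "c1", "c2"], vectors)
```

Because the protocols are structural, the test doubles need no inheritance. Any object with matching methods can be injected, which is what makes isolated unit tests straightforward.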
martgra pushed a commit that referenced this pull request on Nov 20, 2025
Bug #1: Fix exception handling in migration command
- Added missing 'as e' to except clause
- Changed to f-strings for proper interpolation
- Location: cli.py:247-251

Bug #2: Fix incorrect skip count calculation in orchestrator
- Track total_available files correctly for both force and normal modes
- Fix calculation that was using wrong variable when force=True
- Location: pipeline_orchestrator.py:186-208

Bug #3: Remove missing total_vectors reference
- Removed reference to non-existent 'total_vectors' field in state stats
- Prevents KeyError crash in status command
- Location: cli.py:304-307

Bug #4: Fix inconsistent token check in chunker
- Changed '<' to '<=' to match rest of codebase
- Added warning logging for chunks exceeding max tokens
- Prevents silent data loss when chunks are at max_tokens
- Location: lovdata_chunker.py:372, 386-391

Tests:
- Added TestTokenLimits class with 2 new tests for Bug #4
- All non-tiktoken tests passing (19/19)
- Chunker tests blocked by tiktoken network issue (environment)
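
The boundary condition behind Bug #4 can be illustrated with a small sketch. It is not the code in lovdata_chunker.py. The function names `count_tokens` and `select_chunks` are hypothetical, token counting is replaced by a whitespace split so the example runs without tiktoken, and skipping oversized chunks after the warning is an assumption about the intended behavior.

```python
import logging

logger = logging.getLogger("chunker_sketch")


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def select_chunks(chunks: list[str], max_tokens: int) -> list[str]:
    """Keep chunks of at most max_tokens tokens and warn about larger ones."""
    kept: list[str] = []
    for chunk in chunks:
        n_tokens = count_tokens(chunk)
        # '<=' keeps a chunk that is exactly at the limit; a strict '<' would
        # drop it without any message.
        if n_tokens <= max_tokens:
            kept.append(chunk)
        else:
            logger.warning(
                "Chunk has %d tokens, exceeding max_tokens=%d", n_tokens, max_tokens
            )
    return kept


kept = select_chunks(["a b", "a b c", "a b c d"], max_tokens=3)
```

Here the three-token chunk sits exactly at the limit and is kept, while the four-token chunk triggers a warning instead of disappearing silently.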