Refactor: Apply Single Responsibility Principle throughout codebase by martgra · Pull Request #4 · martgra/lovdata-pipeline

martgra · 2025-11-19T21:17:01Z

This comprehensive refactoring decouples components and ensures each class/module
has a single, well-defined responsibility.

Key Changes

1. Protocol Abstractions (Decoupling)

EmbeddingProvider protocol: Abstracts embedding operations
- Decouples from OpenAI-specific implementation
- Enables easy swapping of embedding providers (Cohere, local models, etc.)
VectorStoreRepository protocol: Abstracts vector storage
- Decouples from ChromaDB-specific implementation
- Enables easy swapping of vector databases (Pinecone, Weaviate, etc.)

2. Infrastructure Implementations

OpenAIEmbeddingProvider: OpenAI implementation of EmbeddingProvider
ChromaVectorStoreRepository: ChromaDB implementation of VectorStoreRepository

3. Domain Services (Single Responsibility)

XMLParsingService: Parse XML files and extract articles
- Responsibility: XML parsing only
ChunkingService: Split articles into token-sized chunks
- Responsibility: Article chunking coordination
EmbeddingService: Generate embeddings using provider
- Responsibility: Embedding coordination
- Decoupled from progress tracking (uses optional callbacks)
- Decoupled from specific providers (uses EmbeddingProvider)
FileProcessingService: Orchestrate parse → chunk → embed → index
- Responsibility: Single file processing pipeline
- Coordinates the domain services

4. Pipeline Orchestrator

PipelineOrchestrator: Coordinate high-level pipeline stages
- Responsibility: Sync → Identify → Process → Cleanup
- Uses dependency injection for all services
- Clean separation from infrastructure concerns

5. Simplified pipeline.py

Reduced from 362 lines to 126 lines
Changed from 7+ responsibilities to 1 responsibility
New responsibility: Dependency injection and wiring
Provides backward-compatible run_pipeline() function
All business logic moved to services and orchestrator

Benefits

Testability

Each service can be tested in isolation
Easy to mock dependencies using protocols
No need for complex integration test setup

Maintainability

Clear separation of concerns
Each class/module has single, obvious responsibility
Easier to understand and modify

Flexibility

Easy to swap implementations (OpenAI → Cohere, ChromaDB → Pinecone)
Can reuse services in different contexts
Progress tracking decoupled via callbacks

Code Quality

Reduced coupling between components
Better adherence to SOLID principles
More focused, cohesive modules

Architecture

CLI (cli.py)
  ↓
Pipeline Factory (pipeline.py)
  ↓
PipelineOrchestrator
  ├── FileProcessingService
  │   ├── XMLParsingService
  │   ├── ChunkingService
  │   ├── EmbeddingService (uses EmbeddingProvider)
  │   └── VectorStoreRepository
  └── VectorStoreRepository

Protocols (Abstractions):
  - EmbeddingProvider
  - VectorStoreRepository

Implementations:
  - OpenAIEmbeddingProvider
  - ChromaVectorStoreRepository

Backward Compatibility

Existing CLI unchanged
run_pipeline() function signature unchanged
All existing functionality preserved

This comprehensive refactoring decouples components and ensures each class/module has a single, well-defined responsibility. ## Key Changes ### 1. Protocol Abstractions (Decoupling) - **EmbeddingProvider protocol**: Abstracts embedding operations - Decouples from OpenAI-specific implementation - Enables easy swapping of embedding providers (Cohere, local models, etc.) - **VectorStoreRepository protocol**: Abstracts vector storage - Decouples from ChromaDB-specific implementation - Enables easy swapping of vector databases (Pinecone, Weaviate, etc.) ### 2. Infrastructure Implementations - **OpenAIEmbeddingProvider**: OpenAI implementation of EmbeddingProvider - **ChromaVectorStoreRepository**: ChromaDB implementation of VectorStoreRepository ### 3. Domain Services (Single Responsibility) - **XMLParsingService**: Parse XML files and extract articles - Responsibility: XML parsing only - **ChunkingService**: Split articles into token-sized chunks - Responsibility: Article chunking coordination - **EmbeddingService**: Generate embeddings using provider - Responsibility: Embedding coordination - Decoupled from progress tracking (uses optional callbacks) - Decoupled from specific providers (uses EmbeddingProvider) - **FileProcessingService**: Orchestrate parse → chunk → embed → index - Responsibility: Single file processing pipeline - Coordinates the domain services ### 4. Pipeline Orchestrator - **PipelineOrchestrator**: Coordinate high-level pipeline stages - Responsibility: Sync → Identify → Process → Cleanup - Uses dependency injection for all services - Clean separation from infrastructure concerns ### 5. Simplified pipeline.py - Reduced from 362 lines to 126 lines - Changed from 7+ responsibilities to 1 responsibility - **New responsibility**: Dependency injection and wiring - Provides backward-compatible `run_pipeline()` function - All business logic moved to services and orchestrator ## Benefits ### Testability - Each service can be tested in isolation - Easy to mock dependencies using protocols - No need for complex integration test setup ### Maintainability - Clear separation of concerns - Each class/module has single, obvious responsibility - Easier to understand and modify ### Flexibility - Easy to swap implementations (OpenAI → Cohere, ChromaDB → Pinecone) - Can reuse services in different contexts - Progress tracking decoupled via callbacks ### Code Quality - Reduced coupling between components - Better adherence to SOLID principles - More focused, cohesive modules ## Architecture ``` CLI (cli.py) ↓ Pipeline Factory (pipeline.py) ↓ PipelineOrchestrator ├── FileProcessingService │ ├── XMLParsingService │ ├── ChunkingService │ ├── EmbeddingService (uses EmbeddingProvider) │ └── VectorStoreRepository └── VectorStoreRepository Protocols (Abstractions): - EmbeddingProvider - VectorStoreRepository Implementations: - OpenAIEmbeddingProvider - ChromaVectorStoreRepository ``` ## Backward Compatibility - Existing CLI unchanged - run_pipeline() function signature unchanged - All existing functionality preserved

Bug #1: Fix exception handling in migration command - Added missing 'as e' to except clause - Changed to f-strings for proper interpolation - Location: cli.py:247-251 Bug #2: Fix incorrect skip count calculation in orchestrator - Track total_available files correctly for both force and normal modes - Fix calculation that was using wrong variable when force=True - Location: pipeline_orchestrator.py:186-208 Bug #3: Remove missing total_vectors reference - Removed reference to non-existent 'total_vectors' field in state stats - Prevents KeyError crash in status command - Location: cli.py:304-307 Bug #4: Fix inconsistent token check in chunker - Changed '<' to '<=' to match rest of codebase - Added warning logging for chunks exceeding max tokens - Prevents silent data loss when chunks are at max_tokens - Location: lovdata_chunker.py:372, 386-391 Tests: - Added TestTokenLimits class with 2 new tests for Bug #4 - All non-tiktoken tests passing (19/19) - Chunker tests blocked by tiktoken network issue (environment)

martgra merged commit 56335f8 into main Nov 19, 2025
1 check failed

martgra mentioned this pull request Nov 20, 2025

Fix 4 critical bugs found during app review #6

Merged

martgra deleted the claude/refactor-single-responsibility-01CcTCRTFDin1iztFUVTJQB5 branch November 22, 2025 08:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor: Apply Single Responsibility Principle throughout codebase#4

Refactor: Apply Single Responsibility Principle throughout codebase#4
martgra merged 1 commit into
mainfrom
claude/refactor-single-responsibility-01CcTCRTFDin1iztFUVTJQB5

martgra commented Nov 19, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

martgra commented Nov 19, 2025

Key Changes

1. Protocol Abstractions (Decoupling)

2. Infrastructure Implementations

3. Domain Services (Single Responsibility)

4. Pipeline Orchestrator

5. Simplified pipeline.py

Benefits

Testability

Maintainability

Flexibility

Code Quality

Architecture

Backward Compatibility

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants