Skip to content

Refactor: Apply Single Responsibility Principle throughout codebase#4

Merged
martgra merged 1 commit into
mainfrom
claude/refactor-single-responsibility-01CcTCRTFDin1iztFUVTJQB5
Nov 19, 2025
Merged

Refactor: Apply Single Responsibility Principle throughout codebase#4
martgra merged 1 commit into
mainfrom
claude/refactor-single-responsibility-01CcTCRTFDin1iztFUVTJQB5

Conversation

@martgra
Copy link
Copy Markdown
Owner

@martgra martgra commented Nov 19, 2025

This comprehensive refactoring decouples components and ensures each class/module
has a single, well-defined responsibility.

Key Changes

1. Protocol Abstractions (Decoupling)

  • EmbeddingProvider protocol: Abstracts embedding operations

    • Decouples from OpenAI-specific implementation
    • Enables easy swapping of embedding providers (Cohere, local models, etc.)
  • VectorStoreRepository protocol: Abstracts vector storage

    • Decouples from ChromaDB-specific implementation
    • Enables easy swapping of vector databases (Pinecone, Weaviate, etc.)

2. Infrastructure Implementations

  • OpenAIEmbeddingProvider: OpenAI implementation of EmbeddingProvider
  • ChromaVectorStoreRepository: ChromaDB implementation of VectorStoreRepository

3. Domain Services (Single Responsibility)

  • XMLParsingService: Parse XML files and extract articles

    • Responsibility: XML parsing only
  • ChunkingService: Split articles into token-sized chunks

    • Responsibility: Article chunking coordination
  • EmbeddingService: Generate embeddings using provider

    • Responsibility: Embedding coordination
    • Decoupled from progress tracking (uses optional callbacks)
    • Decoupled from specific providers (uses EmbeddingProvider)
  • FileProcessingService: Orchestrate parse → chunk → embed → index

    • Responsibility: Single file processing pipeline
    • Coordinates the domain services

4. Pipeline Orchestrator

  • PipelineOrchestrator: Coordinate high-level pipeline stages
    • Responsibility: Sync → Identify → Process → Cleanup
    • Uses dependency injection for all services
    • Clean separation from infrastructure concerns

5. Simplified pipeline.py

  • Reduced from 362 lines to 126 lines
  • Changed from 7+ responsibilities to 1 responsibility
  • New responsibility: Dependency injection and wiring
  • Provides backward-compatible run_pipeline() function
  • All business logic moved to services and orchestrator

Benefits

Testability

  • Each service can be tested in isolation
  • Easy to mock dependencies using protocols
  • No need for complex integration test setup

Maintainability

  • Clear separation of concerns
  • Each class/module has single, obvious responsibility
  • Easier to understand and modify

Flexibility

  • Easy to swap implementations (OpenAI → Cohere, ChromaDB → Pinecone)
  • Can reuse services in different contexts
  • Progress tracking decoupled via callbacks

Code Quality

  • Reduced coupling between components
  • Better adherence to SOLID principles
  • More focused, cohesive modules

Architecture

CLI (cli.py)
  ↓
Pipeline Factory (pipeline.py)
  ↓
PipelineOrchestrator
  ├── FileProcessingService
  │   ├── XMLParsingService
  │   ├── ChunkingService
  │   ├── EmbeddingService (uses EmbeddingProvider)
  │   └── VectorStoreRepository
  └── VectorStoreRepository

Protocols (Abstractions):
  - EmbeddingProvider
  - VectorStoreRepository

Implementations:
  - OpenAIEmbeddingProvider
  - ChromaVectorStoreRepository

Backward Compatibility

  • Existing CLI unchanged
  • run_pipeline() function signature unchanged
  • All existing functionality preserved

This comprehensive refactoring decouples components and ensures each class/module
has a single, well-defined responsibility.

## Key Changes

### 1. Protocol Abstractions (Decoupling)
- **EmbeddingProvider protocol**: Abstracts embedding operations
  - Decouples from OpenAI-specific implementation
  - Enables easy swapping of embedding providers (Cohere, local models, etc.)

- **VectorStoreRepository protocol**: Abstracts vector storage
  - Decouples from ChromaDB-specific implementation
  - Enables easy swapping of vector databases (Pinecone, Weaviate, etc.)

### 2. Infrastructure Implementations
- **OpenAIEmbeddingProvider**: OpenAI implementation of EmbeddingProvider
- **ChromaVectorStoreRepository**: ChromaDB implementation of VectorStoreRepository

### 3. Domain Services (Single Responsibility)
- **XMLParsingService**: Parse XML files and extract articles
  - Responsibility: XML parsing only

- **ChunkingService**: Split articles into token-sized chunks
  - Responsibility: Article chunking coordination

- **EmbeddingService**: Generate embeddings using provider
  - Responsibility: Embedding coordination
  - Decoupled from progress tracking (uses optional callbacks)
  - Decoupled from specific providers (uses EmbeddingProvider)

- **FileProcessingService**: Orchestrate parse → chunk → embed → index
  - Responsibility: Single file processing pipeline
  - Coordinates the domain services

### 4. Pipeline Orchestrator
- **PipelineOrchestrator**: Coordinate high-level pipeline stages
  - Responsibility: Sync → Identify → Process → Cleanup
  - Uses dependency injection for all services
  - Clean separation from infrastructure concerns

### 5. Simplified pipeline.py
- Reduced from 362 lines to 126 lines
- Changed from 7+ responsibilities to 1 responsibility
- **New responsibility**: Dependency injection and wiring
- Provides backward-compatible `run_pipeline()` function
- All business logic moved to services and orchestrator

## Benefits

### Testability
- Each service can be tested in isolation
- Easy to mock dependencies using protocols
- No need for complex integration test setup

### Maintainability
- Clear separation of concerns
- Each class/module has single, obvious responsibility
- Easier to understand and modify

### Flexibility
- Easy to swap implementations (OpenAI → Cohere, ChromaDB → Pinecone)
- Can reuse services in different contexts
- Progress tracking decoupled via callbacks

### Code Quality
- Reduced coupling between components
- Better adherence to SOLID principles
- More focused, cohesive modules

## Architecture

```
CLI (cli.py)
  ↓
Pipeline Factory (pipeline.py)
  ↓
PipelineOrchestrator
  ├── FileProcessingService
  │   ├── XMLParsingService
  │   ├── ChunkingService
  │   ├── EmbeddingService (uses EmbeddingProvider)
  │   └── VectorStoreRepository
  └── VectorStoreRepository

Protocols (Abstractions):
  - EmbeddingProvider
  - VectorStoreRepository

Implementations:
  - OpenAIEmbeddingProvider
  - ChromaVectorStoreRepository
```

## Backward Compatibility
- Existing CLI unchanged
- run_pipeline() function signature unchanged
- All existing functionality preserved
@martgra martgra merged commit 56335f8 into main Nov 19, 2025
1 check failed
martgra pushed a commit that referenced this pull request Nov 20, 2025
Bug #1: Fix exception handling in migration command
- Added missing 'as e' to except clause
- Changed to f-strings for proper interpolation
- Location: cli.py:247-251

Bug #2: Fix incorrect skip count calculation in orchestrator
- Track total_available files correctly for both force and normal modes
- Fix calculation that was using wrong variable when force=True
- Location: pipeline_orchestrator.py:186-208

Bug #3: Remove missing total_vectors reference
- Removed reference to non-existent 'total_vectors' field in state stats
- Prevents KeyError crash in status command
- Location: cli.py:304-307

Bug #4: Fix inconsistent token check in chunker
- Changed '<' to '<=' to match rest of codebase
- Added warning logging for chunks exceeding max tokens
- Prevents silent data loss when chunks are at max_tokens
- Location: lovdata_chunker.py:372, 386-391

Tests:
- Added TestTokenLimits class with 2 new tests for Bug #4
- All non-tiktoken tests passing (19/19)
- Chunker tests blocked by tiktoken network issue (environment)
@martgra martgra deleted the claude/refactor-single-responsibility-01CcTCRTFDin1iztFUVTJQB5 branch November 22, 2025 08:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants