Improve XML chunking algorithm implementation #5

Merged: martgra merged 1 commit into main from claude/improve-xml-chunking-01AQutrcEUg2M2qv34a66azt on Nov 20, 2025

martgra (Owner) commented on Nov 20, 2025

Major improvements:

  • Target chunk size reduced from 6800 to 512 tokens (optimized for RAG)
  • 15% overlap between chunks for better context preservation
  • Better structure preservation (lists, continuations, leddfortsettelse)
  • Hierarchical context extraction (walks up XML tree for chapter/section info)
  • Cross-reference extraction for graph-based retrieval
  • Direct XML-to-chunks processing (removed intermediate parsing step)
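
For reference, the size and overlap figures above work out as follows: 15% of 512 is 76 tokens (rounded down), so each new chunk advances by 436 tokens. The snippet below is only a minimal sketch of that arithmetic, not the code in this PR. The function name `sliding_chunks` and its parameters are hypothetical, and it assumes the input has already been tokenized into a list.

```python
from typing import List, Sequence


def sliding_chunks(
    tokens: Sequence[str],
    target_size: int = 512,
    overlap_ratio: float = 0.15,
) -> List[List[str]]:
    """Split tokens into windows of at most target_size with a fixed overlap.

    Hypothetical helper for illustration; not the LovdataChunker API.
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(target_size * overlap_ratio)  # 76 tokens for 512 at 15%
    step = target_size - overlap                # 436 new tokens per window

    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + target_size]))
        if start + target_size >= len(tokens):
            break  # the last window already reaches the end of the input
        start += step
    return chunks
```

With these defaults, a 1000-token input yields three chunks, and the last 76 tokens of one chunk are repeated at the start of the next.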

Technical changes:

  • Added new LovdataChunker class with three-tier fallback strategy
  • Updated ChunkingService to use new chunker with chunk_file() API
  • Simplified FileProcessingService by removing XMLParsingService dependency
  • Updated pipeline factory with new chunking parameters
  • Marked old components as deprecated (xml_chunker, recursive_splitter, xml_parsing_service)
  • Updated integration tests to use new API
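
The description does not spell out what the three tiers of the fallback strategy are. As a rough illustration of the general pattern only, the sketch below assumes one plausible ordering: keep a unit whole if it fits, otherwise pack whole sentences, and hard-split by word count as a last resort. The names `count_tokens` and `chunk_with_fallback` are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
import re
from typing import List


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokens.
    return len(text.split())


def chunk_with_fallback(text: str, max_tokens: int = 512) -> List[str]:
    """Illustrative three-tier fallback; the tiers themselves are assumed."""
    # Tier 1: the unit already fits, so keep it whole.
    if count_tokens(text) <= max_tokens:
        return [text]

    # Tier 2: pack whole sentences greedily into chunks.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        if count_tokens(sentence) > max_tokens:
            # Tier 3: a single sentence is too long, so hard-split by words.
            if current:
                chunks.append(" ".join(current))
                current = []
            words = sentence.split()
            for i in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[i:i + max_tokens]))
            continue
        candidate = current + [sentence]
        if count_tokens(" ".join(candidate)) > max_tokens:
            # Adding this sentence would overflow; close the current chunk.
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Ordering the tiers from least to most destructive means a hard split only happens when no natural boundary is available.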

The new algorithm provides more granular chunking at the ledd (legal paragraph) level with proper handling of complex structures like lists and nested content.
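
To make the ledd-level chunking and the "walks up the XML tree" idea concrete, here is a minimal sketch using the standard library's `xml.etree.ElementTree`. The element names (`lov`, `kapittel`, `paragraf`, `ledd`) and the `title` attribute are simplified assumptions for illustration and may not match the actual Lovdata schema or the `LovdataChunker` implementation. Since `ElementTree` elements have no parent pointer, the sketch builds a child-to-parent map first.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical, simplified document; real Lovdata XML may look different.
SAMPLE = """<lov title="Example Act">
  <kapittel title="Chapter 1">
    <paragraf title="§ 1">
      <ledd>First ledd.</ledd>
      <ledd>Second ledd.</ledd>
    </paragraf>
  </kapittel>
</lov>"""


def ledd_with_context(xml_text: str) -> List[Dict[str, object]]:
    """Return one record per ledd, with the titles of its ancestors."""
    root = ET.fromstring(xml_text)
    # Map every child element to its parent so we can walk upwards.
    parents = {child: parent for parent in root.iter() for child in parent}

    records: List[Dict[str, object]] = []
    for ledd in root.iter("ledd"):
        path: List[str] = []
        node = parents.get(ledd)
        while node is not None:
            title = node.get("title")
            if title:
                path.append(title)
            node = parents.get(node)
        path.reverse()  # outermost ancestor first
        text = "".join(ledd.itertext()).strip()
        records.append({"context": path, "text": text})
    return records
```

Each record carries its chapter and section titles, so a chunk retrieved on its own can still be traced back to its place in the document.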

martgra merged commit 8aef9eb into main on Nov 20, 2025
1 check failed
martgra deleted the claude/improve-xml-chunking-01AQutrcEUg2M2qv34a66azt branch on November 22, 2025 08:36