All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
0.2.26 - 2026-04-03
- HWPX: Extract text from shapes and improve section processing
- HWPX: Improve header/footer handling
- Bump version to 0.2.26
0.2.25 - 2026-03-28
- DOCX: Clean field codes and improve run element extraction
- DOC/DOCX: Add header and footer extraction
- Excel: HTML content detection in Excel cell processing
- PDF: Refine text extraction logic to exclude lines within table bounding boxes
0.2.24 - 2026-03-20
- Chunking: Implement small chunk merging to prevent table-title isolation
- Chunking: Allow backward merging when blocked by page boundaries
0.2.23 - 2026-03-15
- Excel: Enhance merged cell handling in XLS and XLSX HTML conversion
0.2.22 - 2026-03-10
- Update version retrieval mechanism in __init__.py
0.2.21 - 2026-03-05
- Minor internal improvements and stabilization
0.2.20 - 2026-02-28
- Excel: Enhance XLSX and XLS layout detection to consider cells with borders as valid
0.2.18 - 2026-02-22
- Excel: Update HTML conversion to treat all cells as data cells without header distinction
0.2.17 - 2026-02-18
- Excel: Remove textbox and image segment extraction from sheet_processor to prevent each image/textbox from occupying a separate chunk
- Excel: XLS and XLSX textbox extraction support
- Excel: Separate XLS and XLSX image handler refactoring
0.2.14 - 2026-02-12
- Chunking: Enhance clean_chunks to merge page-marker-only chunks with the next chunk (solves skipped page numbers)
0.2.13 - 2026-02-08
- Chunking: Support for nested tables (tables within tables within tables) in protected region detection
0.2.12 - 2026-02-05
- PDF: Adjust Y gap threshold for table merging in TableDetectionEngine to prevent merging of separate tables
0.2.11 - 2026-02-02
- PDF: Refactor import statements in pdf_table_detection.py
0.2.1 - 2026-01-30
- PDF: Enhance text extraction logic to avoid duplicate extraction of text inside table regions
0.2.0 - 2026-01-28
- Improve file extension handling in DocumentProcessor
- Major version bump: stabilization of core API
0.1.5x - 2026-01-24 ~ 2026-01-27
- PDF: CJK compatibility handling and fragmented text reconstruction
- Excel: Table processing with context extraction and improved chunking logic (respects chunk_size)
- Chunking: Enhanced chunking logic for handling chunk size constraints
- PDF: Table quality validation criteria adjustment for paragraph text detection
0.1.4 - 2026-01-22
- Refactor: Adjust validation criteria for paragraph text detection in TableQualityValidator
- Improve comments and documentation across processors (Korean → English)
0.1.2 - 2026-01-20
- BedrockOCR: AWS Bedrock Vision model support for OCR processing
  - Supports Claude 3.5 Sonnet and other Bedrock vision models
  - Full AWS credential configuration (access key, secret key, session token, region)
  - Configurable timeouts and retry settings
- ImageFileHandler: New handler for standalone image files (jpg, png, gif, bmp, webp)
  - Automatically uses OCR engine when available
  - Returns image tag format when OCR is not configured for later processing
- PageTagProcessor: Centralized page/slide/sheet tag processing system
  - Unified tag generation across all document handlers
  - Configurable tag prefixes and suffixes
- Image pattern support for OCR: Custom image tag patterns now passed to OCR engine
  - ImageProcessor.get_pattern_string() method for regex pattern generation
  - BaseOCR.set_image_pattern() and set_image_pattern_from_string() methods
  - OCR engines now recognize custom image tag formats
- DocumentProcessor: OCR engine setter now invalidates handler registry for proper refresh
- Handler registry: ImageFileHandler automatically registered with OCR engine support
- QUICKSTART.md: Complete rewrite with comprehensive documentation
  - 3-stage processing pipeline documentation (File → Text → OCR → Chunks)
  - Detailed OCR configuration guide for all 5 engines
  - Tag customization examples (image, page, slide, sheet)
  - Complete API reference with all parameters
- All Korean comments and docstrings in img_processor.py converted to English
- Enhanced OCR integration with custom pattern matching support
- Better separation of concerns with PageTagProcessor
0.1.0 - 2026-01-19
- Initial release of xgen_doc2chunk
- Multi-format document support (PDF, DOCX, DOC, XLSX, XLS, PPTX, PPT, HWP, HWPX)
- Intelligent text extraction with structure preservation
- Table detection and extraction with HTML formatting
- OCR integration (OpenAI, Anthropic, Google Gemini, vLLM)
- Smart chunking with semantic awareness
- Metadata extraction
- Support for 20+ code file formats
- Korean document support (HWP, HWPX)
- DocumentProcessor class for easy document processing
- Configurable chunk size and overlap
- Protected regions for code blocks
- Pluggable OCR engine architecture
- Automatic encoding detection for text files
- Chart and image extraction from Office documents