Skip to content

Latest commit

 

History

History
185 lines (137 loc) · 6.95 KB

File metadata and controls

185 lines (137 loc) · 6.95 KB

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

0.2.26 - 2026-04-03

Added

  • HWPX: Extract text from shapes and improve section processing
  • HWPX: Improve header/footer handling

Changed

  • Bump version to 0.2.26

0.2.25 - 2026-03-28

Added

  • DOCX: Clean field codes and improve run element extraction
  • DOC/DOCX: Add header and footer extraction
  • Excel: HTML content detection in Excel cell processing

Changed

  • PDF: Refine text extraction logic to exclude lines within table bounding boxes

0.2.24 - 2026-03-20

Added

  • Chunking: Implement small chunk merging to prevent table-title isolation
  • Chunking: Allow backward merging when blocked by page boundaries

0.2.23 - 2026-03-15

Improved

  • Excel: Enhance merged cell handling in XLS and XLSX HTML conversion

0.2.22 - 2026-03-10

Changed

  • Update version retrieval mechanism in __init__.py

0.2.21 - 2026-03-05

Changed

  • Minor internal improvements and stabilization

0.2.20 - 2026-02-28

Improved

  • Excel: Enhance XLSX and XLS layout detection to consider cells with borders as valid

0.2.18 - 2026-02-22

Changed

  • Excel: Update HTML conversion to treat all cells as data cells without header distinction

0.2.17 - 2026-02-18

Changed

  • Excel: Remove textbox and image segment extraction from sheet_processor to prevent each image/textbox from occupying a separate chunk

Added

  • Excel: XLS and XLSX textbox extraction support
  • Excel: Separate XLS and XLSX image handler refactoring

0.2.14 - 2026-02-12

Fixed

  • Chunking: Enhance clean_chunks to merge page-marker-only chunks with next chunk (solves skipped page numbers)

0.2.13 - 2026-02-08

Added

  • Chunking: Support for nested tables (tables within tables within tables) in protected region detection

0.2.12 - 2026-02-05

Fixed

  • PDF: Adjust Y gap threshold for table merging in TableDetectionEngine to prevent merging of separate tables

0.2.11 - 2026-02-02

Changed

  • PDF: Refactor import statements in pdf_table_detection.py

0.2.1 - 2026-01-30

Fixed

  • PDF: Enhance text extraction logic to handle table region extraction duplication problem

0.2.0 - 2026-01-28

Changed

  • Improve file extension handling in DocumentProcessor
  • Major version bump: stabilization of core API

0.1.5x - 2026-01-24 ~ 2026-01-27

Added

  • PDF: CJK compatibility handling and fragmented text reconstruction
  • Excel: Table processing with context extraction and improved chunking logic (respects chunk_size)
  • Chunking: Enhanced chunking logic for handling chunk size constraints

Fixed

  • PDF: Table quality validation criteria adjustment for paragraph text detection

0.1.4 - 2026-01-22

Changed

  • Refactor: Adjust validation criteria for paragraph text detection in TableQualityValidator
  • Improve comments and documentation across processors (Korean → English)

0.1.2 - 2026-01-20

Added

  • BedrockOCR: AWS Bedrock Vision model support for OCR processing
    • Supports Claude 3.5 Sonnet and other Bedrock vision models
    • Full AWS credential configuration (access key, secret key, session token, region)
    • Configurable timeouts and retry settings
  • ImageFileHandler: New handler for standalone image files (jpg, png, gif, bmp, webp)
    • Automatically uses OCR engine when available
    • Returns image tag format when OCR is not configured for later processing
  • PageTagProcessor: Centralized page/slide/sheet tag processing system
    • Unified tag generation across all document handlers
    • Configurable tag prefixes and suffixes
  • Image pattern support for OCR: Custom image tag patterns now passed to OCR engine
    • ImageProcessor.get_pattern_string() method for regex pattern generation
    • BaseOCR.set_image_pattern() and set_image_pattern_from_string() methods
    • OCR engines now recognize custom image tag formats

Changed

  • DocumentProcessor: OCR engine setter now invalidates handler registry for proper refresh
  • Handler registry: ImageFileHandler automatically registered with OCR engine support
  • QUICKSTART.md: Complete rewrite with comprehensive documentation
    • 3-stage processing pipeline documentation (File → Text → OCR → Chunks)
    • Detailed OCR configuration guide for all 5 engines
    • Tag customization examples (image, page, slide, sheet)
    • Complete API reference with all parameters

Improved

  • All Korean comments and docstrings in img_processor.py converted to English
  • Enhanced OCR integration with custom pattern matching support
  • Better separation of concerns with PageTagProcessor

0.1.0 - 2026-01-19

Added

  • Initial release of xgen_doc2chunk
  • Multi-format document support (PDF, DOCX, DOC, XLSX, XLS, PPTX, PPT, HWP, HWPX)
  • Intelligent text extraction with structure preservation
  • Table detection and extraction with HTML formatting
  • OCR integration (OpenAI, Anthropic, Google Gemini, vLLM)
  • Smart chunking with semantic awareness
  • Metadata extraction
  • Support for 20+ code file formats
  • Korean document support (HWP, HWPX)

Features

  • DocumentProcessor class for easy document processing
  • Configurable chunk size and overlap
  • Protected regions for code blocks
  • Pluggable OCR engine architecture
  • Automatic encoding detection for text files
  • Chart and image extraction from Office documents