All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- HTML Escape: `TableService.format_as_html()` now escapes cell content (`html.escape()`) to prevent XSS
- Base64 Image Size Limit: HTML handler limits base64 image decode to 50 MB (`MAX_IMAGE_DECODE_SIZE`)
- Path Traversal Defense: `LocalStorageBackend` and `ImageService` validate that output paths stay within the base directory
- ZIP Bomb Defense: DOCX/PPTX/XLSX/HWPX converters check decompressed size against a 1 GB threshold
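
The ZIP bomb and path traversal defenses above could look roughly like the following. This is a minimal sketch, not the project's code: the names `exceeds_decompressed_limit`, `is_within_base`, and `MAX_DECOMPRESSED_SIZE` are hypothetical, and the size check trusts the sizes declared in the archive headers.

```python
import io
import zipfile
from pathlib import Path

# Hypothetical constant; 1 GB matches the threshold named in the entry above.
MAX_DECOMPRESSED_SIZE = 1024**3


def exceeds_decompressed_limit(data: bytes, limit: int = MAX_DECOMPRESSED_SIZE) -> bool:
    """Return True if the archive's declared uncompressed size exceeds `limit`."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        # Sum the uncompressed sizes recorded for every archive member.
        total = sum(info.file_size for info in archive.infolist())
    return total > limit


def is_within_base(base_dir: str, candidate: str) -> bool:
    """Return True if `candidate`, resolved against `base_dir`, stays inside it."""
    base = Path(base_dir).resolve()
    # Resolving collapses ".." segments, so escapes become visible.
    target = (base / candidate).resolve()
    return target == base or base in target.parents
```

A caller would reject the file when `exceeds_decompressed_limit()` returns `True`, and refuse to write when `is_within_base()` returns `False`.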
- PDF Scan OCR: PDF default handler now renders scanned pages as images and inserts `[Image: ...]` tags for the OCR pipeline
- DOC Native Parsing: FIB + Piece Table based text extraction, Table Stream parsing for `extract_tables()`, OLE stream image extraction
- PPT Table/Chart Extraction: OLE2 record-based table detection and chart extraction from PowerPoint binary format
- XLS Image/Chart Extraction: OLE stream image signature scanning, BIFF chart record parsing
- Removed `libreoffice.py`: Eliminated external tool dependency (133 LOC); all formats use native binary parsing
- Delegation Depth Limit: `_delegate_to()` enforces a max 3-level delegation depth to prevent infinite loops
- RTF Merged Cell Verification: `\clmgf`, `\clmrg`, `\clvmgf`, `\clvmrg` support verified and tested
- HWPX Chart Extraction: OOXML chart XML parsed inline during `extract_text()`
- Image Handler OCR: Integrated with OCR pipeline via `[Image: ...]` tag insertion
- Tesseract OCR Engine: `TesseractOCREngine` implemented (`pytesseract` wrapper) for local OCR without an LLM
- PPTX Group Shape Charts: Recursive shape traversal includes chart extraction within group shapes
- XLSX Chart Extraction: openpyxl chart API integration for `extract_charts()`
- OCR Prompt Language: `OCRConfig.prompt_language` setting with `"ko"` / `"en"` prompt templates
- Configurable Thresholds: `format_options` for `pdf.table_size`, `pptx.max_group_depth`, `csv.delimiter_candidates`, `doc.min_text_fragment_length`
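
The delegation depth limit listed above can be illustrated with a small sketch. It assumes delegation is modelled as a mapping from a handler name to the handler it hands off to; `resolve_handler` and `DelegationDepthError` are hypothetical names, not the library's `_delegate_to()` implementation.

```python
from typing import Optional


class DelegationDepthError(RuntimeError):
    """Raised when a delegation chain is longer than the allowed depth."""


def resolve_handler(
    delegates: dict[str, Optional[str]], start: str, max_depth: int = 3
) -> str:
    """Follow the delegation chain from `start`, allowing at most `max_depth` hops."""
    current = start
    depth = 0
    while delegates.get(current) is not None:
        depth += 1
        if depth > max_depth:
            # A cycle such as a -> b -> a ends up here instead of looping forever.
            raise DelegationDepthError(
                f"delegation from {start!r} exceeded {max_depth} levels"
            )
        current = delegates[current]
    return current
```

Counting hops is simpler than tracking visited handlers, and it also bounds long non-cyclic chains.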
- 423 unit tests covering all core components:
  - `ImageService` (24 tests), `ChartService` (12 tests), `MetadataService` (10 tests)
  - `CachedDocumentProcessor` (28 tests), `AsyncDocumentProcessor` (12 tests)
  - Delegation path tests (14 tests), Security tests (16 tests)
- Integration test framework with `conftest.py`
- CSV Streaming: `max_rows` parameter limits in-memory row count; `truncated` flag in output
- XLSX Read-Only Mode: `format_options["xlsx"]["read_only"]` for memory-efficient large file processing
- OCR Parallel Processing: `OCRProcessor.max_workers` parameter enables `ThreadPoolExecutor`-based parallel OCR
- Shared ThreadPoolExecutor: `BaseHandler._timeout_executor` is a class-level shared executor (lazy init, atexit cleanup)
- CachedProcessor Extension: `process()` and `extract_chunks()` now cacheable with JSON serialization
- LRU Cache: `MemoryCacheBackend` uses `OrderedDict`-based LRU eviction instead of FIFO
- Image Size Limit: `ImageConfig.max_file_size_mb` skips oversized images with a warning
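
For readers unfamiliar with the `OrderedDict`-based LRU eviction mentioned above, here is a minimal sketch of the technique. The class name `LRUCache` and its methods are illustrative assumptions and do not mirror the actual `MemoryCacheBackend` interface.

```python
from collections import OrderedDict
from typing import Any, Hashable


class LRUCache:
    """Tiny LRU cache: the least recently used entry is evicted first."""

    def __init__(self, max_entries: int) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._data: OrderedDict[Hashable, Any] = OrderedDict()

    def get(self, key: Hashable, default: Any = None) -> Any:
        if key not in self._data:
            return default
        # A read counts as a use, so move the key to the "most recent" end.
        self._data.move_to_end(key)
        return self._data[key]

    def set(self, key: Hashable, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._max_entries:
            # popitem(last=False) removes the oldest (least recently used) entry.
            self._data.popitem(last=False)

    def __contains__(self, key: Hashable) -> bool:
        return key in self._data
```

The difference from FIFO is the `move_to_end()` call in `get()`: a FIFO cache would evict by insertion order regardless of reads.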
- Handler Comparison Table: `docs/handler_comparison.md` (feature matrix for all 15 handlers)
- Configuration Reference: `docs/configuration.md` (all config classes, options, defaults, examples)
- Error Codes Reference: `docs/error_codes.md` (exception hierarchy, error codes, troubleshooting)
- OCR Setup Guide: `docs/ocr_guide.md` (setup for 6 engines, prompt customization, parallel processing)
- Plugin Development Guide: `docs/plugin_development.md` (`BaseHandler` extension, 5-stage pipeline, testing)
- CHANGELOG v0.3.0: This changelog entry
- LangChain Integration: `ContextifierLoader(BaseLoader)` in `contextifier.integrations.langchain_loader`
  - Single document / chunked mode, OCR support, `lazy_load`
- CI/CD Pipeline: `.github/workflows/ci.yml` with lint, test matrix (Python 3.12/3.13), type-check, PyPI publish
- Docker Support: Multi-stage `Dockerfile` with Tesseract OCR and Poppler
- Password-Protected Files: `crypto_service.decrypt_if_encrypted()` via msoffcrypto-tool
  - `extract_text(password=)`, `process(password=)`, `extract_chunks(password=)` API
- License Review: PyMuPDF (AGPL-3.0) moved to optional `[pdf]` extra; guarded imports
- CSV Delimiter Confidence: `_detect_delimiter()` returns a `(delimiter, confidence)` tuple
  - `CsvParsedData.delimiter_confidence` field, exposed in `PreprocessedData.properties`
- EncodingConfig: New config class with `fallback_encodings`, `force_encoding`, `min_confidence`
  - `ProcessingConfig.with_encoding()` fluent API; wired through CSV/TSV/Text converters
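
One way a delimiter detector can return a `(delimiter, confidence)` tuple is sketched below. The scoring rule is an assumption made for illustration (the share of non-empty lines that agree on the most common per-line delimiter count); the heuristic used by `_detect_delimiter()` is not described in this changelog. The sketch also ignores quoted fields.

```python
from collections import Counter

# Hypothetical default candidates for this sketch.
DEFAULT_CANDIDATES = (",", "\t", ";", "|")


def detect_delimiter(
    sample: str, candidates: tuple[str, ...] = DEFAULT_CANDIDATES
) -> tuple[str, float]:
    """Return (delimiter, confidence) with confidence between 0.0 and 1.0."""
    lines = [line for line in sample.splitlines() if line.strip()]
    if not lines:
        return candidates[0], 0.0

    best_delim, best_conf, best_count = candidates[0], 0.0, 0
    for delim in candidates:
        # How many times does this delimiter appear on each line?
        counts = Counter(line.count(delim) for line in lines)
        modal_count, frequency = counts.most_common(1)[0]
        if modal_count == 0:
            continue  # most lines do not contain this delimiter at all
        confidence = frequency / len(lines)
        # Prefer higher agreement; break ties by the number of columns found.
        if (confidence, modal_count) > (best_conf, best_count):
            best_delim, best_conf, best_count = delim, confidence, modal_count
    return best_delim, best_conf
```

A caller could compare the returned confidence against a threshold and fall back to a default delimiter when it is low.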
v2 is a complete rewrite that is not backwards-compatible with v1. The package has moved from `contextifier` to `contextifier_new`.
- Enforced 5-stage pipeline: Convert → Preprocess → Metadata → Content → Postprocess
  - `BaseHandler` enforces execution order, so all handlers follow the same structure
  - Each stage is defined as an ABC: `Converter`, `Preprocessor`, `MetadataExtractor`, `ContentExtractor`, `Postprocessor`
- 14 format handlers: PDF, PDF-Plus, DOCX, DOC, PPTX, PPT, XLSX, XLS, CSV/TSV, HWP, HWPX, RTF, Text, Image
- HandlerRegistry: Automatic extension → handler mapping via `register_defaults()`
- Immutable config system: Frozen dataclass-based `ProcessingConfig`
  - `TagConfig`, `ImageConfig`, `ChartConfig`, `MetadataConfig`, `TableConfig`, `ChunkingConfig`, `OCRConfig`
  - Fluent builder: `config.with_tags()`, `config.with_chunking()`, ...
  - Serialization: `to_dict()` / `from_dict()`
  - Format-specific options: `config.with_format_option("pdf", ...)`
- 4 chunking strategies with automatic selection:
  - `TableChunkingStrategy` (priority 5): spreadsheet-specific
  - `PageChunkingStrategy` (priority 10): page boundary-based
  - `ProtectedChunkingStrategy` (priority 20): HTML table / protected region preservation
  - `PlainChunkingStrategy` (priority 100): recursive splitting fallback
- 5 OCR engines: OpenAI, Anthropic, Google Gemini, AWS Bedrock, vLLM
  - Convenience constructors: `from_api_key()` for each engine
  - Direct LangChain client passthrough
  - Custom prompt support
- 5 shared services (DI pattern):
  - `TagService`: page / slide / sheet tag generation
  - `ImageService`: image saving / tagging / deduplication / storage backends
  - `ChartService`: chart data formatting
  - `TableService`: table HTML / MD / Text rendering
  - `MetadataService`: metadata formatting (Korean / English)
- Unified type system (`types.py`):
  - `FileContext` TypedDict: standard input for all handlers
  - `ExtractionResult`: unified text / metadata / table / image / chart output
  - `DocumentMetadata`, `TableData`, `TableCell`, `ChartData` shared dataclasses
  - `FileCategory`, `OutputFormat`, `NamingStrategy`, `StorageType` enums
- Unified exception hierarchy (`errors.py`):
  - `ContextifierError` base exception tree
  - `FileNotFoundError`, `UnsupportedFormatError`, `HandlerNotFoundError`, etc.
- ChunkResult: `save_to_md()`, `__len__`, `__iter__`, `__getitem__` support
- DOC handler auto-detection: OLE, HTML, DOCX, RTF internal format auto-detection
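
The automatic chunking strategy selection listed above can be pictured as a priority-ordered lookup. The sketch below assumes that a lower priority number is tried first (consistent with the fallback carrying priority 100) and invents the applicability predicates and context keys; only the strategy names and priorities come from this changelog.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class StrategySpec:
    name: str
    priority: int
    can_handle: Callable[[dict], bool]  # hypothetical applicability check


# Predicates and context keys below are assumptions for illustration only.
STRATEGIES = [
    StrategySpec("PlainChunkingStrategy", 100, lambda ctx: True),
    StrategySpec("ProtectedChunkingStrategy", 20,
                 lambda ctx: ctx.get("has_protected_regions", False)),
    StrategySpec("PageChunkingStrategy", 10,
                 lambda ctx: ctx.get("has_page_tags", False)),
    StrategySpec("TableChunkingStrategy", 5,
                 lambda ctx: ctx.get("category") == "spreadsheet"),
]


def select_strategy(ctx: dict, strategies: list[StrategySpec] = STRATEGIES) -> StrategySpec:
    """Return the first applicable strategy, trying lower priority numbers first."""
    for spec in sorted(strategies, key=lambda s: s.priority):
        if spec.can_handle(ctx):
            return spec
    raise LookupError("no chunking strategy can handle this context")
```

Because the plain strategy accepts every context, it acts as the fallback whenever no more specific strategy applies.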
- All legacy v1 code in the `contextifier` package
  - `core/document_processor.py` (monolithic single file)
  - `core/functions/` (`utils.py`, individual processor modules)
  - `core/processor/` (per-handler files without unified structure)
  - `chunking/` (single `chunking.py` with all logic)
  - `ocr/ocr_engine/` (per-engine files without consistency)
- Facade pattern: `DocumentProcessor` is the sole public entry point
- Strategy pattern: Automatic chunking strategy selection
- Template Method pattern: `BaseHandler.process()` enforces the 5-stage order
- Dependency Injection: Services are created once and shared across handlers
- Registry pattern: Automatic extension → handler mapping
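
As an illustration of the Template Method pattern named above, the sketch below fixes the five-stage order in a base class and lets subclasses supply each stage. `PipelineHandler`, `PlainTextHandler`, and the stage method names are hypothetical; the real `BaseHandler` composes separate stage ABCs whose signatures are not shown in this changelog.

```python
from abc import ABC, abstractmethod


class PipelineHandler(ABC):
    """Base class whose process() fixes the order of the five stages."""

    def process(self, raw: bytes) -> str:
        # Subclasses cannot reorder these calls; they only fill in the stages.
        converted = self.convert(raw)
        preprocessed = self.preprocess(converted)
        metadata = self.extract_metadata(preprocessed)
        content = self.extract_content(preprocessed)
        return self.postprocess(metadata, content)

    @abstractmethod
    def convert(self, raw: bytes) -> str: ...

    @abstractmethod
    def preprocess(self, text: str) -> str: ...

    @abstractmethod
    def extract_metadata(self, text: str) -> dict: ...

    @abstractmethod
    def extract_content(self, text: str) -> str: ...

    @abstractmethod
    def postprocess(self, metadata: dict, content: str) -> str: ...


class PlainTextHandler(PipelineHandler):
    """Toy handler: decode, strip, record the length, emit a header line."""

    def convert(self, raw: bytes) -> str:
        return raw.decode("utf-8")

    def preprocess(self, text: str) -> str:
        return text.strip()

    def extract_metadata(self, text: str) -> dict:
        return {"length": len(text)}

    def extract_content(self, text: str) -> str:
        return text

    def postprocess(self, metadata: dict, content: str) -> str:
        return f"<meta length={metadata['length']}>\n{content}"
```

Since every stage is abstract, a handler that omits one cannot be instantiated, which is how the base class keeps all handlers structurally uniform.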
- `requirements.txt` path resolution for packaged installs.
- Initial release of Contextifier v1.
- Support for PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, CSV, TSV, HWP, HWPX, RTF, TXT, Image.
- OCR integration via OpenAI, Anthropic, Gemini, Bedrock.
- Basic text chunking with page/table awareness.
- Metadata extraction for common document formats.