Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
23 changes: 23 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ USP: Structure-preserving chunking with rich metadata in a single pass.
- [Installation](#installation)
- [Quickstart](#quickstart)
- [Overview](#overview)
- [Why Chunkana](#why-chunkana)
- [Key Features](#key-features)
- [Supported Markdown constructs](#supported-markdown-constructs)
- [Requirements](#requirements)
Expand Down Expand Up @@ -82,6 +83,28 @@ for chunk in chunks:

Chunkana is a high-precision Markdown chunking library for RAG pipelines, search indexing, and LLM ingestion. It keeps semantic boundaries intact so downstream retrieval stays faithful to the original document. It is built for teams that need structure-preserving chunks at scale without custom parsers.

## Why Chunkana

### Concrete differentiators

- **Semantic guarantees**: never splits atomic structures (headers, lists, code fences, tables, LaTeX blocks) so each chunk stays valid and retrievable.
- **RAG metadata first**: every chunk carries header paths, line ranges, content type, overlap context, and strategy hints for filtering and ranking.
- **dify-markdown-chunker compatibility**: renderers emit Dify-style payloads without rewriting your ingestion pipeline.
- **Adaptive strategies**: auto-selects structure-, list-, or code-aware strategies to keep mixed documents coherent.

### Problem → Solution

- **Naive splitter breaks code fences** → **Chunkana keeps code blocks atomic and binds nearby context.**
- **Tables get fragmented and lose headers** → **Chunkana preserves tables as single semantic units.**
- **Lists lose hierarchy** → **Chunkana respects nested list structure.**
- **Math/LaTeX gets split mid-formula** → **Chunkana keeps formulas intact for retrieval.**

### Practical defaults & limits

- **Chunk size defaults**: `max_chunk_size=4096`, `min_chunk_size=512`.
- **Overlap metadata**: `overlap_size=200` characters, capped to 35% of adjacent chunk size for consistency.
- **Large-file throughput**: streaming APIs process multi-megabyte Markdown without loading everything into memory.

### Project Status / Compatibility / Support

- **Stability:** Beta (actively maintained, APIs may evolve as features land).
Expand Down