Merged
169 changes: 83 additions & 86 deletions README.md
@@ -1,24 +1,39 @@
# Chunkana

Intelligent Markdown chunking library for RAG systems.
[![GitHub Repository](https://img.shields.io/badge/GitHub-Chunkana-181717?logo=github)](https://github.com/asukhodko/chunkana)
[![PyPI version](https://img.shields.io/pypi/v/chunkana.svg)](https://pypi.org/project/chunkana/)
[![Python versions](https://img.shields.io/pypi/pyversions/chunkana.svg)](https://pypi.org/project/chunkana/)
[![License](https://img.shields.io/pypi/l/chunkana.svg)](LICENSE)
[![Downloads](https://img.shields.io/pypi/dm/chunkana.svg)](https://pypi.org/project/chunkana/)

## Features
**Chunkana** is a high-precision Markdown chunking library for RAG pipelines, search indexing, and LLM ingestion. It produces semantically correct Markdown chunks by respecting headers, code blocks, tables, and LaTeX while keeping the output retrieval-ready.

- 🧠 **Smart chunking**: Automatically selects optimal strategy based on content
- 📦 **Atomic blocks**: Preserves code blocks, tables, and LaTeX formulas
- 🌳 **Hierarchical**: Navigate chunks by header structure with tree invariant validation
- 📊 **Rich metadata**: Header paths, content types, overlap context
- 🔄 **Streaming**: Process large files (>10MB) efficiently
- 🎯 **Multiple renderers**: JSON, inline metadata, Dify-compatible
- ✅ **Quality assurance**: Automatic dangling header prevention and micro-chunk minimization
If you're looking for a **semantic Markdown chunker**, **Markdown splitter**, or **Markdown document segmenter** that preserves structure for LLM context windows, Chunkana is built for exactly that.

## Why Chunkana

Chunkana turns messy Markdown into clean, structured chunks that retain meaning:

- **Semantic correctness**: preserves headers, lists, tables, code blocks, and math without splitting them mid-block.
- **RAG-ready metadata**: header paths, content types, line ranges, and overlap context.
- **Smart strategy selection**: automatically adapts to code-heavy, list-heavy, or structural documents.
- **Hierarchical navigation**: build a chunk tree for section-aware retrieval.
- **Streaming for large files**: chunk multi-megabyte documents without loading everything into memory.
- **Compatibility**: output formats for Dify and JSON APIs.
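
The "smart strategy selection" bullet can be pictured with a minimal sketch. It reuses the strategy names and default thresholds listed in `docs/config.md`; the `DocStats` container, the `pick_strategy` helper, the priority order, and the `>=` comparisons are assumptions made for illustration, not Chunkana's actual selection logic.

```python
from dataclasses import dataclass


@dataclass
class DocStats:
    """Hypothetical per-document statistics (not a Chunkana type)."""
    code_ratio: float   # share of the document inside code blocks
    header_count: int   # number of Markdown headers
    list_ratio: float   # share of the document inside lists
    list_count: int     # number of lists


def pick_strategy(
    stats: DocStats,
    code_threshold: float = 0.3,
    structure_threshold: int = 3,
    list_ratio_threshold: float = 0.4,
    list_count_threshold: int = 5,
) -> str:
    """Return a strategy name; the priority order below is an assumption."""
    if stats.code_ratio >= code_threshold:
        return "code_aware"
    if (stats.list_ratio >= list_ratio_threshold
            or stats.list_count >= list_count_threshold):
        return "list_aware"
    if stats.header_count >= structure_threshold:
        return "structural"
    return "fallback"


print(pick_strategy(DocStats(code_ratio=0.5, header_count=10, list_ratio=0.0, list_count=0)))
```

To bypass any automatic choice, the configuration table documents a `strategy_override` parameter.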

## Installation

```bash
pip install chunkana
```

## Quick Start
Optional extras:

```bash
pip install "chunkana[docs]"
```

## Quick start

```python
from chunkana import chunk_markdown
@@ -42,10 +57,12 @@ def hello():

chunks = chunk_markdown(text)
for chunk in chunks:
    print(f"Lines {chunk.start_line}-{chunk.end_line}: {chunk.metadata['header_path']}")
    print(f"{chunk.start_line}-{chunk.end_line}: {chunk.metadata['header_path']}")
```

## Configuration
## Usage examples

### 1) Tune chunk sizes and overlap

```python
from chunkana import chunk_markdown, ChunkerConfig
@@ -59,120 +76,100 @@ config = ChunkerConfig(
chunks = chunk_markdown(text, config)
```

### Hierarchical Chunking Configuration

For hierarchical chunking with tree structure validation:
### 2) Build a hierarchical chunk tree

```python
from chunkana import MarkdownChunker, ChunkConfig

config = ChunkConfig(
    max_chunk_size=1000,
    min_chunk_size=100,
    overlap_size=100,
    validate_invariants=True,  # Enable tree invariant validation (default: True)
    strict_mode=False,  # Auto-fix violations vs raise exceptions (default: False)
)

chunker = MarkdownChunker(config)
chunker = MarkdownChunker(ChunkConfig(validate_invariants=True))
result = chunker.chunk_hierarchical(text)

# Navigate the hierarchy
root = result.get_chunk(result.root_id)
children = result.get_children(result.root_id)
flat_chunks = result.get_flat_chunks()
flat_chunks = result.get_flat_chunks() # leaf + significant parent chunks
```

**Configuration options:**
- `validate_invariants` (default: `True`): Validates tree invariants after construction
- `strict_mode` (default: `False`): When `True`, raises exceptions on invariant violations; when `False`, auto-fixes issues and logs warnings

## Exception Handling

Chunkana provides a hierarchy of exceptions for error handling:
### 3) Stream large Markdown files

```python
from chunkana import (
    ChunkanaError,  # Base exception for all chunkana errors
    HierarchicalInvariantError,  # Tree structure violations
    ValidationError,  # Validation failures
    ConfigurationError,  # Invalid configuration
    TreeConstructionError,  # Tree building failures
)
from chunkana import MarkdownChunker

try:
    result = chunker.chunk_hierarchical(text)
except HierarchicalInvariantError as e:
    print(f"Invariant violation: {e.invariant}")
    print(f"Chunk ID: {e.chunk_id}")
    print(f"Suggested fix: {e.suggested_fix}")
except ChunkanaError as e:
    print(f"Chunking error: {e}")
chunker = MarkdownChunker()
for chunk in chunker.chunk_file_streaming("docs/handbook.md"):
    print(chunk.metadata["chunk_index"], chunk.size)
```

## Renderers
### 4) Emit Dify-compatible output

```python
from chunkana import chunk_markdown
from chunkana.renderers import render_dify_style, render_json
from chunkana.renderers import render_dify_style

chunks = chunk_markdown(text)

# JSON output
json_output = render_json(chunks)

# Dify-compatible format
dify_output = render_dify_style(chunks)
output = render_dify_style(chunks)
```

## Quality Features

### Dangling Header Prevention

Chunkana automatically prevents headers from being separated from their content. When a chunk would end with a header (like `#### Details`), the header is moved to the next chunk to maintain semantic coherence.
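
A minimal sketch of this idea, assuming chunks are plain strings and that only a single trailing header needs to move. The `move_dangling_headers` helper and the regular expression are hypothetical and are not Chunkana's implementation.

```python
import re

# ATX-style header line such as "#### Details" (assumption: no Setext headers,
# and no "#" comment lines from code blocks at the end of a chunk).
HEADER_RE = re.compile(r"^#{1,6}\s+\S")


def move_dangling_headers(chunks):
    """Move a header that ends chunk i to the start of chunk i + 1."""
    result = list(chunks)
    for i in range(len(result) - 1):
        lines = result[i].rstrip().split("\n")
        if HEADER_RE.match(lines[-1]):
            header = lines.pop()
            result[i] = "\n".join(lines).rstrip()
            result[i + 1] = header + "\n\n" + result[i + 1]
    # Drop chunks that consisted of nothing but the moved header.
    return [chunk for chunk in result if chunk]


chunks = ["Intro text.\n\n#### Details", "The details follow."]
print(move_dangling_headers(chunks))
```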

### Micro-Chunk Minimization

Small chunks are intelligently merged with adjacent content when they lack structural significance, reducing fragmentation while preserving important standalone elements like code blocks and tables.
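
The merging behavior can be sketched as follows, assuming each chunk is a `(content, is_atomic)` tuple and that "small" means shorter than a `min_size` character count. The `merge_micro_chunks` helper and the merge-into-previous rule are illustrative assumptions, not the library's algorithm.

```python
def merge_micro_chunks(chunks, min_size=100):
    """Merge small non-atomic chunks into the preceding non-atomic chunk.

    chunks: list of (content, is_atomic) tuples, where is_atomic marks
    standalone elements such as code blocks and tables.
    """
    merged = []
    for content, is_atomic in chunks:
        previous_is_mergeable = bool(merged) and not merged[-1][1]
        if len(content) < min_size and not is_atomic and previous_is_mergeable:
            previous_content, _ = merged.pop()
            merged.append((previous_content + "\n\n" + content, False))
        else:
            # Atomic blocks and sufficiently large chunks are kept as they are.
            merged.append((content, is_atomic))
    return merged


chunks = [
    ("A" * 150, False),
    ("tiny", False),               # merged into the previous chunk
    ("| a | b |\n|---|---|", True),  # small but atomic, so it stays standalone
    ("B" * 120, False),
]
print(len(merge_micro_chunks(chunks)))
```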
### 5) Adaptive chunk sizing for mixed documents

### Tree Invariant Validation

Hierarchical chunking validates:
- **is_leaf consistency**: Leaf status matches children presence
- **Parent-child bidirectionality**: All relationships are symmetric
- **No orphaned chunks**: Every chunk is reachable from root
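
The three checks above can be sketched on a toy tree. The `Node` dataclass, its field names, and `check_invariants` are hypothetical stand-ins chosen for this example; they do not mirror Chunkana's internal types.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical chunk node (not a Chunkana type)."""
    id: str
    parent_id: Optional[str] = None
    children_ids: List[str] = field(default_factory=list)
    is_leaf: bool = True


def check_invariants(nodes: Dict[str, Node], root_id: str) -> List[str]:
    errors = []
    for node in nodes.values():
        # is_leaf consistency: a leaf has no children, a non-leaf has some.
        if node.is_leaf != (not node.children_ids):
            errors.append(f"is_leaf mismatch: {node.id}")
        # Bidirectionality: each child must point back to this node.
        for child_id in node.children_ids:
            child = nodes.get(child_id)
            if child is None or child.parent_id != node.id:
                errors.append(f"asymmetric link: {node.id} -> {child_id}")
        # Bidirectionality: the parent must list this node as a child.
        if node.parent_id is not None:
            parent = nodes.get(node.parent_id)
            if parent is None or node.id not in parent.children_ids:
                errors.append(f"asymmetric link: {node.parent_id} -> {node.id}")
    # No orphans: every node must be reachable from the root.
    seen, stack = set(), [root_id]
    while stack:
        current = stack.pop()
        if current in seen or current not in nodes:
            continue
        seen.add(current)
        stack.extend(nodes[current].children_ids)
    errors.extend(f"orphaned chunk: {node_id}" for node_id in nodes if node_id not in seen)
    return errors


tree = {
    "root": Node("root", None, ["a"], is_leaf=False),
    "a": Node("a", "root"),
}
print(check_invariants(tree, "root"))
```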

### Line Range Contract (Hierarchical Mode)
```python
from chunkana import chunk_markdown, ChunkerConfig
from chunkana.adaptive_sizing import AdaptiveSizeConfig

In hierarchical chunking mode, `start_line` and `end_line` follow a specific contract:
config = ChunkerConfig(
    use_adaptive_sizing=True,
    adaptive_config=AdaptiveSizeConfig(
        base_size=1500,
        code_weight=0.4,
        min_size=500,
        max_size=8000,
    ),
)

- **Leaf nodes**: Line range covers only the chunk's own content
- **Internal nodes**: Line range covers only the node's own content (not children)
- **Root node**: Line range covers the entire document (1 to last line)
chunks = chunk_markdown(text, config)
```

**Important**: The sum of children's line ranges does NOT equal the parent's range. The parent contains only its "header" content, while children contain detailed content. This is by design for hierarchical navigation.
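
One way to read this contract is as a pair of checks: the root spans the whole document, and a non-root parent's own range does not overlap the ranges of its children. The sketch below encodes that reading with plain dictionaries; `check_line_ranges` and its data layout are assumptions for illustration only.

```python
def check_line_ranges(ranges, children, root_id, total_lines):
    """ranges: id -> (start_line, end_line), 1-based and inclusive.
    children: parent id -> list of child ids.
    """
    problems = []
    if ranges[root_id] != (1, total_lines):
        problems.append("root must span the whole document")
    for parent_id, child_ids in children.items():
        if parent_id == root_id:
            continue  # the root is the one node allowed to cover its children
        parent_start, parent_end = ranges[parent_id]
        for child_id in child_ids:
            child_start, child_end = ranges[child_id]
            # Two inclusive ranges overlap when each starts before the other ends.
            if child_start <= parent_end and parent_start <= child_end:
                problems.append(f"{parent_id} overlaps child {child_id}")
    return problems


ranges = {"root": (1, 10), "section": (1, 2), "leaf": (3, 10)}
children = {"root": ["section"], "section": ["leaf"]}
print(check_line_ranges(ranges, children, "root", total_lines=10))
```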
## Renderers

```python
result = chunker.chunk_hierarchical(text)
root = result.get_chunk(result.root_id)
from chunkana.renderers import (
    render_dify_style,
    render_json,
    render_inline_metadata,
    render_with_embedded_overlap,
)
```

# Root covers entire document
print(f"Root: lines {root.start_line}-{root.end_line}")
- **render_dify_style** — `<metadata>` blocks for Dify.
- **render_json** — list of dictionaries for JSON APIs.
- **render_inline_metadata** — HTML comment metadata inline.
- **render_with_embedded_overlap** — injects overlap into text for RAG windows.
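
To make "injects overlap into text" concrete, here is a minimal sketch that assumes overlap lives in the `previous_content` and `next_content` metadata keys described in `docs/config.md`. The `embed_overlap` function and its newline separator are hypothetical; the real `render_with_embedded_overlap` output format may differ.

```python
def embed_overlap(content, metadata, separator="\n"):
    """Join previous overlap, chunk content, and next overlap, skipping empty parts."""
    parts = [
        metadata.get("previous_content", ""),
        content,
        metadata.get("next_content", ""),
    ]
    return separator.join(part for part in parts if part)


metadata = {"previous_content": "end of previous chunk", "next_content": "start of next chunk"}
print(embed_overlap("chunk body", metadata))
```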

# Children cover their own sections
for child in result.get_children(result.root_id):
    print(f"Child: lines {child.start_line}-{child.end_line}")
```
## Integrations

- [Dify](docs/integrations/dify.md)
- [n8n](docs/integrations/n8n.md)
- [Windmill](docs/integrations/windmill.md)

## Documentation

- [Overview](docs/overview.md)
- [Quick Start](docs/quickstart.md)
- [Configuration](docs/config.md)
- [Strategies](docs/strategies.md)
- [Renderers](docs/renderers.md)
- [Debug Mode](docs/debug_mode.md)
- [Migration Guide](MIGRATION_GUIDE.md)

## FAQ

**Q: What makes Chunkana different from a basic Markdown splitter?**

Chunkana is a **semantic Markdown chunker** that keeps structure intact (headers, lists, code blocks, tables, LaTeX) and enriches each chunk with retrieval metadata. This yields more accurate search and RAG results than naive line-based splitting.

**Q: Does Chunkana work for RAG and LLM ingestion?**

Yes. Chunkana is optimized for **RAG chunking**, **LLM context window preparation**, and **semantic Markdown segmentation**. It provides overlap metadata and consistent hierarchy paths for retrieval pipelines.

## License

MIT
42 changes: 26 additions & 16 deletions docs/config.md
@@ -2,7 +2,7 @@

Chunkana uses `ChunkerConfig` (alias: `ChunkConfig`) to control chunking behavior.

## Basic Parameters
## Basic parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
@@ -12,17 +12,17 @@ Chunkana uses `ChunkerConfig` (alias: `ChunkConfig`) to control chunking behavio
| `preserve_atomic_blocks` | bool | True | Keep code blocks, tables, LaTeX intact |
| `extract_preamble` | bool | True | Extract content before first header as preamble |

## Strategy Selection Thresholds
## Strategy selection thresholds

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `code_threshold` | float | 0.3 | Code ratio threshold for CodeAware strategy |
| `structure_threshold` | int | 3 | Minimum headers for Structural strategy |
| `list_ratio_threshold` | float | 0.4 | List content ratio for ListAware strategy |
| `list_count_threshold` | int | 5 | Minimum lists for ListAware strategy |
| `strategy_override` | str\|None | None | Force specific strategy: "code_aware", "list_aware", "structural", "fallback" |
| `strategy_override` | str\|None | None | Force strategy: "code_aware", "list_aware", "structural", "fallback" |

## Code-Context Binding
## Code-context binding

These parameters control how code blocks are bound to surrounding explanations:

@@ -35,7 +35,7 @@ These parameters control how code blocks are bound to surrounding explanations:
| `bind_output_blocks` | bool | True | Bind code with its output blocks |
| `preserve_before_after_pairs` | bool | True | Keep before/after code pairs together |

## Adaptive Sizing
## Adaptive sizing

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
@@ -55,7 +55,7 @@ adaptive_config = AdaptiveSizeConfig(
)
```

## Table Grouping
## Table grouping

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
@@ -73,7 +73,7 @@ table_config = TableGroupingConfig(
)
```

## Overlap Behavior
## Overlap behavior

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
@@ -82,21 +82,21 @@ table_config = TableGroupingConfig(

The overlap is stored in metadata (`previous_content`, `next_content`), not embedded in `chunk.content`.

## LaTeX Handling
## LaTeX handling

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `preserve_latex_blocks` | bool | True | Keep LaTeX blocks intact |

When enabled, LaTeX blocks (`$$...$$`, `\[...\]`, `\begin{...}...\end{...}`) are treated as atomic units.

## Computed Fields
## Computed fields

| Field | Description |
|-------|-------------|
| `enable_overlap` | Computed as `overlap_size > 0` |

## Factory Methods
## Factory methods

```python
from chunkana import ChunkerConfig
@@ -120,9 +120,19 @@ config = ChunkerConfig.from_dict(config_dict)

Round-trip is guaranteed: `ChunkerConfig.from_dict(config.to_dict()) == config`

## Example Configurations
## Recommended presets

### Documentation Sites
### RAG pipelines

```python
config = ChunkerConfig(
    max_chunk_size=4096,
    min_chunk_size=512,
    overlap_size=200,
)
```

### Documentation sites

```python
config = ChunkerConfig(
@@ -133,7 +143,7 @@ config = ChunkerConfig(
)
```

### Code Repositories
### Code repositories

```python
config = ChunkerConfig(
@@ -145,7 +155,7 @@ config = ChunkerConfig(
)
```

### Changelogs / Release Notes
### Changelogs / release notes

```python
config = ChunkerConfig(
@@ -156,7 +166,7 @@ config = ChunkerConfig(
)
```

### Scientific Documents (LaTeX)
### Scientific documents (LaTeX)

```python
config = ChunkerConfig(
@@ -166,6 +176,6 @@ config = ChunkerConfig(
)
```

## Plugin Compatibility
## Plugin compatibility

All 17 fields from dify-markdown-chunker's `ChunkConfig` are supported. See [Parity Matrix](migration/parity_matrix.md) for details.