3 changes: 3 additions & 0 deletions README.md
@@ -340,6 +340,9 @@ Follow these guides to go from first chunk to production-ready ingestion.
Each link highlights a focused capability you can adopt incrementally.

- [Overview](docs/overview.md)
- [Documentation Index](docs/index.md)
- [Quick Start](docs/quickstart.md)
- [Configuration](docs/config.md)
- [Renderers](docs/renderers.md)
- [Metadata reference](docs/metadata.md)
- [Errors & troubleshooting](docs/errors.md)
53 changes: 53 additions & 0 deletions docs/errors.md
@@ -0,0 +1,53 @@
# Errors & troubleshooting

This page summarizes the main exception types raised by Chunkana and provides practical guidance for resolving them.

## ChunkanaError (base class)

All Chunkana-specific errors inherit from `ChunkanaError`. The exception includes a human-readable message plus a `context` dictionary with debugging details.

**Tips**
- Log or print `error.get_context()` to see structured debugging data.
- Use this base class if you want to catch all Chunkana-specific failures in one place.
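
A minimal sketch of the catch-all pattern described in the tips above. The `ChunkanaError` class here is a local stand-in so the snippet runs without the library installed. Only `get_context()` and the `context` dictionary come from this page. The constructor signature, `chunk_document`, and `safe_chunk` are assumptions made for illustration.

```python
import logging

logger = logging.getLogger("ingest")


class ChunkanaError(Exception):
    """Stand-in for the real base class (constructor signature is assumed)."""

    def __init__(self, message, context=None):
        super().__init__(message)
        self.context = dict(context or {})

    def get_context(self):
        return dict(self.context)


def chunk_document(text):
    """Hypothetical pipeline step that fails with a Chunkana-style error."""
    if not text.strip():
        raise ChunkanaError("empty document", context={"length": len(text)})
    return [text]


def safe_chunk(text):
    """Catch every Chunkana-specific failure in one place and log its context."""
    try:
        return chunk_document(text), None
    except ChunkanaError as error:
        context = error.get_context()
        logger.error("chunking failed: %s | context=%s", error, context)
        return [], context
```

In real code you would import the exception from the library instead of defining it, and the `except` clause would stay the same.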

## ValidationError

Raised when chunk validation fails (for example, invalid line ranges or missing required fields). Validation errors usually include the `error_type`, the `chunk_id` of the offending chunk (when available), and a `suggested_fix`.

**Common causes**
- `start_line` or `end_line` is invalid or inverted.
- Required fields are missing in serialized chunk data.

**Tips**
- Re-run chunking with a clean source document to ensure line ranges are consistent.
- If you are deserializing chunks, validate that `start_line` and `end_line` are present and numeric.
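
The deserialization check in the last tip could look roughly like the sketch below. `check_line_range` is a hypothetical helper, not part of Chunkana, and it assumes serialized chunks are plain dictionaries that use the `start_line` and `end_line` keys with 1-indexed values, as described in the metadata reference.

```python
from numbers import Integral


def check_line_range(chunk_data):
    """Return a list of problems found in a serialized chunk's line range."""
    problems = []
    for field in ("start_line", "end_line"):
        value = chunk_data.get(field)
        if value is None:
            problems.append(f"{field} is missing")
        elif isinstance(value, bool) or not isinstance(value, Integral):
            # bool is a subclass of int, so it is rejected explicitly.
            problems.append(f"{field} is not an integer: {value!r}")
        elif value < 1:
            problems.append(f"{field} must be >= 1 (line numbers are 1-indexed)")
    # Only compare the two values once both are known to be valid integers.
    if not problems and chunk_data["start_line"] > chunk_data["end_line"]:
        problems.append("start_line is greater than end_line (inverted range)")
    return problems
```

Running this before handing data back to the library surfaces the two common causes listed above without waiting for a `ValidationError`.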

## HierarchicalInvariantError

Raised when hierarchical chunk tree invariants are violated (for example, a child ID is missing from its parent's `children_ids`, or leaf flags are inconsistent). This exception is raised only when hierarchical validation is enabled (`validate_invariants=True`) and `strict_mode=True`.

**Common causes**
- Manual modification of `parent_id` / `children_ids` after chunking.
- Missing `chunk_id` values when building a tree.
- Inconsistent `is_leaf` flags vs. `children_ids`.

**Tips**
- Keep `validate_invariants=True` but set `strict_mode=False` during investigation to log warnings instead of raising.
- Prefer using helper APIs (like `get_children`, `get_parent`, `get_siblings`) rather than mutating hierarchy metadata directly.
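
While investigating, it can help to check the invariants yourself on exported metadata. The sketch below is a hypothetical helper, not the library's validator. It assumes you have a dictionary mapping each `chunk_id` to a metadata dictionary containing `parent_id`, `children_ids`, and `is_leaf`.

```python
def find_hierarchy_problems(metadata_by_id):
    """Check parent/child links and leaf flags in a {chunk_id: metadata} map."""
    problems = []
    for chunk_id, meta in metadata_by_id.items():
        children = meta.get("children_ids", [])
        # A leaf must have no children, and a non-leaf must have at least one.
        if meta.get("is_leaf", False) != (len(children) == 0):
            problems.append(f"{chunk_id}: is_leaf disagrees with children_ids")
        parent_id = meta.get("parent_id")
        if parent_id is not None:
            parent = metadata_by_id.get(parent_id)
            if parent is None:
                problems.append(f"{chunk_id}: parent {parent_id} does not exist")
            elif chunk_id not in parent.get("children_ids", []):
                problems.append(f"{chunk_id}: missing from children_ids of {parent_id}")
    return problems
```

An empty result means the three common causes above are ruled out for that tree. Any other result names the chunk to inspect first.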

## ConfigurationError

Raised when configuration values are invalid or incompatible. The error includes the parameter name, the provided value, and possible valid values.

**Common causes**
- `overlap_size` is negative or larger than `max_chunk_size`.
- Strategy overrides that do not match known strategy names.

**Tips**
- Validate configuration early (when instantiating `ChunkConfig`) to catch issues before chunking.
- Use `strategy_override` only with `code_aware`, `list_aware`, `structural`, or `fallback`.
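
The two common causes above can also be checked before any configuration object is built, for example when values come from user input. This is a sketch only: `precheck_config` is a hypothetical helper and does not reproduce whatever validation `ChunkConfig` performs internally. The parameter names and strategy names are the ones listed on this page.

```python
KNOWN_STRATEGIES = {"code_aware", "list_aware", "structural", "fallback"}


def precheck_config(max_chunk_size, overlap_size, strategy_override=None):
    """Return human-readable problems; an empty list means the values look sane."""
    problems = []
    if overlap_size < 0:
        problems.append("overlap_size must not be negative")
    elif overlap_size > max_chunk_size:
        problems.append("overlap_size must not exceed max_chunk_size")
    if strategy_override is not None and strategy_override not in KNOWN_STRATEGIES:
        problems.append(f"unknown strategy_override: {strategy_override!r}")
    return problems
```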

## Related docs

- [Debug mode](debug_mode.md)
- [Metadata reference](metadata.md)
34 changes: 34 additions & 0 deletions docs/index.md
@@ -0,0 +1,34 @@
# Documentation Index

Use this page as the main entry point to the Chunkana docs. Each link includes a quick summary so you can jump to the right guide fast.

## Getting started

- [Overview](overview.md) — what Chunkana is and the problems it solves.
- [Quick Start](quickstart.md) — minimal examples to chunk Markdown quickly.
- [Configuration](config.md) — all core config options and defaults.

## Concepts & behavior

- [Strategies](strategies.md) — how automatic strategy selection works and how to override it.
- [Renderers](renderers.md) — output formats (JSON, Dify-style metadata blocks, inline metadata).
- [Metadata reference](metadata.md) — `Chunk` fields and the meaning of key metadata properties.

## Debugging & errors

- [Debug mode](debug_mode.md) — metadata differences between standard vs hierarchical chunking.
- [Errors & troubleshooting](errors.md) — common exception types and how to resolve them.

## Integrations

- [Dify](integrations/dify.md) — renderer parity with Dify ingestion.
- [n8n](integrations/n8n.md) — automation-friendly pipeline setup.
- [Windmill](integrations/windmill.md) — batch workflows for document processing.

## Migration

- [Parity matrix](migration/parity_matrix.md) — mapping to dify-markdown-chunker behavior.

## Internal/roadmap

- [Documentation TODOs](TODO_DOCUMENTATION.md) — backlog of future documentation items.
56 changes: 56 additions & 0 deletions docs/metadata.md
@@ -0,0 +1,56 @@
# Metadata reference

This page explains the `Chunk` fields and the most important entries inside `chunk.metadata`.

## Chunk fields

Each chunk is a `Chunk` object with the following top-level fields:

- `content`: the Markdown content for the chunk (no overlap text is embedded).
- `start_line`: 1-indexed start line in the original document.
- `end_line`: 1-indexed end line in the original document.
- `size`: character length of `content`.
- `line_count`: number of lines in the chunk (`end_line - start_line + 1`).
- `metadata`: dictionary with retrieval and debugging metadata.

> **Line ranges can overlap.** When overlap is enabled, adjacent chunks can share line ranges because overlap context is stored in metadata. The line range always refers to the actual content inside `chunk.content`.
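
To make the 1-indexed, inclusive convention concrete, the sketch below slices a document by a line range and computes `line_count` with the formula above. Both functions are hypothetical helpers. Whether `chunk.content` matches such a slice character for character (for example, around trailing whitespace) is an assumption that this page does not confirm.

```python
def slice_lines(document, start_line, end_line):
    """Return the text covered by a 1-indexed, inclusive line range."""
    lines = document.splitlines()
    # start_line - 1 converts to a 0-based index; the end index is exclusive.
    return "\n".join(lines[start_line - 1:end_line])


def line_count(start_line, end_line):
    """Number of lines in an inclusive range, as defined for Chunk.line_count."""
    return end_line - start_line + 1
```

For a five-line document, `slice_lines(doc, 3, 4)` returns lines 3 and 4, and `line_count(3, 4)` is 2.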

## Metadata fields

### `header_path`

A path-like string showing where the chunk sits in the document hierarchy, e.g. `/Guides/Setup`. It is built from the nearest header stack and is stable across sub-chunks in the same section.
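
One plausible reading of "nearest header stack" is sketched below. It is an illustration, not Chunkana's implementation: `header_path_at` is a hypothetical function, it only recognizes `#`-style headers, and it does not skip headers that appear inside fenced code blocks.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def header_path_at(lines, line_number):
    """Return a /-joined path of the headers enclosing a 1-indexed line."""
    stack = []  # (level, title) pairs, outermost first
    for line in lines[:line_number]:
        match = HEADER.match(line)
        if not match:
            continue
        level = len(match.group(1))
        # A new header closes every open header at the same or deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, match.group(2)))
    return "/" + "/".join(title for _, title in stack)
```

Because the stack only changes when a header line is seen, every line between two headers maps to the same path, which matches the stability across sub-chunks described above.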

### `content_type`

A high-level category for the chunk content. Typical values include:

- `text` — regular prose.
- `section` — section content produced by the structural strategy.
- `code` — code blocks.
- `table` — Markdown tables.
- `mixed` — mixed content (text + blocks).
- `preamble` — content before the first header.
- `document` — root chunk in hierarchical mode.

### `strategy`

The strategy that produced the chunk (for example `structural`, `code_aware`, `list_aware`, or `fallback`). Use this to trace why a chunk looks the way it does and to compare strategy behavior.

### Overlap metadata

When overlap is enabled (`overlap_size > 0`), chunks carry their context windows in metadata rather than embedding the overlap text in `content`:

- `previous_content`: overlap extracted from the end of the previous chunk.
- `next_content`: overlap extracted from the start of the next chunk.
- `overlap_size`: the actual number of characters stored in the overlap window (capped by the configured overlap ratio).
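
If a downstream consumer wants the overlap text inline (for example, before embedding), it can be reassembled from these fields. The helper below is a hypothetical sketch. Joining the parts with a newline is an assumption, and either overlap field may be absent on the first or last chunk.

```python
def with_overlap_context(content, metadata):
    """Join previous_content, the chunk body, and next_content when present."""
    parts = [
        metadata.get("previous_content"),
        content,
        metadata.get("next_content"),
    ]
    # Skip missing or empty overlap windows instead of emitting blank lines.
    return "\n".join(part for part in parts if part)
```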

### `chunk_id`

A stable identifier added in hierarchical mode. It is used to build the chunk tree and to link parents/children/siblings. In hierarchical mode you will also see additional fields like `parent_id`, `children_ids`, `prev_sibling_id`, `next_sibling_id`, `is_leaf`, and `is_root`.

## Related docs

- [Overview](overview.md)
- [Debug mode](debug_mode.md)
- [Errors & troubleshooting](errors.md)
3 changes: 3 additions & 0 deletions docs/overview.md
@@ -34,8 +34,11 @@ Traditional splitters break Markdown structure and cause semantic drift. Chunkan

## Next steps

- [Documentation Index](index.md)
- [Quick Start](quickstart.md)
- [Configuration](config.md)
- [Strategies](strategies.md)
- [Renderers](renderers.md)
- [Integrations](integrations/dify.md)
- [Metadata reference](metadata.md)
- [Errors & troubleshooting](errors.md)