diff --git a/README.md b/README.md
index 4c82aa4..5183c1f 100644
--- a/README.md
+++ b/README.md
@@ -340,6 +340,9 @@
 Follow these guides to go from first chunk to production-ready ingestion. Each link highlights a focused capability you can adopt incrementally.
 
 - [Overview](docs/overview.md)
+- [Documentation Index](docs/index.md)
 - [Quick Start](docs/quickstart.md)
 - [Configuration](docs/config.md)
 - [Renderers](docs/renderers.md)
+- [Metadata reference](docs/metadata.md)
+- [Errors & troubleshooting](docs/errors.md)
diff --git a/docs/errors.md b/docs/errors.md
new file mode 100644
index 0000000..6f049d5
--- /dev/null
+++ b/docs/errors.md
@@ -0,0 +1,53 @@
+# Errors & troubleshooting
+
+This page summarizes the main exception types raised by Chunkana and provides practical guidance for resolving them.
+
+## ChunkanaError (base class)
+
+All Chunkana-specific errors inherit from `ChunkanaError`. The exception includes a human-readable message plus a `context` dictionary with debugging details.
+
+**Tips**
+- Log or print `error.get_context()` to see structured debugging data.
+- Use this base class if you want to catch all Chunkana-specific failures in one place.
+
+## ValidationError
+
+Raised when chunk validation fails (for example, invalid line ranges or missing required fields). Validation errors usually include the `error_type`, a problematic `chunk_id` (when available), and a `suggested_fix`.
+
+**Common causes**
+- `start_line` or `end_line` is invalid or inverted.
+- Required fields are missing in serialized chunk data.
+
+**Tips**
+- Re-run chunking with a clean source document to ensure line ranges are consistent.
+- If you are deserializing chunks, validate that `start_line` and `end_line` are present and numeric.
+
+## HierarchicalInvariantError
+
+Raised when hierarchical chunk tree invariants are violated (for example, children IDs are missing from parents or leaf flags are inconsistent). This exception is emitted when hierarchical validation is enabled and `strict_mode=True`.
+
+**Common causes**
+- Manual modification of `parent_id` / `children_ids` after chunking.
+- Missing `chunk_id` values when building a tree.
+- Inconsistent `is_leaf` flags vs. `children_ids`.
+
+**Tips**
+- Keep `validate_invariants=True` but set `strict_mode=False` during investigation to log warnings instead of raising.
+- Prefer using helper APIs (like `get_children`, `get_parent`, `get_siblings`) rather than mutating hierarchy metadata directly.
+
+## ConfigurationError
+
+Raised when configuration values are invalid or incompatible. The error includes the parameter name, the provided value, and possible valid values.
+
+**Common causes**
+- `overlap_size` is negative or larger than `max_chunk_size`.
+- Strategy overrides that do not match known strategy names.
+
+**Tips**
+- Validate configuration early (when instantiating `ChunkConfig`) to catch issues before chunking.
+- Use `strategy_override` only with `code_aware`, `list_aware`, `structural`, or `fallback`.
+
+## Related docs
+
+- [Debug mode](debug_mode.md)
+- [Metadata reference](metadata.md)
diff --git a/docs/index.md b/docs/index.md
new file mode 100644
index 0000000..63a24ca
--- /dev/null
+++ b/docs/index.md
@@ -0,0 +1,34 @@
+# Documentation Index
+
+Use this page as the main entry point to the Chunkana docs. Each link includes a quick summary so you can jump to the right guide fast.
+
+## Getting started
+
+- [Overview](overview.md) — what Chunkana is and the problems it solves.
+- [Quick Start](quickstart.md) — minimal examples to chunk Markdown quickly.
+- [Configuration](config.md) — all core config options and defaults.
+
+## Concepts & behavior
+
+- [Strategies](strategies.md) — how automatic strategy selection works and how to override it.
+- [Renderers](renderers.md) — output formats (JSON, Dify-style metadata blocks, inline metadata).
+- [Metadata reference](metadata.md) — `Chunk` fields and the meaning of key metadata properties.
+
+## Debugging & errors
+
+- [Debug mode](debug_mode.md) — metadata differences between standard vs hierarchical chunking.
+- [Errors & troubleshooting](errors.md) — common exception types and how to resolve them.
+
+## Integrations
+
+- [Dify](integrations/dify.md) — renderer parity with Dify ingestion.
+- [n8n](integrations/n8n.md) — automation-friendly pipeline setup.
+- [Windmill](integrations/windmill.md) — batch workflows for document processing.
+
+## Migration
+
+- [Parity matrix](migration/parity_matrix.md) — mapping to dify-markdown-chunker behavior.
+
+## Internal/roadmap
+
+- [Documentation TODOs](TODO_DOCUMENTATION.md) — backlog of future documentation items.
diff --git a/docs/metadata.md b/docs/metadata.md
new file mode 100644
index 0000000..d71f2a9
--- /dev/null
+++ b/docs/metadata.md
@@ -0,0 +1,56 @@
+# Metadata reference
+
+This page explains the `Chunk` fields and the most important entries inside `chunk.metadata`.
+
+## Chunk fields
+
+Each chunk is a `Chunk` object with the following top-level fields:
+
+- `content`: the Markdown content for the chunk (no overlap text is embedded).
+- `start_line`: 1-indexed start line in the original document.
+- `end_line`: 1-indexed end line in the original document.
+- `size`: character length of `content`.
+- `line_count`: number of lines in the chunk (`end_line - start_line + 1`).
+- `metadata`: dictionary with retrieval and debugging metadata.
+
+> **Line ranges can overlap.** When overlap is enabled, adjacent chunks can share line ranges because overlap context is stored in metadata. The line range always refers to the actual content inside `chunk.content`.
+
+## Metadata fields
+
+### `header_path`
+
+A path-like string showing where the chunk sits in the document hierarchy, e.g. `/Guides/Setup`. It is built from the nearest header stack and is stable across sub-chunks in the same section.
+
+### `content_type`
+
+A high-level category for the chunk content. Typical values include:
+
+- `text` — regular prose.
+- `section` — section content produced by the structural strategy.
+- `code` — code blocks.
+- `table` — Markdown tables.
+- `mixed` — mixed content (text + blocks).
+- `preamble` — content before the first header.
+- `document` — root chunk in hierarchical mode.
+
+### `strategy`
+
+The strategy that produced the chunk (for example `structural`, `code_aware`, `list_aware`, or `fallback`). Use this to trace why a chunk looks the way it does and to compare strategy behavior.
+
+### Overlap metadata
+
+When overlap is enabled (`overlap_size > 0`), chunks include context windows in metadata rather than embedding text directly:
+
+- `previous_content`: overlap extracted from the end of the previous chunk.
+- `next_content`: overlap extracted from the start of the next chunk.
+- `overlap_size`: the actual number of characters stored in the overlap window (capped by the configured overlap ratio).
+
+### `chunk_id`
+
+A stable identifier added in hierarchical mode. It is used to build the chunk tree and to link parents/children/siblings. In hierarchical mode you will also see additional fields like `parent_id`, `children_ids`, `prev_sibling_id`, `next_sibling_id`, `is_leaf`, and `is_root`.
+
+## Related docs
+
+- [Overview](overview.md)
+- [Debug mode](debug_mode.md)
+- [Errors & troubleshooting](errors.md)
diff --git a/docs/overview.md b/docs/overview.md
index 785c4cf..4e1d02d 100644
--- a/docs/overview.md
+++ b/docs/overview.md
@@ -34,8 +34,11 @@ Traditional splitters break Markdown structure and cause semantic drift. Chunkan
 
 ## Next steps
 
+- [Documentation Index](index.md)
 - [Quick Start](quickstart.md)
 - [Configuration](config.md)
 - [Strategies](strategies.md)
 - [Renderers](renderers.md)
 - [Integrations](integrations/dify.md)
+- [Metadata reference](metadata.md)
+- [Errors & troubleshooting](errors.md)
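
Note on the new `docs/errors.md` tips: the advice to catch the base class in one place and inspect `get_context()` could be illustrated roughly as below. This is only a sketch under stated assumptions. The classes are local stand-ins written for this example rather than imports from Chunkana, the constructor signature is assumed, and `line_count` and `describe_failure` are hypothetical helpers. The one rule taken from the diff is the `end_line - start_line + 1` formula for 1-indexed line ranges.

```python
class ChunkanaError(Exception):
    """Local stand-in for the base error described in docs/errors.md."""

    def __init__(self, message, context=None):
        # Assumed constructor: the real library's signature may differ.
        super().__init__(message)
        self.context = dict(context or {})

    def get_context(self):
        # Return a copy so callers cannot mutate the stored details.
        return dict(self.context)


class ValidationError(ChunkanaError):
    """Local stand-in for the validation failure type."""


def line_count(start_line, end_line):
    """Hypothetical helper: validate a 1-indexed range and return its length."""
    if start_line < 1 or end_line < start_line:
        raise ValidationError(
            "invalid or inverted line range",
            context={"start_line": start_line, "end_line": end_line},
        )
    return end_line - start_line + 1


def describe_failure(start_line, end_line):
    """Catch every stand-in Chunkana error in one place and report its context."""
    try:
        line_count(start_line, end_line)
    except ChunkanaError as error:
        return f"{type(error).__name__}: {error} {error.get_context()}"
    return "ok"


print(describe_failure(3, 7))
print(describe_failure(7, 3))
```

Catching the base class keeps the handler independent of which subclass was raised, while the context dictionary carries the values needed to reproduce the failure.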