Releases · kreuzberg-dev/kreuzberg · GitHub

22 Apr 10:51

Goldziher

v4.9.4 Latest

Fixed

Ruby gem build failure — add missing max_images_per_page field to ImageExtractionConfig initializer in Ruby binding (kreuzberg-rb), fixing compilation error E0063 on all platforms.
Node binding build failure on Linux — stop removing /usr/local/lib/node_modules in CI disk cleanup script; npm was being deleted before pnpm/action-setup could use it, causing spawn npm ENOENT.
Homebrew formula publish failure — grant contents: write permission to the publish-homebrew job so gh release upload can attach bottle artifacts (was contents: read).

Full Changelog: v4.9.3...v4.9.4

Assets 35

c-ffi-linux-aarch64.tar.gz
sha256:54b8ac9c0f14546a5dd9dc531eab37c8b5955b015e30d77abda4e61b9f8232c9
181 MB, 2026-04-22T12:23:07Z

c-ffi-linux-x86_64.tar.gz
sha256:808a7f466a006cdc325cbc36969eae73ea47c259961f98f0e6f07e5fb5ed9654
182 MB, 2026-04-22T12:23:13Z

c-ffi-macos-arm64.tar.gz
sha256:9035146ebfa24c3b54e22fb5a456d9c524de344029f2213718ff211170bb69c0
173 MB, 2026-04-22T12:23:20Z

c-ffi-windows-x86_64.tar.gz
sha256:b08a020d4484fd5385142e592b83453c85664eae530a28d8e7a85244df4509f8
159 MB, 2026-04-22T12:23:31Z

go-ffi-linux-aarch64.tar.gz
sha256:0914a98b9cd134605b9e2c497a52f2749eaccfd56958e93be4fdb9d7f52a4f8d
184 MB, 2026-04-22T12:33:33Z

go-ffi-linux-x86_64.tar.gz
sha256:43a7b0373127ac1f450eaf4dfca18d30796a09633d787211d2b53fd68922e99c
185 MB, 2026-04-22T12:33:40Z

go-ffi-macos-arm64.tar.gz
sha256:4ea202a151b55818ea0f63351f551c6c484baf20e60241859cdec4498414068b
175 MB, 2026-04-22T12:33:48Z

go-ffi-windows-x86_64.tar.gz
sha256:190110de33785afea831d5947c640522934bdd7c922b45d780c751d2d6fbd385
159 MB, 2026-04-22T12:33:56Z

kreuzberg-4.9.4-linux-arm64.tar.gz
sha256:f89848d9ed9c9b8d1ce45dde6b6b5b3a88a8690057b202ee4b8be44b63e7caf6
28.2 MB, 2026-04-22T12:08:38Z

kreuzberg-4.9.4-linux-arm64.tar.gz.sha256
sha256:31da3505dc58bf79f2b8f07b898583edbba6dfdda56a1727bc3e2932e33335c8
101 Bytes, 2026-04-22T12:08:40Z

Source code (zip)
2026-04-22T10:48:34Z

Source code (tar.gz)
2026-04-22T10:48:34Z

22 Apr 06:37

Goldziher

v4.9.3

See CHANGELOG.md for full details.

Assets 35

19 Apr 15:05

Goldziher

v4.9.2

Fixed

Fix cancellation token not checked in WASM (non-tokio) path for Excel, DOC, PPT, Pages, Keynote, and Numbers extractors — cancellation was silently ignored in WASM builds
Propagate Cancelled error code (9) to all bindings — Go, C FFI, Python, TypeScript, Java, C#, and C API docs now include the new code
Fix PHP e2e embed tests calling instance methods statically — use procedural \Kreuzberg\embed() functions
Fix TypeScript e2e embed tests using wrong field names (type/name → modelType/value) for embedding model config
Fix Elixir e2e embed tests calling non-existent embed_async/2 — use sync embed/2
Fix TypeScript e2e generator missing html_output config mapping for styled HTML tests
Fix ORT_DYLIB_PATH on Windows CI pointing to lib/ instead of the actual DLL location
Fix C# CI build conditional to require successful FFI build
Add libuv1-dev to Linux CI system dependencies for R package builds
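
The cancellation fix above concerns code paths that run without an async runtime, where nothing checks the token unless the extractor polls it explicitly. A minimal sketch of that pattern follows. `CancelToken`, `ExtractError`, and `extract_units` are hypothetical names for illustration, not Kreuzberg's actual types; only the numeric code 9 comes from the notes above.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical cancellation token backed by a shared atomic flag.
#[derive(Clone, Default)]
pub struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Hypothetical error type; only the code value 9 is taken from the release notes.
#[derive(Debug, PartialEq)]
pub enum ExtractError {
    Cancelled,
}

impl ExtractError {
    pub fn code(&self) -> u32 {
        match self {
            ExtractError::Cancelled => 9,
        }
    }
}

/// Synchronous loop that polls the token before each unit of work,
/// so cancellation is honored even without an async runtime.
pub fn extract_units(units: &[&str], token: &CancelToken) -> Result<Vec<String>, ExtractError> {
    let mut out = Vec::with_capacity(units.len());
    for unit in units {
        if token.is_cancelled() {
            return Err(ExtractError::Cancelled);
        }
        // Stand-in for real per-sheet or per-slide extraction work.
        out.push(unit.to_uppercase());
    }
    Ok(out)
}
```

Polling once per unit of work keeps the check cheap while bounding how long a cancelled extraction can keep running.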

Assets 35

19 Apr 04:59

Goldziher

v4.9.1

Fixed

#754: Preserve _internal_bindings.pyi type stub during wheel artifact cleanup — published wheels now include inline type information for the core binding module
Add missing Default impl for PyCancellationToken to satisfy clippy new_without_default lint
Improve download resilience for eng.traineddata in build script — increase retries from 3 to 5, add fallback URL via raw.githubusercontent.com, and increase timeout to 300s
Increase Task installer retry resilience in CI — 5 attempts with --retry-all-errors curl flag
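
For context on the `new_without_default` item above: clippy raises that lint when a public type has a public, argument-free `new()` but no `Default` implementation. A minimal sketch of the usual remedy follows, using a hypothetical `Token` type rather than the real `PyCancellationToken`.

```rust
/// Hypothetical type used only to illustrate the lint; not the real binding type.
pub struct Token {
    cancelled: bool,
}

impl Token {
    /// A public no-argument constructor is what triggers `new_without_default`.
    pub fn new() -> Self {
        Token { cancelled: false }
    }

    pub fn is_cancelled(&self) -> bool {
        self.cancelled
    }
}

/// Delegating to `new()` keeps the two construction paths in sync.
impl Default for Token {
    fn default() -> Self {
        Self::new()
    }
}
```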

Assets 35

18 Apr 12:18

Goldziher

v4.9.0

What's Changed

Fix duplicated heading in markdown chunker with prepend_heading_context by @tobocop2 in #701
chore(deps): bump pnpm/action-setup from 5 to 6 by @dependabot[bot] in #698
chore(deps): bump actions/upload-pages-artifact from 4 to 5 by @dependabot[bot] in #711
fix: remove duplicate output_format key and fix numeric types in OCR metadata by @kh3rld in #712
fix: LLM embedding provider panics in server mode by @Goldziher in #714
fix: honor ocr.enabled=false config and drop trailing newline in --format text by @kh3rld in #715
fix: assign correct page numbers to DOCX tables based on document position by @kh3rld in #718
fix: call enableOcr() in live demo and throw on missing Rust registry export by @kh3rld in #720
fix: clean up stale hf-hub lock files before embedding model downloads by @kh3rld in #721
fix: derive Default for LlmConfig to support struct-update syntax by @kh3rld in #727
chore(deps-dev): bump pypdf from 6.10.0 to 6.10.1 in the uv group across 1 directory by @dependabot[bot] in #724
fix: force include_elements=true in image OCR so pages[] is populated by @kh3rld in #723
docs: correct Cargo feature name from llm to liter-llm by @kh3rld in #729
fix: prevent doubled OCR content and page mismatch in image extractor by @kh3rld in #726
docs: theme-colored language icons, footer nav, and card formatting by @v-tan in #731
docs: dark-mode style fixes and content corrections by @v-tan in #734
fix: expose structured_output on Ruby Result by @kh3rld in #736
chore(deps-dev): bump pypdf from 6.10.1 to 6.10.2 in the uv group across 1 directory by @dependabot[bot] in #735
chore(deps-dev): update markitdown requirement from >=0.1.4 to >=0.1.5 by @dependabot[bot] in #742
chore(deps-dev): update maturin requirement from >=1.10.2 to >=1.13.1 by @dependabot[bot] in #743
chore(deps-dev): update requirements for rbs and steep in /packages/ruby by @dependabot[bot] in #744
chore(deps-dev): update prek requirement from >=0.2.21 to >=0.3.9 by @dependabot[bot] in #745
chore(deps-dev): update pymupdf4llm requirement from >=0.0.17 to >=1.27.2.2 by @dependabot[bot] in #746
chore(deps-dev): update pdftotext requirement from >=2.2.2 to >=3.0.0 by @dependabot[bot] in #747
feat: Add smart document chunking that splits by topic by @tobocop2 in #733
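
The `LlmConfig` change in #727 above refers to Rust's struct-update syntax, which only works with `..Default::default()` when the type implements `Default`. A minimal sketch follows, assuming a hypothetical `LlmSettings` struct whose fields are invented for illustration and do not reflect the real `LlmConfig`.

```rust
/// Hypothetical config struct; the field names are assumptions, not LlmConfig's.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct LlmSettings {
    pub model: String,
    pub temperature: Option<f32>,
    pub max_tokens: Option<u32>,
}

/// Sets one field and fills the rest from the derived Default.
pub fn settings_for(model: &str) -> LlmSettings {
    LlmSettings {
        model: model.to_string(),
        ..Default::default()
    }
}
```

Without the derived `Default`, callers would have to spell out every field, and adding a field later would break each call site.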

Full Changelog: v4.8.4...v4.9.0

Contributors

tobocop2, v-tan, and 3 other contributors

Assets 30

17 Apr 17:26

Goldziher

v4.8.6

What's Changed

Fix duplicated heading in markdown chunker with prepend_heading_context by @tobocop2 in #701
chore(deps): bump pnpm/action-setup from 5 to 6 by @dependabot[bot] in #698
chore(deps): bump actions/upload-pages-artifact from 4 to 5 by @dependabot[bot] in #711
fix: remove duplicate output_format key and fix numeric types in OCR metadata by @kh3rld in #712
fix: LLM embedding provider panics in server mode by @Goldziher in #714
fix: honor ocr.enabled=false config and drop trailing newline in --format text by @kh3rld in #715
fix: assign correct page numbers to DOCX tables based on document position by @kh3rld in #718
fix: call enableOcr() in live demo and throw on missing Rust registry export by @kh3rld in #720
fix: clean up stale hf-hub lock files before embedding model downloads by @kh3rld in #721
fix: derive Default for LlmConfig to support struct-update syntax by @kh3rld in #727
chore(deps-dev): bump pypdf from 6.10.0 to 6.10.1 in the uv group across 1 directory by @dependabot[bot] in #724
fix: force include_elements=true in image OCR so pages[] is populated by @kh3rld in #723
docs: correct Cargo feature name from llm to liter-llm by @kh3rld in #729
fix: prevent doubled OCR content and page mismatch in image extractor by @kh3rld in #726
docs: theme-colored language icons, footer nav, and card formatting by @v-tan in #731
docs: dark-mode style fixes and content corrections by @v-tan in #734
fix: expose structured_output on Ruby Result by @kh3rld in #736

Full Changelog: v4.8.4...v4.8.6

Contributors

tobocop2, v-tan, and 3 other contributors

Assets 35

14 Apr 11:51

Goldziher

v4.8.5

What's Changed

Added

LLM usage tracking — new llm_usage field on ExtractionResult captures token counts, estimated cost (USD), model identifier, and finish reason for every LLM call (VLM OCR, structured extraction, LLM embeddings). Exposed across all 12 bindings.

Fixed

Markdown chunker heading duplication when prepend_heading_context is enabled (#701)
Helm chart icon 404 on Artifact Hub — .png → .svg
Python wheel manylinux compliance — bumped to manylinux_2_39
FFI memory leaks — djot_content_json, structured_output_json, llm_usage_json not freed
R e2e embed tests — missing type discriminator in generated config
Elixir parity test — ExtractionConfig missing html_output field
Go LLM e2e tests — EmbeddingModelType missing LLM config support
WASM tree-sitter build — removed stale wasm feature gate for tslp 1.6.0
Ruby binding compilation — magnus type inference errors and missing llm_usage field
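
The FFI memory-leak item above is the kind of bug that appears when a result struct gains a new heap-allocated string field but the matching free function is not updated. A minimal sketch of the ownership pattern follows. `FfiResult`, `make_result`, and `free_result` are hypothetical names, and the struct is far smaller than any real binding type.

```rust
use std::ffi::{c_char, CString};
use std::ptr;

/// Hypothetical C-compatible result; each pointer is owned by the Rust side.
#[repr(C)]
pub struct FfiResult {
    pub content: *mut c_char,
    pub llm_usage_json: *mut c_char,
}

/// Converts an optional string into an owned C string pointer, or null.
fn into_raw_or_null(s: Option<&str>) -> *mut c_char {
    s.and_then(|v| CString::new(v).ok())
        .map_or(ptr::null_mut(), CString::into_raw)
}

pub fn make_result(content: &str, llm_usage_json: Option<&str>) -> FfiResult {
    FfiResult {
        content: into_raw_or_null(Some(content)),
        llm_usage_json: into_raw_or_null(llm_usage_json),
    }
}

/// Safety: `*slot` must be null or a pointer obtained from `CString::into_raw`.
unsafe fn free_c_string(slot: &mut *mut c_char) {
    if !(*slot).is_null() {
        // Reclaim ownership so the allocation is released on drop.
        drop(unsafe { CString::from_raw(*slot) });
        // Null the slot so a second call is a no-op rather than a double free.
        *slot = ptr::null_mut();
    }
}

/// Frees every string field; a field missing from this list would leak.
///
/// Safety: `result` must have been produced by `make_result`.
pub unsafe fn free_result(result: &mut FfiResult) {
    unsafe {
        free_c_string(&mut result.content);
        free_c_string(&mut result.llm_usage_json);
    }
}
```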

Full Changelog: v4.8.4...v4.8.5

Assets 35

13 Apr 14:33

Goldziher

v4.8.4

What's Changed

Added

Helm chart for Kubernetes deployment — minimal, security-hardened Helm chart with Deployment, Service, Ingress, PVC, HPA, PDB, and ServiceAccount templates. Publishes to GHCR as an OCI artifact. (#695)
Helm lint and kubeconform pre-commit hooks — added helm lint --strict and kubeconform (k8s 1.28.0 schema validation) to pre-commit and CI pipeline.
Helm chart publish workflow — new publish-helm.yaml GitHub Actions workflow pushes versioned chart to oci://ghcr.io/kreuzberg-dev/charts.

Fixed

Helm chart: init container cannot chown as non-root — added securityContext.runAsUser: 0 to the init container.
Helm chart: unpinned busybox image tags — pinned to busybox:1.37-glibc for reproducibility.
Comrak bridge panics on multi-byte UTF-8 boundaries — annotation byte offsets landing inside multi-byte characters caused panics in build_inlines(). Snaps offsets to valid char boundaries. (#696)
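
The char-boundary fix above can be illustrated with the standard library alone: slicing a `&str` at a byte offset inside a multi-byte character panics, so the offset is first moved back to the nearest valid boundary. This is a sketch of the general technique, not the actual code in `build_inlines()`; both function names are hypothetical.

```rust
/// Moves `offset` back to the nearest char boundary at or before it.
/// Offsets past the end are clamped to the string length.
pub fn snap_to_char_boundary(s: &str, mut offset: usize) -> usize {
    if offset >= s.len() {
        return s.len();
    }
    // Offset 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(offset) {
        offset -= 1;
    }
    offset
}

/// Slices without panicking, even when the offsets land mid-character.
pub fn safe_slice(s: &str, start: usize, end: usize) -> &str {
    let start = snap_to_char_boundary(s, start);
    let end = snap_to_char_boundary(s, end).max(start);
    &s[start..end]
}
```

For the string `"a…b"`, the ellipsis occupies bytes 1 through 3, so offsets 2 and 3 both snap back to 1.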

Install via Helm

helm install kreuzberg oci://ghcr.io/kreuzberg-dev/charts/kreuzberg --version 4.8.4

Full Changelog: v4.8.3...v4.8.4

Assets 35

10 Apr 08:55

Goldziher

v4.8.2

Added

HtmlOutputConfig typed in all bindings — html_output config field (themes, CSS classes, embed CSS, custom CSS, class prefix) now fully typed in Python, TypeScript/Node, Go, Ruby, Elixir, PHP, Java, C#, R, and FFI. Previously only available in Rust core.

Fixed

PDF: legitimate repeated content stripped during page merging regardless of strip_repeating_text flag. deduplicate_paragraphs() previously ran unconditionally, stripping brand names and other legitimately repeated content even when ContentFilterConfig.strip_repeating_text was false. Both deduplication passes are now gated behind the flag (#670, #681)
R package build failure — R binding Cargo.toml version was stuck at 4.6.3 while core was at 4.8.1, causing tokio version resolution failure. Version sync script now includes the R native extension Cargo.toml.
CI: PyPI publish action failure — pinned pypa/gh-action-pypi-publish to v1.13.0 (v1.14.0 has broken Docker image on GHCR)
E2E: Elixir generator emitted undefined is_nan/1 function — added helper function definition to the generated Elixir test helpers
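
The PDF deduplication fix above amounts to making a destructive pass opt-in. A minimal sketch of that gating follows, assuming a hypothetical `ContentFilter` struct and `merge_paragraphs` function. Only the flag name `strip_repeating_text` comes from the notes, and the real deduplication logic is certainly more involved than exact-match filtering.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the content filter configuration.
pub struct ContentFilter {
    pub strip_repeating_text: bool,
}

/// Drops exact duplicate paragraphs only when the flag is set;
/// otherwise the input is returned untouched.
pub fn merge_paragraphs(paragraphs: Vec<String>, filter: &ContentFilter) -> Vec<String> {
    if !filter.strip_repeating_text {
        return paragraphs;
    }
    let mut seen = HashSet::new();
    paragraphs
        .into_iter()
        // `insert` returns false for values already seen, which filters repeats.
        .filter(|p| seen.insert(p.clone()))
        .collect()
}
```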

Assets 28

09 Apr 12:12

Goldziher

v4.8.1

Added

Styled HTML output — New HtmlOutputConfig on ExtractionConfig with 5 built-in themes (default, github, dark, light, unstyled), semantic kb-* CSS class hooks on every structural element, CSS custom properties (--kb-*), custom CSS injection (inline or file), and configurable class prefix. The existing Html output format is upgraded in-place when html_output is set (#633, #665)
5 new CLI flags: --html-theme, --html-css, --html-css-file, --html-class-prefix, --html-no-embed-css — any flag implicitly sets --content-format html
HtmlOutputConfig and HtmlTheme types exposed in Rust public API

Changed

Vendored yake-rust 1.0.3 into kreuzberg core, removing external dependency
- Fixes #676: BacktrackLimitExceeded panic on large files (10+ MB) by replacing regex-based sentence splitting with memchr-based approach
- Expanded YAKE stopwords from 34 to 64 languages using kreuzberg's unified stopwords module
- Removed 6 transitive dependencies (yake-rust, segtok, fancy-regex, streaming-stats, hashbrown, levenshtein)
Styled HTML renderer included in the html feature (no separate html-styled feature gate)
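
The #676 item above replaces regex-based sentence splitting with a byte-scanning approach. The sketch below shows the general idea using only the standard library: a plain byte loop stands in for the `memchr` crate, and the boundary rule (terminator followed by whitespace or end of input) is an assumption, not the vendored YAKE logic.

```rust
/// Splits text into sentences by scanning bytes for `.`, `!`, or `?`.
/// A terminator run counts as a boundary only when followed by
/// whitespace or the end of the input, so "4.9" is not split.
pub fn split_sentences(text: &str) -> Vec<&str> {
    let bytes = text.as_bytes();
    let mut out = Vec::new();
    let mut start = 0;
    let mut i = 0;
    while i < bytes.len() {
        if matches!(bytes[i], b'.' | b'!' | b'?') {
            // Swallow consecutive terminators such as "?!" or "...".
            let mut end = i + 1;
            while end < bytes.len() && matches!(bytes[end], b'.' | b'!' | b'?') {
                end += 1;
            }
            if end == bytes.len() || bytes[end].is_ascii_whitespace() {
                // `end` follows an ASCII byte, so it is a valid char boundary.
                let sentence = text[start..end].trim();
                if !sentence.is_empty() {
                    out.push(sentence);
                }
                start = end;
            }
            i = end;
        } else {
            i += 1;
        }
    }
    let tail = text[start..].trim();
    if !tail.is_empty() {
        out.push(tail);
    }
    out
}
```

A linear scan like this has no backtracking state, which is why it cannot hit a backtrack limit regardless of input size.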

Fixed

PPTX: panic on non-char-boundary during page boundary recomputation — byte offsets could land inside multi-byte UTF-8 characters (e.g. … U+2026), causing a panic when slicing content (#674)
PDF: include_headers / include_footers flags ignored by layout-model furniture stripping — when a layout-detection model classified paragraphs as PageHeader or PageFooter, they were unconditionally stripped as furniture regardless of ContentFilterConfig flag values. Setting strip_repeating_text=false with include_headers=true now correctly preserves those regions (#670)
PDF: heuristic table detector misclassifies body text as tables on slide-like PDFs — PowerPoint-exported PDFs with column-like text gaps produced false-positive 2–3 row "tables" whose bounding boxes covered the entire page, suppressing all body text from the structured extraction pipeline. Tables with ≤3 rows spanning >50% of the page height are now rejected as false positives
PPTX: ImageExtractionConfig.inject_placeholders silently ignored — setting inject_placeholders=false now correctly suppresses ![alt](target) image references in PPTX markdown output (#671, #677)
DOCX/HTML/DocBook/LaTeX/RST: inject_placeholders config ignored — all extractors now honour ImageExtractionConfig.inject_placeholders to suppress image reference injection when set to false
PPTX public API cleanup — extract_pptx_from_path and extract_pptx_from_bytes now accept &PptxExtractionOptions instead of 6 positional parameters
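
The table false-positive rule above (≤3 rows spanning >50% of the page height) can be expressed as a small predicate. The sketch below assumes a hypothetical `TableCandidate` struct with vertical extents in page units; the real detector's types and any additional conditions are not shown in these notes.

```rust
/// Hypothetical table candidate; `top` and `bottom` are in the same units as the page height.
pub struct TableCandidate {
    pub rows: usize,
    pub top: f32,
    pub bottom: f32,
}

/// Returns true when a candidate has at most 3 rows yet covers more
/// than half of the page height, the pattern described as a false positive.
pub fn is_false_positive(table: &TableCandidate, page_height: f32) -> bool {
    if page_height <= 0.0 {
        // Without a usable page height the ratio is meaningless; keep the table.
        return false;
    }
    let height = (table.bottom - table.top).abs();
    table.rows <= 3 && height / page_height > 0.5
}
```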

Assets 35