Releases: kreuzberg-dev/kreuzberg

v4.9.4

22 Apr 10:51
2ae9580

Fixed

  • Ruby gem build failure — add missing max_images_per_page field to ImageExtractionConfig initializer in Ruby binding (kreuzberg-rb), fixing compilation error E0063 on all platforms.
  • Node binding build failure on Linux — stop removing /usr/local/lib/node_modules in CI disk cleanup script; npm was being deleted before pnpm/action-setup could use it, causing spawn npm ENOENT.
  • Homebrew formula publish failure — grant contents: write permission to the publish-homebrew job so gh release upload can attach bottle artifacts (was contents: read).

Full Changelog: v4.9.3...v4.9.4

v4.9.3

22 Apr 06:37
373aabc

See CHANGELOG.md for full details.

v4.9.2

19 Apr 15:05
a3e8609

Fixed

  • Fix cancellation token not checked in WASM (non-tokio) path for Excel, DOC, PPT, Pages, Keynote, and Numbers extractors — cancellation was silently ignored in WASM builds
  • Propagate Cancelled error code (9) to all bindings — Go, C FFI, Python, TypeScript, Java, C#, and C API docs now include the new code
  • Fix PHP e2e embed tests calling instance methods statically — use procedural \Kreuzberg\embed() functions
  • Fix TypeScript e2e embed tests using wrong field names (type/name → modelType/value) for embedding model config
  • Fix Elixir e2e embed tests calling non-existent embed_async/2 — use sync embed/2
  • Fix TypeScript e2e generator missing html_output config mapping for styled HTML tests
  • Fix ORT_DYLIB_PATH on Windows CI pointing to lib/ instead of the actual DLL location
  • Fix C# CI build conditional to require successful FFI build
  • Add libuv1-dev to Linux CI system dependencies for R package builds
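
The cancellation fix above amounts to polling the token inside a synchronous extraction loop. A minimal sketch of that pattern follows, assuming a hypothetical CancellationToken backed by an atomic flag and a hypothetical extract_sheets function. Only the Cancelled error code (9) comes from the notes above; none of these names are Kreuzberg's actual API.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical token type for illustration; not Kreuzberg's actual type.
#[derive(Clone, Default)]
pub struct CancellationToken(Arc<AtomicBool>);

impl CancellationToken {
    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

#[derive(Debug, PartialEq)]
pub enum ExtractError {
    Cancelled,
}

impl ExtractError {
    /// Numeric code surfaced to bindings; 9 is the Cancelled code named in the notes.
    pub fn code(&self) -> u32 {
        match self {
            ExtractError::Cancelled => 9,
        }
    }
}

/// Synchronous (non-tokio) loop that checks the token before each unit of work.
/// The "work" here (uppercasing a sheet name) is a placeholder.
pub fn extract_sheets(
    sheets: &[&str],
    token: &CancellationToken,
) -> Result<Vec<String>, ExtractError> {
    let mut out = Vec::with_capacity(sheets.len());
    for sheet in sheets {
        if token.is_cancelled() {
            return Err(ExtractError::Cancelled);
        }
        out.push(sheet.to_uppercase());
    }
    Ok(out)
}
```

Checking once per unit of work keeps the cost to one atomic load per iteration and does not depend on an async runtime, which is the constraint in a WASM build.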

v4.9.1

19 Apr 04:59
3fe5736

Fixed

  • #754: Preserve _internal_bindings.pyi type stub during wheel artifact cleanup — published wheels now include inline type information for the core binding module
  • Add missing Default impl for PyCancellationToken to satisfy clippy new_without_default lint
  • Improve download resilience for eng.traineddata in build script — increase retries from 3 to 5, add fallback URL via raw.githubusercontent.com, and increase timeout to 300s
  • Increase Task installer retry resilience in CI — 5 attempts with --retry-all-errors curl flag
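
The eng.traineddata change above combines bounded retries with a fallback URL. One way to structure that is sketched below, assuming a hypothetical download_with_fallback helper that exhausts its attempts on each URL before moving to the next. The real build script may order its attempts differently, and the 300s timeout would live inside the injected fetch closure.

```rust
/// Hypothetical helper: try each URL in order, up to `max_attempts` times each,
/// and return the first successful payload.
/// `fetch` is injected so the sketch stays independent of any HTTP client.
pub fn download_with_fallback<F>(
    urls: &[&str],
    max_attempts: u32,
    mut fetch: F,
) -> Result<Vec<u8>, String>
where
    F: FnMut(&str) -> Result<Vec<u8>, String>,
{
    let mut last_err = String::from("no URLs provided");
    for url in urls {
        for attempt in 1..=max_attempts {
            match fetch(url) {
                Ok(bytes) => return Ok(bytes),
                // Remember the most recent failure so the caller sees why everything failed.
                Err(e) => last_err = format!("{url} attempt {attempt}/{max_attempts}: {e}"),
            }
        }
    }
    Err(last_err)
}
```

Passing the primary URL first and the raw.githubusercontent.com mirror second, with max_attempts set to 5, would mirror the numbers in the note.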

v4.9.0

18 Apr 12:18
e0f6db5

What's Changed

  • Fix duplicated heading in markdown chunker with prepend_heading_context by @tobocop2 in #701
  • chore(deps): bump pnpm/action-setup from 5 to 6 by @dependabot[bot] in #698
  • chore(deps): bump actions/upload-pages-artifact from 4 to 5 by @dependabot[bot] in #711
  • fix: remove duplicate output_format key and fix numeric types in OCR metadata by @kh3rld in #712
  • fix: LLM embedding provider panics in server mode by @Goldziher in #714
  • fix: honor ocr.enabled=false config and drop trailing newline in --format text by @kh3rld in #715
  • fix: assign correct page numbers to DOCX tables based on document position by @kh3rld in #718
  • fix: call enableOcr() in live demo and throw on missing Rust registry export by @kh3rld in #720
  • fix: clean up stale hf-hub lock files before embedding model downloads by @kh3rld in #721
  • fix: derive Default for LlmConfig to support struct-update syntax by @kh3rld in #727
  • chore(deps-dev): bump pypdf from 6.10.0 to 6.10.1 in the uv group across 1 directory by @dependabot[bot] in #724
  • fix: force include_elements=true in image OCR so pages[] is populated by @kh3rld in #723
  • docs: correct Cargo feature name from llm to liter-llm by @kh3rld in #729
  • fix: prevent doubled OCR content and page mismatch in image extractor by @kh3rld in #726
  • docs: theme-colored language icons, footer nav, and card formatting by @v-tan in #731
  • docs: dark-mode style fixes and content corrections by @v-tan in #734
  • fix: expose structured_output on Ruby Result by @kh3rld in #736
  • chore(deps-dev): bump pypdf from 6.10.1 to 6.10.2 in the uv group across 1 directory by @dependabot[bot] in #735
  • chore(deps-dev): update markitdown requirement from >=0.1.4 to >=0.1.5 by @dependabot[bot] in #742
  • chore(deps-dev): update maturin requirement from >=1.10.2 to >=1.13.1 by @dependabot[bot] in #743
  • chore(deps-dev): update requirements for rbs and steep in /packages/ruby by @dependabot[bot] in #744
  • chore(deps-dev): update prek requirement from >=0.2.21 to >=0.3.9 by @dependabot[bot] in #745
  • chore(deps-dev): update pymupdf4llm requirement from >=0.0.17 to >=1.27.2.2 by @dependabot[bot] in #746
  • chore(deps-dev): update pdftotext requirement from >=2.2.2 to >=3.0.0 by @dependabot[bot] in #747
  • feat: Add smart document chunking that splits by topic by @tobocop2 in #733
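
For context on #727 in the list above: deriving Default is what allows callers to set a few fields and fill the rest with struct-update syntax. A minimal sketch follows, using a hypothetical stand-in for LlmConfig. The field names are illustrative assumptions, not the real struct's fields.

```rust
/// Hypothetical stand-in for LlmConfig; the fields are illustrative only.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct LlmConfig {
    pub model: String,
    pub temperature: Option<f32>,
    pub max_tokens: Option<u32>,
}

/// Set only the model and take every other field from Default.
pub fn config_for(model: &str) -> LlmConfig {
    LlmConfig {
        model: model.to_string(),
        // Struct-update syntax: this line requires LlmConfig to implement Default.
        ..Default::default()
    }
}
```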

Full Changelog: v4.8.4...v4.9.0

v4.8.6

17 Apr 17:26
46641f5

What's Changed

  • Fix duplicated heading in markdown chunker with prepend_heading_context by @tobocop2 in #701
  • chore(deps): bump pnpm/action-setup from 5 to 6 by @dependabot[bot] in #698
  • chore(deps): bump actions/upload-pages-artifact from 4 to 5 by @dependabot[bot] in #711
  • fix: remove duplicate output_format key and fix numeric types in OCR metadata by @kh3rld in #712
  • fix: LLM embedding provider panics in server mode by @Goldziher in #714
  • fix: honor ocr.enabled=false config and drop trailing newline in --format text by @kh3rld in #715
  • fix: assign correct page numbers to DOCX tables based on document position by @kh3rld in #718
  • fix: call enableOcr() in live demo and throw on missing Rust registry export by @kh3rld in #720
  • fix: clean up stale hf-hub lock files before embedding model downloads by @kh3rld in #721
  • fix: derive Default for LlmConfig to support struct-update syntax by @kh3rld in #727
  • chore(deps-dev): bump pypdf from 6.10.0 to 6.10.1 in the uv group across 1 directory by @dependabot[bot] in #724
  • fix: force include_elements=true in image OCR so pages[] is populated by @kh3rld in #723
  • docs: correct Cargo feature name from llm to liter-llm by @kh3rld in #729
  • fix: prevent doubled OCR content and page mismatch in image extractor by @kh3rld in #726
  • docs: theme-colored language icons, footer nav, and card formatting by @v-tan in #731
  • docs: dark-mode style fixes and content corrections by @v-tan in #734
  • fix: expose structured_output on Ruby Result by @kh3rld in #736

Full Changelog: v4.8.4...v4.8.6

v4.8.5

14 Apr 11:51
c4d8f8f

What's Changed

Added

  • LLM usage tracking — new llm_usage field on ExtractionResult captures token counts, estimated cost (USD), model identifier, and finish reason for every LLM call (VLM OCR, structured extraction, LLM embeddings). Exposed across all 12 bindings.

Fixed

  • Markdown chunker heading duplication when prepend_heading_context is enabled (#701)
  • Helm chart icon 404 on Artifact Hub: .png → .svg
  • Python wheel manylinux compliance — bumped to manylinux_2_39
  • FFI memory leaks: djot_content_json, structured_output_json, and llm_usage_json were not freed
  • R e2e embed tests — missing type discriminator in generated config
  • Elixir parity test: ExtractionConfig missing html_output field
  • Go LLM e2e tests: EmbeddingModelType missing LLM config support
  • WASM tree-sitter build — removed stale wasm feature gate for tslp 1.6.0
  • Ruby binding compilation — magnus type inference errors and missing llm_usage field

Full Changelog: v4.8.4...v4.8.5

v4.8.4

13 Apr 14:33
e44f4d8

What's Changed

Added

  • Helm chart for Kubernetes deployment — minimal, security-hardened Helm chart with Deployment, Service, Ingress, PVC, HPA, PDB, and ServiceAccount templates. Publishes to GHCR as an OCI artifact. (#695)
  • Helm lint and kubeconform pre-commit hooks — added helm lint --strict and kubeconform (k8s 1.28.0 schema validation) to pre-commit and CI pipeline.
  • Helm chart publish workflow — new publish-helm.yaml GitHub Actions workflow pushes versioned chart to oci://ghcr.io/kreuzberg-dev/charts.

Fixed

  • Helm chart: init container cannot chown as non-root — added securityContext.runAsUser: 0 to the init container.
  • Helm chart: unpinned busybox image tags — pinned to busybox:1.37-glibc for reproducibility.
  • Comrak bridge panics on multi-byte UTF-8 boundaries — annotation byte offsets landing inside multi-byte characters caused panics in build_inlines(). Snaps offsets to valid char boundaries. (#696)
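
The Comrak fix above relies on the fact that slicing a Rust string at a byte offset inside a multi-byte character panics. A minimal sketch of snapping an offset to a valid boundary follows, assuming a hypothetical snap_to_char_boundary helper that rounds down. The release notes do not say which direction the actual fix rounds.

```rust
/// Hypothetical helper: move `offset` down to the nearest valid UTF-8
/// char boundary in `text`, clamping to the string length first.
pub fn snap_to_char_boundary(text: &str, offset: usize) -> usize {
    let mut idx = offset.min(text.len());
    // Index 0 is always a boundary, so this loop terminates.
    while !text.is_char_boundary(idx) {
        idx -= 1;
    }
    idx
}
```

With the offset snapped first, an expression such as &text[..snap_to_char_boundary(text, offset)] can no longer panic on a split character.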

Install via Helm

helm install kreuzberg oci://ghcr.io/kreuzberg-dev/charts/kreuzberg --version 4.8.4

Full Changelog: v4.8.3...v4.8.4

v4.8.2

10 Apr 08:55
6eec15b

Added

  • HtmlOutputConfig typed in all bindings: the html_output config field (themes, CSS classes, embed CSS, custom CSS, class prefix) is now fully typed in Python, TypeScript/Node, Go, Ruby, Elixir, PHP, Java, C#, R, and FFI. Previously it was only available in the Rust core.

Fixed

  • PDF: legitimate repeated content stripped during page merging regardless of the strip_repeating_text flag. deduplicate_paragraphs() ran unconditionally, stripping brand names and other legitimately repeated content even when ContentFilterConfig.strip_repeating_text was false. Both deduplication passes are now gated behind the flag (#670, #681)
  • R package build failure — R binding Cargo.toml version was stuck at 4.6.3 while core was at 4.8.1, causing tokio version resolution failure. Version sync script now includes the R native extension Cargo.toml.
  • CI: PyPI publish action failure — pinned pypa/gh-action-pypi-publish to v1.13.0 (v1.14.0 has broken Docker image on GHCR)
  • E2E: Elixir generator emitted undefined is_nan/1 function — added helper function definition to the generated Elixir test helpers

v4.8.1

09 Apr 12:12
7992470

Added

  • Styled HTML output — New HtmlOutputConfig on ExtractionConfig with 5 built-in themes (default, github, dark, light, unstyled), semantic kb-* CSS class hooks on every structural element, CSS custom properties (--kb-*), custom CSS injection (inline or file), and configurable class prefix. The existing Html output format is upgraded in-place when html_output is set (#633, #665)
  • 5 new CLI flags: --html-theme, --html-css, --html-css-file, --html-class-prefix, --html-no-embed-css — any flag implicitly sets --content-format html
  • HtmlOutputConfig and HtmlTheme types exposed in Rust public API

Changed

  • Vendored yake-rust 1.0.3 into kreuzberg core, removing external dependency
    • Fixes #676: BacktrackLimitExceeded panic on large files (10+ MB) by replacing regex-based sentence splitting with memchr-based approach
    • Expanded YAKE stopwords from 34 to 64 languages using kreuzberg's unified stopwords module
    • Removed 6 transitive dependencies (yake-rust, segtok, fancy-regex, streaming-stats, hashbrown, levenshtein)
  • Styled HTML renderer included in the html feature (no separate html-styled feature gate)
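
The #676 fix above swaps regex backtracking for a linear byte scan. A minimal sketch of that idea follows, assuming a hypothetical split_sentences function and a deliberately naive rule (split after '.', '!' or '?'). It uses only std as a stand-in for the memchr crate, and the vendored YAKE splitter is certainly more careful about abbreviations and similar cases.

```rust
/// Hypothetical helper: split `text` into sentences by scanning bytes for
/// '.', '!' or '?'. `memchr::memchr3` performs the same search faster; plain
/// std is used here to keep the sketch dependency-free.
pub fn split_sentences(text: &str) -> Vec<&str> {
    let bytes = text.as_bytes();
    let mut sentences = Vec::new();
    let mut start = 0;
    while start < bytes.len() {
        // Find the next terminator at or after `start`; every byte is visited once.
        let end = match bytes[start..]
            .iter()
            .position(|&b| matches!(b, b'.' | b'!' | b'?'))
        {
            Some(rel) => start + rel + 1, // include the terminator itself
            None => bytes.len(),
        };
        // Terminators are ASCII, so `end` always falls on a char boundary.
        let sentence = text[start..end].trim();
        if !sentence.is_empty() {
            sentences.push(sentence);
        }
        start = end;
    }
    sentences
}
```

Because the scan never backtracks, its cost grows linearly with input size, which is the property that matters for 10+ MB files.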

Fixed

  • PPTX: panic on non-char-boundary during page boundary recomputation — byte offsets could land inside multi-byte UTF-8 characters (e.g. U+2026), causing a panic when slicing content (#674)
  • PDF: include_headers / include_footers flags ignored by layout-model furniture stripping — when a layout-detection model classified paragraphs as PageHeader or PageFooter, they were unconditionally stripped as furniture regardless of ContentFilterConfig flag values. Setting strip_repeating_text=false with include_headers=true now correctly preserves those regions (#670)
  • PDF: heuristic table detector misclassifies body text as tables on slide-like PDFs — PowerPoint-exported PDFs with column-like text gaps produced false-positive 2–3 row "tables" whose bounding boxes covered the entire page, suppressing all body text from the structured extraction pipeline. Tables with ≤3 rows spanning >50% of the page height are now rejected as false positives
  • PPTX: ImageExtractionConfig.inject_placeholders silently ignored — setting inject_placeholders=false now correctly suppresses ![alt](target) image references in PPTX markdown output (#671, #677)
  • DOCX/HTML/DocBook/LaTeX/RST: inject_placeholders config ignored — all extractors now honour ImageExtractionConfig.inject_placeholders to suppress image reference injection when set to false
  • PPTX public API cleanup: extract_pptx_from_path and extract_pptx_from_bytes now accept &PptxExtractionOptions instead of 6 positional parameters
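
The table false-positive rule in the list above (at most 3 rows, spanning more than 50% of the page height) can be written as a single predicate. A minimal sketch follows, assuming a hypothetical TableCandidate type and is_false_positive function. Only the two thresholds come from the release notes; the real detector's types and any further conditions are not shown there.

```rust
/// Hypothetical summary of a detected table; not Kreuzberg's actual type.
pub struct TableCandidate {
    pub rows: usize,
    /// Bounding-box height in the same units as the page height.
    pub height: f32,
}

/// Reject candidates with at most 3 rows whose bounding box covers
/// more than half the page height.
pub fn is_false_positive(table: &TableCandidate, page_height: f32) -> bool {
    page_height > 0.0 && table.rows <= 3 && table.height / page_height > 0.5
}
```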