Releases: kreuzberg-dev/kreuzberg
v4.9.4
Fixed
- Ruby gem build failure: add missing `max_images_per_page` field to `ImageExtractionConfig` initializer in Ruby binding (kreuzberg-rb), fixing compilation error E0063 on all platforms.
- Node binding build failure on Linux: stop removing `/usr/local/lib/node_modules` in CI disk cleanup script; npm was being deleted before `pnpm/action-setup` could use it, causing `spawn npm ENOENT`.
- Homebrew formula publish failure: grant `contents: write` permission to the `publish-homebrew` job so `gh release upload` can attach bottle artifacts (was `contents: read`).
Full Changelog: v4.9.3...v4.9.4
v4.9.3
See CHANGELOG.md for full details.
v4.9.2
Fixed
- Fix cancellation token not checked in WASM (non-tokio) path for Excel, DOC, PPT, Pages, Keynote, and Numbers extractors; cancellation was silently ignored in WASM builds
- Propagate `Cancelled` error code (9) to all bindings: Go, C FFI, Python, TypeScript, Java, C#, and C API docs now include the new code
- Fix PHP e2e embed tests calling instance methods statically: use procedural `\Kreuzberg\embed()` functions
- Fix TypeScript e2e embed tests using wrong field names (`type`/`name` → `modelType`/`value`) for embedding model config
- Fix Elixir e2e embed tests calling non-existent `embed_async/2`: use sync `embed/2`
- Fix TypeScript e2e generator missing `html_output` config mapping for styled HTML tests
- Fix `ORT_DYLIB_PATH` on Windows CI pointing to `lib/` instead of the actual DLL location
- Fix C# CI build conditional to require successful FFI build
- Add `libuv1-dev` to Linux CI system dependencies for R package builds
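The cancellation fix amounts to checking the token between units of work instead of only in the async path. Below is a minimal Python sketch of that pattern, not the project's code: `extract_sheets`, the `Cancelled` class, and the use of `threading.Event` as the token are hypothetical; only the error code 9 comes from the notes above.

```python
import threading


class Cancelled(Exception):
    """Hypothetical error type; 9 is the Cancelled code named in the notes."""
    code = 9


def extract_sheets(sheets, token: threading.Event):
    """Hypothetical extractor loop that honours a cancellation token."""
    out = []
    for sheet in sheets:
        if token.is_set():          # check before each unit of work
            raise Cancelled("extraction cancelled")
        out.append(sheet.upper())   # stand-in for real per-sheet extraction
    return out
```

Checking once per sheet (or page, or slide) keeps the overhead small while bounding how much work runs after a cancel request.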
v4.9.1
Fixed
- #754: Preserve `_internal_bindings.pyi` type stub during wheel artifact cleanup; published wheels now include inline type information for the core binding module
- Add missing `Default` impl for `PyCancellationToken` to satisfy clippy `new_without_default` lint
- Improve download resilience for `eng.traineddata` in build script: increase retries from 3 to 5, add fallback URL via `raw.githubusercontent.com`, and increase timeout to 300s
- Increase Task installer retry resilience in CI: 5 attempts with `--retry-all-errors` curl flag
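The retry-plus-fallback behaviour described for `eng.traineddata` can be sketched as follows. This is an illustrative Python sketch rather than the actual build script: `download_with_fallback` and the injected `fetch` callable are hypothetical, and treating "5 retries" as 5 attempts per URL is an assumption. Only the numbers (5 and 300s) come from the notes above.

```python
from typing import Callable, Sequence


def download_with_fallback(
    urls: Sequence[str],
    fetch: Callable[[str, float], bytes],
    attempts: int = 5,        # the notes raise the retry count from 3 to 5
    timeout: float = 300.0,   # and the timeout to 300s
) -> bytes:
    """Try each URL in order, retrying each one up to `attempts` times."""
    last_error = None
    for url in urls:                  # primary first, then fallback(s)
        for _ in range(attempts):
            try:
                return fetch(url, timeout)
            except OSError as exc:    # network-style failures only
                last_error = exc
    raise RuntimeError(f"all downloads failed: {last_error}")
```

Passing `fetch` in as a parameter keeps the retry logic testable without network access.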
v4.9.0
What's Changed
- Fix duplicated heading in markdown chunker with prepend_heading_context by @tobocop2 in #701
- chore(deps): bump pnpm/action-setup from 5 to 6 by @dependabot[bot] in #698
- chore(deps): bump actions/upload-pages-artifact from 4 to 5 by @dependabot[bot] in #711
- fix: remove duplicate output_format key and fix numeric types in OCR metadata by @kh3rld in #712
- fix: LLM embedding provider panics in server mode by @Goldziher in #714
- fix: honor ocr.enabled=false config and drop trailing newline in --format text by @kh3rld in #715
- fix: assign correct page numbers to DOCX tables based on document position by @kh3rld in #718
- fix: call enableOcr() in live demo and throw on missing Rust registry export by @kh3rld in #720
- fix: clean up stale hf-hub lock files before embedding model downloads by @kh3rld in #721
- fix: derive Default for LlmConfig to support struct-update syntax by @kh3rld in #727
- chore(deps-dev): bump pypdf from 6.10.0 to 6.10.1 in the uv group across 1 directory by @dependabot[bot] in #724
- fix: force include_elements=true in image OCR so pages[] is populated by @kh3rld in #723
- docs: correct Cargo feature name from llm to liter-llm by @kh3rld in #729
- fix: prevent doubled OCR content and page mismatch in image extractor by @kh3rld in #726
- docs: theme-colored language icons, footer nav, and card formatting by @v-tan in #731
- docs: dark-mode style fixes and content corrections by @v-tan in #734
- fix: expose structured_output on Ruby Result by @kh3rld in #736
- chore(deps-dev): bump pypdf from 6.10.1 to 6.10.2 in the uv group across 1 directory by @dependabot[bot] in #735
- chore(deps-dev): update markitdown requirement from >=0.1.4 to >=0.1.5 by @dependabot[bot] in #742
- chore(deps-dev): update maturin requirement from >=1.10.2 to >=1.13.1 by @dependabot[bot] in #743
- chore(deps-dev): update requirements for rbs and steep in /packages/ruby by @dependabot[bot] in #744
- chore(deps-dev): update prek requirement from >=0.2.21 to >=0.3.9 by @dependabot[bot] in #745
- chore(deps-dev): update pymupdf4llm requirement from >=0.0.17 to >=1.27.2.2 by @dependabot[bot] in #746
- chore(deps-dev): update pdftotext requirement from >=2.2.2 to >=3.0.0 by @dependabot[bot] in #747
- feat: Add smart document chunking that splits by topic by @tobocop2 in #733
Full Changelog: v4.8.4...v4.9.0
v4.8.6
What's Changed
- Fix duplicated heading in markdown chunker with prepend_heading_context by @tobocop2 in #701
- chore(deps): bump pnpm/action-setup from 5 to 6 by @dependabot[bot] in #698
- chore(deps): bump actions/upload-pages-artifact from 4 to 5 by @dependabot[bot] in #711
- fix: remove duplicate output_format key and fix numeric types in OCR metadata by @kh3rld in #712
- fix: LLM embedding provider panics in server mode by @Goldziher in #714
- fix: honor ocr.enabled=false config and drop trailing newline in --format text by @kh3rld in #715
- fix: assign correct page numbers to DOCX tables based on document position by @kh3rld in #718
- fix: call enableOcr() in live demo and throw on missing Rust registry export by @kh3rld in #720
- fix: clean up stale hf-hub lock files before embedding model downloads by @kh3rld in #721
- fix: derive Default for LlmConfig to support struct-update syntax by @kh3rld in #727
- chore(deps-dev): bump pypdf from 6.10.0 to 6.10.1 in the uv group across 1 directory by @dependabot[bot] in #724
- fix: force include_elements=true in image OCR so pages[] is populated by @kh3rld in #723
- docs: correct Cargo feature name from llm to liter-llm by @kh3rld in #729
- fix: prevent doubled OCR content and page mismatch in image extractor by @kh3rld in #726
- docs: theme-colored language icons, footer nav, and card formatting by @v-tan in #731
- docs: dark-mode style fixes and content corrections by @v-tan in #734
- fix: expose structured_output on Ruby Result by @kh3rld in #736
Full Changelog: v4.8.4...v4.8.6
v4.8.5
What's Changed
Added
- LLM usage tracking: new `llm_usage` field on `ExtractionResult` captures token counts, estimated cost (USD), model identifier, and finish reason for every LLM call (VLM OCR, structured extraction, LLM embeddings). Exposed across all 12 bindings.
Fixed
- Markdown chunker heading duplication when `prepend_heading_context` is enabled (#701)
- Helm chart icon 404 on Artifact Hub: `.png` → `.svg`
- Python wheel manylinux compliance: bumped to `manylinux_2_39`
- FFI memory leaks: `djot_content_json`, `structured_output_json`, `llm_usage_json` not freed
- R e2e embed tests: missing `type` discriminator in generated config
- Elixir parity test: `ExtractionConfig` missing `html_output` field
- Go LLM e2e tests: `EmbeddingModelType` missing LLM config support
- WASM tree-sitter build: removed stale `wasm` feature gate for tslp 1.6.0
- Ruby binding compilation: magnus type inference errors and missing `llm_usage` field
Full Changelog: v4.8.4...v4.8.5
v4.8.4
What's Changed
Added
- Helm chart for Kubernetes deployment: minimal, security-hardened Helm chart with Deployment, Service, Ingress, PVC, HPA, PDB, and ServiceAccount templates. Publishes to GHCR as an OCI artifact. (#695)
- Helm lint and kubeconform pre-commit hooks: added `helm lint --strict` and `kubeconform` (k8s 1.28.0 schema validation) to pre-commit and CI pipeline.
- Helm chart publish workflow: new `publish-helm.yaml` GitHub Actions workflow pushes versioned chart to `oci://ghcr.io/kreuzberg-dev/charts`.
Fixed
- Helm chart: init container cannot chown as non-root; added `securityContext.runAsUser: 0` to the init container.
- Helm chart: unpinned busybox image tags; pinned to `busybox:1.37-glibc` for reproducibility.
- Comrak bridge panics on multi-byte UTF-8 boundaries: annotation byte offsets landing inside multi-byte characters caused panics in `build_inlines()`. Snaps offsets to valid char boundaries. (#696)
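The boundary-snapping idea behind the Comrak bridge fix can be illustrated on raw UTF-8 bytes. This is a sketch only: `snap_to_char_boundary` is a hypothetical helper, and rounding the offset down (rather than up) is an assumption, since the notes do not say which direction the real fix snaps.

```python
def snap_to_char_boundary(data: bytes, offset: int) -> int:
    """Move `offset` down until it no longer points at a UTF-8 continuation byte."""
    offset = max(0, min(offset, len(data)))
    # Continuation bytes have the bit pattern 10xxxxxx.
    while 0 < offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset


# "\u2026" (horizontal ellipsis) encodes to three bytes, at indices 1..3 here.
text = "a\u2026b".encode("utf-8")
safe = snap_to_char_boundary(text, 2)   # offset 2 lands inside the ellipsis
prefix = text[:safe].decode("utf-8")    # decodes cleanly after snapping
```

Slicing at the unsnapped offset would cut the character in half. In Rust that is a panic on `str` slicing; in Python it surfaces as a decode error.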
Install via Helm
helm install kreuzberg oci://ghcr.io/kreuzberg-dev/charts/kreuzberg --version 4.8.4
Full Changelog: v4.8.3...v4.8.4
v4.8.2
Added
- `HtmlOutputConfig` typed in all bindings: `html_output` config field (themes, CSS classes, embed CSS, custom CSS, class prefix) now fully typed in Python, TypeScript/Node, Go, Ruby, Elixir, PHP, Java, C#, R, and FFI. Previously only available in Rust core.
Fixed
- PDF: legitimate repeated content stripped during page merging regardless of `strip_repeating_text` flag. `deduplicate_paragraphs()` ran unconditionally, stripping brand names and other legitimately repeated content even when `ContentFilterConfig.strip_repeating_text` was `false`. Gated both deduplication passes behind the flag (#670, #681)
- R package build failure: R binding Cargo.toml version was stuck at 4.6.3 while core was at 4.8.1, causing tokio version resolution failure. Version sync script now includes the R native extension Cargo.toml.
- CI: PyPI publish action failure; pinned `pypa/gh-action-pypi-publish` to v1.13.0 (v1.14.0 has broken Docker image on GHCR)
- E2E: Elixir generator emitted undefined `is_nan/1` function; added helper function definition to the generated Elixir test helpers
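Gating deduplication behind the flag, as in the PDF fix above, might look like the following sketch. The function name `merge_paragraphs` and the exact-match deduplication rule are hypothetical; the real `deduplicate_paragraphs()` logic is not described in these notes.

```python
def merge_paragraphs(paragraphs, strip_repeating_text: bool):
    """Drop repeated paragraphs only when the flag asks for it."""
    if not strip_repeating_text:
        return list(paragraphs)     # keep legitimate repeats such as brand names
    seen = set()
    kept = []
    for p in paragraphs:
        if p not in seen:           # assumption: exact-match comparison
            seen.add(p)
            kept.append(p)
    return kept
```

With the flag off, `["Acme", "Intro", "Acme"]` passes through unchanged; with it on, the second `"Acme"` is dropped.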
v4.8.1
Added
- Styled HTML output: new `HtmlOutputConfig` on `ExtractionConfig` with 5 built-in themes (`default`, `github`, `dark`, `light`, `unstyled`), semantic `kb-*` CSS class hooks on every structural element, CSS custom properties (`--kb-*`), custom CSS injection (inline or file), and configurable class prefix. The existing `Html` output format is upgraded in-place when `html_output` is set (#633, #665)
- 5 new CLI flags: `--html-theme`, `--html-css`, `--html-css-file`, `--html-class-prefix`, `--html-no-embed-css`; any flag implicitly sets `--content-format html`
- `HtmlOutputConfig` and `HtmlTheme` types exposed in Rust public API
Changed
- Vendored yake-rust 1.0.3 into kreuzberg core, removing external dependency
  - Fixes #676: `BacktrackLimitExceeded` panic on large files (10+ MB) by replacing regex-based sentence splitting with memchr-based approach
  - Expanded YAKE stopwords from 34 to 64 languages using kreuzberg's unified stopwords module
  - Removed 6 transitive dependencies (yake-rust, segtok, fancy-regex, streaming-stats, hashbrown, levenshtein)
- Styled HTML renderer included in the `html` feature (no separate `html-styled` feature gate)
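The #676 fix replaces a backtracking regex with a plain scan for terminator characters. The sketch below shows that idea in Python; it is not the vendored Rust code, `split_sentences` is a hypothetical name, and the rule used here (split after `.`, `!` or `?` when followed by whitespace or end of text) is an assumed simplification.

```python
def split_sentences(text: str) -> list:
    """Split after '.', '!' or '?' when followed by whitespace or end of text.

    One linear pass with no regex engine, so there is no backtracking limit
    to exceed on large inputs.
    """
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch in ".!?" and (i + 1 == len(text) or text[i + 1].isspace()):
            piece = text[start:i + 1].strip()
            if piece:
                sentences.append(piece)
            start = i + 1
    tail = text[start:].strip()     # trailing text without a terminator
    if tail:
        sentences.append(tail)
    return sentences
```

Because a dot followed by a non-space character is not treated as a boundary, strings such as `v1.5` stay intact.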
Fixed
- PPTX: panic on non-char-boundary during page boundary recomputation. Byte offsets could land inside multi-byte UTF-8 characters (e.g. `…` U+2026), causing a panic when slicing content (#674)
- PDF: `include_headers`/`include_footers` flags ignored by layout-model furniture stripping. When a layout-detection model classified paragraphs as `PageHeader` or `PageFooter`, they were unconditionally stripped as furniture regardless of `ContentFilterConfig` flag values. Setting `strip_repeating_text=false` with `include_headers=true` now correctly preserves those regions (#670)
- PDF: heuristic table detector misclassifies body text as tables on slide-like PDFs. PowerPoint-exported PDFs with column-like text gaps produced false-positive 2–3 row "tables" whose bounding boxes covered the entire page, suppressing all body text from the structured extraction pipeline. Tables with ≤3 rows spanning >50% of the page height are now rejected as false positives
- PPTX: `ImageExtractionConfig.inject_placeholders` silently ignored. Setting `inject_placeholders=false` now correctly suppresses image references in PPTX markdown output (#671, #677)
- DOCX/HTML/DocBook/LaTeX/RST: `inject_placeholders` config ignored. All extractors now honour `ImageExtractionConfig.inject_placeholders` to suppress image reference injection when set to `false`
- PPTX public API cleanup: `extract_pptx_from_path` and `extract_pptx_from_bytes` now accept `&PptxExtractionOptions` instead of 6 positional parameters
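The table false-positive rule stated above (≤3 rows spanning >50% of the page height) reduces to a small predicate. The sketch below is illustrative: the function name and parameters are hypothetical, and the real detector presumably applies other checks as well.

```python
def is_false_positive_table(row_count: int, table_height: float, page_height: float) -> bool:
    """Reject tables with at most 3 rows that cover more than half the page height."""
    if page_height <= 0:            # guard against degenerate page sizes
        return False
    return row_count <= 3 and (table_height / page_height) > 0.5
```

For a 792pt-high page, a 2-row "table" 600pt tall is rejected, while a 3-row table 300pt tall or a 10-row table 700pt tall is kept.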