Releases: feyninc/chunk
v0.10.2
What's Changed
- feat: publish PyEmscripten (Pyodide) wheels to PyPI by @chonk-lain in #4
- chore: bump version by @chonk-lain in #5
- chore: fix failing CI by @chonk-lain in #6
New Contributors
- @chonk-lain made their first contribution in #4
Full Changelog: v0.10.1...v0.10.2
v0.10.1
What's New
.patterns() API now available in Python and JavaScript
The multi-byte pattern support from v0.10.0 is now exposed in all binding layers:
Python:
from chonkie_core import Chunker, chunk, chunk_offsets
# Composable with delimiters
chunks = list(Chunker(text, delimiters="\n.?!", patterns=["。", ",", "!"]))
# Convenience function
for c in chunk(text, delimiters=".", patterns=["。"]):
    print(bytes(c))
# Offsets
offsets = chunk_offsets(text, delimiters=".", patterns=["。"])
JavaScript:
import { chunk, chunk_offsets, Chunker } from '@chonkiejs/chunk';
// Generator
for (const c of chunk(text, { delimiters: ".", patterns: ["。", ","] })) { ... }
// Offsets
const offsets = chunk_offsets(text, { delimiters: ".", patterns: ["。"] });
// Class
const chunker = new Chunker(text, { delimiters: ".", patterns: ["。"] });
This release also fixes an existing bug in the JS wrapper where the consecutive and forwardFallback options weren't being passed through in the non-pattern code path.
v0.10.0
What's New
Multi-byte pattern support for chunking (Closes #2)
New .patterns(&[&str]) API on Chunker and OwnedChunker — composable with .delimiters() for mixed ASCII + multi-byte delimiter chunking.
chunk(content.as_bytes())
    .delimiters(b"\n.?!")
    .patterns(&["。", ",", "!"])
    .forward_fallback()
    .size(4096)
Highlights
- Hybrid search strategy: memmem (SIMD) for 1-3 patterns, Aho-Corasick for 4+ — automatically selected
- Zero regression: pure delimiter chunking stays at 70+ GiB/s
- Composable: .delimiters() and .patterns() work together, picking the best split point across both
- UTF-8 safe: multi-byte characters are never split mid-codepoint when using .patterns() + .forward_fallback()
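To make the "best split point across both" idea concrete, here is a minimal pure-Python sketch. It is not the crate's implementation: `best_split` and `chunk_all` are hypothetical names, the search uses plain `bytes.rfind` rather than memmem or Aho-Corasick, and the hard cut used when nothing matches is an assumption that does not model `.forward_fallback()`.

```python
from typing import List


def best_split(data: bytes, start: int, size: int,
               delimiters: bytes, patterns: List[str]) -> int:
    """Return the end offset of the chunk that begins at `start` (hypothetical helper)."""
    hard_end = min(start + size, len(data))
    if hard_end == len(data):
        return hard_end  # the remainder fits in one chunk
    window = data[start:hard_end]
    best = -1
    # Single-byte delimiters: split right after the last occurrence of any of them.
    for d in delimiters:
        pos = window.rfind(bytes([d]))
        if pos != -1:
            best = max(best, pos + 1)
    # Multi-byte patterns: match the whole UTF-8 sequence and split after it.
    for p in patterns:
        encoded = p.encode("utf-8")
        pos = window.rfind(encoded)
        if pos != -1:
            best = max(best, pos + len(encoded))
    # Assumption: with no candidate in the window, fall back to a hard cut,
    # which may land inside a multi-byte character.
    return start + best if best > 0 else hard_end


def chunk_all(data: bytes, size: int, delimiters: bytes, patterns: List[str]) -> List[bytes]:
    """Repeatedly apply best_split until the input is consumed."""
    chunks = []
    start = 0
    while start < len(data):
        end = best_split(data, start, size, delimiters, patterns)
        chunks.append(data[start:end])
        start = end
    return chunks


sample = "Hello. こんにちは。元気ですか".encode("utf-8")
for piece in chunk_all(sample, 20, b".", ["。"]):
    print(piece.decode("utf-8"))
```

With a 20-byte limit, the first window contains only the ASCII period and the second only the three-byte "。", so each chunk ends at whichever candidate lies furthest into its window.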
Fixes
- Fixed #2: passing multi-byte UTF-8 characters (e.g. CJK punctuation) to .delimiters() was decomposing them into individual bytes, causing mid-character splits
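The failure mode behind #2 can be reproduced with plain Python byte handling. The snippet below is an illustration only and does not call the library: it shows why treating "。" as a set of single bytes also matches bytes inside unrelated characters, whereas matching the full three-byte sequence does not.

```python
# Illustration only: plain Python, not the chunk crate's code.
stop_bytes = "。".encode("utf-8")   # U+3002 encodes as three bytes: e3 80 82

# Treating the character as single-byte delimiters yields three unrelated bytes.
byte_delimiters = set(stop_bytes)

text = "一、二。".encode("utf-8")

# Byte-set matching: every position whose byte is in the set counts as a delimiter.
byte_hits = [i for i, b in enumerate(text) if b in byte_delimiters]

# Pattern matching: only the complete three-byte sequence counts.
pattern_hits = []
start = 0
while (pos := text.find(stop_bytes, start)) != -1:
    pattern_hits.append(pos)
    start = pos + len(stop_bytes)


def is_valid_utf8(chunk: bytes) -> bool:
    """Return True if the chunk decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Splitting after the lead byte of "、" (offset 3) leaves a truncated character.
broken_chunk = text[:4]
# Splitting after the full pattern keeps every character intact.
clean_chunk = text[:pattern_hits[0] + len(stop_bytes)]
```

Here `byte_hits` includes offsets inside "一" and "、" because those characters share the bytes 0x80 and 0xe3 with "。", while `pattern_hits` contains only the offset of the real full stop.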
v0.9.3
v0.9.2
v0.9.1
v0.9.0
What's New
Features
- Savitzky-Golay filter module for semantic chunking
  - savgol_filter: Signal smoothing and derivatives via polynomial fitting
  - find_local_minima_interpolated: Local minima detection using SavGol derivatives
  - windowed_cross_similarity: Cosine similarity for semantic chunking
  - filter_split_indices: Percentile filtering with minimum distance
- NumPy array support in Python bindings for zero-copy performance
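As a rough picture of what the filter module is for, the sketch below smooths a similarity curve with the standard 5-point quadratic Savitzky-Golay coefficients and then looks for local minima as candidate split points. The function names `smooth_5pt` and `local_minima` are hypothetical, the edge handling is simplified, and none of this reflects the actual signatures or interpolation used by `savgol_filter` or `find_local_minima_interpolated`.

```python
# Standard 5-point, polynomial-order-2 Savitzky-Golay smoothing coefficients.
SAVGOL_5_2 = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]


def smooth_5pt(values):
    """Smooth interior points; the two points at each edge are left unchanged (simplification)."""
    out = list(values)
    for i in range(2, len(values) - 2):
        out[i] = sum(c * values[i + k - 2] for k, c in enumerate(SAVGOL_5_2))
    return out


def local_minima(values):
    """Indices whose value is below the left neighbour and not above the right one."""
    return [
        i for i in range(1, len(values) - 1)
        if values[i - 1] > values[i] <= values[i + 1]
    ]


# Hypothetical similarity scores between neighbouring sentence windows.
similarities = [0.90, 0.88, 0.91, 0.72, 0.40, 0.45, 0.38, 0.70, 0.89, 0.90, 0.87, 0.91]
candidate_splits = local_minima(smooth_5pt(similarities))
print(candidate_splits)
```

A quadratic fit over five points reproduces any quadratic signal exactly, which is a convenient sanity check for the coefficients.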
Previous (0.8.0)
- Multi-byte pattern splitting with Aho-Corasick
- Refactored merge API with string joining in Rust
v0.8.0
What's New
Token-aware Merging for RecursiveChunker
- Added merge_splits function to Rust, Python, and JavaScript bindings
- Equivalent to Chonkie's Cython _merge_splits function
- Supports whitespace-aware merging (n-1 join tokens for n segments)
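The merging behaviour can be pictured as a greedy pass over the per-segment token counts. The following is a pure-Python sketch under stated assumptions, not the Rust implementation: `merge_splits_sketch` is a hypothetical name, and the `combine_whitespace` flag models the "n-1 join tokens for n segments" rule by charging one extra token for each join inside a group.

```python
def merge_splits_sketch(token_counts, chunk_size, combine_whitespace=False):
    """Greedily group consecutive segments so each group stays within chunk_size.

    Returns (indices, merged_counts), where indices[i] is the exclusive end
    index of group i in the original segment list.
    """
    indices, merged_counts = [], []
    current = 0          # tokens accumulated in the open group
    segments_in_group = 0
    for i, n in enumerate(token_counts):
        # Assumption: joining two segments with whitespace costs one token.
        joiner = 1 if (combine_whitespace and segments_in_group > 0) else 0
        if segments_in_group > 0 and current + joiner + n > chunk_size:
            # Close the open group before segment i and start a new one with it.
            indices.append(i)
            merged_counts.append(current)
            current, segments_in_group = n, 1
        else:
            current += joiner + n
            segments_in_group += 1
    if segments_in_group > 0:
        indices.append(len(token_counts))
        merged_counts.append(current)
    return indices, merged_counts


indices, counts = merge_splits_sketch([1, 1, 1, 1, 1, 1, 1], chunk_size=3)
print(indices, counts)
```

For seven one-token segments and a chunk size of 3, the sketch yields the same indices and token counts as the usage examples in these notes. A segment larger than `chunk_size` ends up in a group of its own, which is a choice made for this sketch.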
Usage
Rust:
use chunk::merge_splits;
let token_counts = vec![1, 1, 1, 1, 1, 1, 1];
let result = merge_splits(&token_counts, 3, false);
// result.indices = [3, 6, 7]
// result.token_counts = [3, 3, 1]
Python:
from chonkie_core import merge_splits
result = merge_splits([1, 1, 1, 1, 1, 1, 1], chunk_size=3)
# result.indices = [3, 6, 7]
# result.token_counts = [3, 3, 1]
JavaScript:
import { init, merge_splits } from '@chonkiejs/chunk';
await init();
const result = merge_splits([1, 1, 1, 1, 1, 1, 1], 3);
// result.indices = [3, 6, 7]
// result.tokenCounts = [3, 3, 1]
v0.7.0
What's New
- New Feature: split_at_delimiters for delimiter-based text splitting
  - Splits at every delimiter occurrence (unlike chunk which is size-based)
  - Supports include_delim option: "prev", "next", or "none"
  - Supports min_chars for merging short segments
- Refactored: Modular Rust crate structure (chunk.rs, split.rs, delim.rs)
- Optimized: Single-pass min_chars merging and Vec pre-allocation
- Python (chonkie-core): Added split_offsets()
- JavaScript (@chonkiejs/chunk): Added split() and split_offsets()
- Fixed: Updated bump-version script to include all version locations
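The three include_delim modes and the min_chars merge are easiest to see on a small input. The sketch below is plain Python, not the library's code: `split_sketch` is a hypothetical name, it works on characters rather than bytes, and the policy of merging a too-short segment into the one that follows it is an assumption.

```python
def split_sketch(text, delimiters, include_delim="prev", min_chars=0):
    """Split at every delimiter; attach it to the previous segment, the next one, or drop it."""
    if include_delim not in ("prev", "next", "none"):
        raise ValueError("include_delim must be 'prev', 'next', or 'none'")
    segments = []
    start = 0
    for i, ch in enumerate(text):
        if ch not in delimiters:
            continue
        if include_delim == "prev":
            segments.append(text[start:i + 1])  # delimiter ends this segment
            start = i + 1
        elif include_delim == "next":
            segments.append(text[start:i])      # delimiter begins the next segment
            start = i
        else:
            segments.append(text[start:i])      # delimiter is discarded
            start = i + 1
    segments.append(text[start:])
    segments = [s for s in segments if s]       # drop empty pieces

    # Assumption: a segment shorter than min_chars absorbs the segment after it.
    merged = []
    for seg in segments:
        if merged and len(merged[-1]) < min_chars:
            merged[-1] += seg
        else:
            merged.append(seg)
    return merged


print(split_sketch("a.b?c", ".?", include_delim="prev"))
print(split_sketch("a.b?c", ".?", include_delim="next"))
print(split_sketch("a.b?c", ".?", include_delim="none"))
print(split_sketch("Hi. Ok. This is longer.", ".", min_chars=5))
```

Unlike size-based chunking, every delimiter produces a boundary here, so segment lengths depend only on the text. min_chars is the only knob that joins segments back together.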
v0.6.0
What's New
- New Feature: split_at_delimiters for delimiter-based text splitting
  - Splits at every delimiter occurrence (unlike chunk which is size-based)
  - Supports include_delim option: "prev", "next", or "none"
  - Supports min_chars for merging short segments
- Refactored: Modular Rust crate structure (chunk.rs, split.rs, delim.rs)
- Optimized: Single-pass min_chars merging and Vec pre-allocation
- Python (chonkie-core): Added split_offsets()
- JavaScript (@chonkiejs/chunk): Added split() and split_offsets()