Releases: feyninc/chunk

v0.10.2

@chonk-lain chonk-lain released this 28 May 19:10
de1e128

What's Changed

New Contributors

Full Changelog: v0.10.1...v0.10.2

v0.10.1

@chonknick chonknick released this 30 Mar 06:13

What's New

.patterns() API now available in Python and JavaScript

The multi-byte pattern support from v0.10.0 is now exposed in all binding layers:

Python:

from chonkie_core import Chunker, chunk, chunk_offsets

# Composable with delimiters
chunks = list(Chunker(text, delimiters="\n.?!", patterns=["。", "，", "！"]))

# Convenience function
for c in chunk(text, delimiters=".", patterns=["。"]):
    print(bytes(c))

# Offsets
offsets = chunk_offsets(text, delimiters=".", patterns=["。"])

JavaScript:

import { chunk, chunk_offsets, Chunker } from '@chonkiejs/chunk';

// Generator
for (const c of chunk(text, { delimiters: ".", patterns: ["。", "，"] })) { ... }

// Offsets
const offsets = chunk_offsets(text, { delimiters: ".", patterns: ["。"] });

// Class
const chunker = new Chunker(text, { delimiters: ".", patterns: ["。"] });

This release also fixes a bug in the JS wrapper where the consecutive and forwardFallback options were not passed through in the non-pattern code path.

v0.10.0

@chonknick chonknick released this 29 Mar 06:58

What's New

Multi-byte pattern support for chunking (Closes #2)

New .patterns(&[&str]) API on Chunker and OwnedChunker — composable with .delimiters() for mixed ASCII + multi-byte delimiter chunking.

chunk(content.as_bytes())
    .delimiters(b"\n.?!")
    .patterns(&["。", "，", "！"])
    .forward_fallback()
    .size(4096)

Highlights

  • Hybrid search strategy: memmem (SIMD) for 1-3 patterns, Aho-Corasick for 4+ — automatically selected
  • Zero regression: pure delimiter chunking stays at 70+ GiB/s
  • Composable: .delimiters() and .patterns() work together, picking the best split point across both
  • UTF-8 safe: multi-byte characters are never split mid-codepoint when using .patterns() + .forward_fallback()
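
As a rough illustration of how byte delimiters and multi-byte patterns can be combined when picking a split point, here is a pure-Python sketch. It is not the crate's implementation: chunk_sketch and last_match_end are hypothetical names, bytes.find and bytes.rfind stand in for the memmem and Aho-Corasick searches, and the fallback rule (take the first match that ends beyond the window when none fits inside it) is only an assumption about what a forward fallback might look like.

```python
def last_match_end(window: bytes, needles: list[bytes]) -> int:
    """Offset just past the last needle that fits entirely in window, or -1."""
    best = -1
    for needle in needles:
        pos = window.rfind(needle)
        if pos != -1:
            best = max(best, pos + len(needle))
    return best


def chunk_sketch(data: bytes, size: int, delimiters: bytes = b"", patterns=()) -> list[bytes]:
    """Split data into chunks of roughly `size` bytes, cutting after a delimiter or pattern."""
    # Single-byte delimiters and multi-byte patterns are searched as one set of needles.
    needles = [bytes([d]) for d in delimiters]
    needles += [p.encode("utf-8") for p in patterns if p]
    chunks, start = [], 0
    while start < len(data):
        if len(data) - start <= size:
            chunks.append(data[start:])
            break
        end = last_match_end(data[start:start + size], needles)
        if end != -1:
            cut = start + end
        else:
            # Assumed fallback: nothing fits in the window, so use the first
            # match from `start`, which necessarily ends beyond the window.
            ends = [pos + len(n) for n in needles if (pos := data.find(n, start)) != -1]
            cut = min(ends) if ends else len(data)
        chunks.append(data[start:cut])
        start = cut
    return chunks


data = "Hi. 你好。Bye".encode("utf-8")
for piece in chunk_sketch(data, 8, delimiters=b".", patterns=["。"]):
    print(piece.decode("utf-8"))
```

Because every cut in this sketch lands immediately after a complete delimiter or pattern, each chunk decodes as valid UTF-8, provided the delimiters are ASCII and the patterns are whole characters.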

Fixes

  • Fixed #2: passing multi-byte UTF-8 characters (e.g. CJK punctuation) to .delimiters() was decomposing them into individual bytes, causing mid-character splits
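
To see why decomposing a multi-byte delimiter into individual bytes causes mid-character splits, the following plain-Python sketch compares per-byte matching with whole-sequence matching. It does not use the chunk library; byte_hits and pattern_hits are hypothetical helper names introduced only for this illustration.

```python
def byte_hits(data: bytes, delimiter: str) -> list[int]:
    """Positions matched when the delimiter is decomposed into individual bytes."""
    delimiter_bytes = set(delimiter.encode("utf-8"))
    return [i for i, b in enumerate(data) if b in delimiter_bytes]


def pattern_hits(data: bytes, pattern: str) -> list[int]:
    """Start positions matched when the delimiter is searched as a whole byte sequence."""
    needle = pattern.encode("utf-8")
    hits, pos = [], data.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = data.find(needle, pos + len(needle))
    return hits


text = "一、二。".encode("utf-8")
print(byte_hits(text, "。"))     # per-byte matching also hits bytes of 一 and 、
print(pattern_hits(text, "。"))  # whole-sequence matching hits only the real 。
```

The ideographic comma 、 encodes as E3 80 81 and the full stop 。 as E3 80 82, so they share their first two bytes. Per-byte matching therefore reports positions inside 、, and cutting there leaves an incomplete UTF-8 sequence.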

v0.9.3

@chonknick chonknick released this 21 Jan 09:15

  • Add Node.js support for WASM initialization
  • Detect Node.js environment and use fs to read .wasm file
  • Fix initSync call to use object parameter (new wasm-bindgen API)

v0.9.2

@chonknick chonknick released this 21 Jan 09:06

Add merge_splits function to WASM bindings

v0.9.1

@chonknick chonknick released this 21 Jan 09:01

Fix npm publish to include WASM files in pkg/ folder

v0.9.0

@chonknick chonknick released this 21 Jan 07:49

What's New

Features

  • Savitzky-Golay filter module for semantic chunking
    • savgol_filter: Signal smoothing and derivatives via polynomial fitting
    • find_local_minima_interpolated: Local minima detection using SavGol derivatives
    • windowed_cross_similarity: Cosine similarity for semantic chunking
    • filter_split_indices: Percentile filtering with minimum distance
  • NumPy array support in Python bindings for zero-copy performance
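
For readers unfamiliar with Savitzky-Golay smoothing, here is a minimal pure-Python sketch of the idea: fit a low-order polynomial over a sliding window and keep its value at the center, then look for local minima in the smoothed curve. The release notes do not show the module's signatures, so smooth and local_minima are hypothetical names, and the kernel is fixed to the standard 5-point quadratic coefficients rather than computed for arbitrary window sizes.

```python
# Standard 5-point, quadratic Savitzky-Golay smoothing coefficients.
SAVGOL_5_2 = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def smooth(signal: list[float]) -> list[float]:
    """Smooth interior points with the 5-point kernel; the two points at each edge are copied."""
    out = list(signal)
    for i in range(2, len(signal) - 2):
        out[i] = sum(c * signal[i + k - 2] for k, c in enumerate(SAVGOL_5_2))
    return out


def local_minima(values: list[float]) -> list[int]:
    """Indices of interior points lower than the left neighbor and no higher than the right."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i] < values[i - 1] and values[i] <= values[i + 1]
    ]


# Hypothetical similarity scores between adjacent sentence windows.
similarities = [0.9, 0.85, 0.8, 0.4, 0.75, 0.9, 0.88, 0.86, 0.3, 0.8, 0.9]
print(local_minima(smooth(similarities)))
```

In a semantic chunker, dips in similarity between neighboring windows are candidate split points. Smoothing first keeps small fluctuations from producing spurious minima.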

Previous (0.8.0)

  • Multi-byte pattern splitting with Aho-Corasick
  • Refactored merge API with string joining in Rust

v0.8.0

@chonknick chonknick released this 21 Jan 04:46

What's New

Token-aware Merging for RecursiveChunker

  • Added merge_splits function to Rust, Python, and JavaScript bindings
  • Equivalent to Chonkie's Cython _merge_splits function
  • Supports whitespace-aware merging (n-1 join tokens for n segments)

Usage

Rust:

use chunk::merge_splits;

let token_counts = vec![1, 1, 1, 1, 1, 1, 1];
let result = merge_splits(&token_counts, 3, false);
// result.indices = [3, 6, 7]
// result.token_counts = [3, 3, 1]

Python:

from chonkie_core import merge_splits

result = merge_splits([1, 1, 1, 1, 1, 1, 1], chunk_size=3)
# result.indices = [3, 6, 7]
# result.token_counts = [3, 3, 1]

JavaScript:

import { init, merge_splits } from '@chonkiejs/chunk';

await init();
const result = merge_splits([1, 1, 1, 1, 1, 1, 1], 3);
// result.indices = [3, 6, 7]
// result.tokenCounts = [3, 3, 1]
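
The merging behavior shown in the usage examples above can be approximated with a short greedy loop. This is a pure-Python sketch, not the library's code: merge_splits_sketch and the combine_whitespace parameter name are assumptions made for illustration, and the real function may differ in edge cases such as segments larger than chunk_size.

```python
def merge_splits_sketch(token_counts, chunk_size, combine_whitespace=False):
    """Greedily group consecutive segments so each group stays within chunk_size tokens.

    Returns (indices, merged_counts), where indices[i] is the exclusive end
    index of group i in the original segment list.
    """
    indices, merged_counts = [], []
    current = 0      # tokens in the group being built
    in_group = 0     # number of segments in that group
    for i, count in enumerate(token_counts):
        # Joining n segments with whitespace costs n - 1 extra tokens,
        # so every segment after the first in a group adds one join token.
        join_cost = 1 if (combine_whitespace and in_group > 0) else 0
        if in_group > 0 and current + join_cost + count > chunk_size:
            indices.append(i)
            merged_counts.append(current)
            current, in_group = count, 1
        else:
            current += join_cost + count
            in_group += 1
    if in_group > 0:
        indices.append(len(token_counts))
        merged_counts.append(current)
    return indices, merged_counts


print(merge_splits_sketch([1, 1, 1, 1, 1, 1, 1], 3))
print(merge_splits_sketch([1, 1, 1, 1, 1, 1, 1], 3, combine_whitespace=True))
```

With whitespace-aware merging enabled, fewer segments fit into each group because the join tokens count against the budget.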

v0.7.0

@chonknick chonknick released this 21 Jan 04:01

What's New

  • New Feature: split_at_delimiters for delimiter-based text splitting
    • Splits at every delimiter occurrence (unlike chunk, which is size-based)
    • Supports include_delim option: "prev", "next", or "none"
    • Supports min_chars for merging short segments
  • Refactored: Modular Rust crate structure (chunk.rs, split.rs, delim.rs)
  • Optimized: Single-pass min_chars merging and Vec pre-allocation
  • Python (chonkie-core): Added split_offsets()
  • JavaScript (@chonkiejs/chunk): Added split() and split_offsets()
  • Fixed: Updated bump-version script to include all version locations
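
The three include_delim modes are easiest to compare side by side. The sketch below is plain Python operating on str, not the library's split_at_delimiters; split_sketch is a hypothetical name, and dropping empty segments is an assumption made to keep the output readable.

```python
def split_sketch(text: str, delimiters: str, include_delim: str = "prev") -> list[str]:
    """Split at every delimiter, attaching it to the previous or next segment, or dropping it."""
    if include_delim not in ("prev", "next", "none"):
        raise ValueError(f"unknown include_delim: {include_delim!r}")
    segments, start = [], 0
    for i, ch in enumerate(text):
        if ch not in delimiters:
            continue
        if include_delim == "prev":
            segments.append(text[start:i + 1])  # delimiter closes the current segment
            start = i + 1
        elif include_delim == "next":
            segments.append(text[start:i])      # delimiter opens the next segment
            start = i
        else:
            segments.append(text[start:i])      # delimiter is discarded
            start = i + 1
    segments.append(text[start:])
    return [s for s in segments if s]           # assumption: empty segments are dropped


for mode in ("prev", "next", "none"):
    print(mode, split_sketch("a.b.c", ".", include_delim=mode))
```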

v0.6.0

@chonknick chonknick released this 21 Jan 03:49

What's New

  • New Feature: split_at_delimiters for delimiter-based text splitting
    • Splits at every delimiter occurrence (unlike chunk, which is size-based)
    • Supports include_delim option: "prev", "next", or "none"
    • Supports min_chars for merging short segments
  • Refactored: Modular Rust crate structure (chunk.rs, split.rs, delim.rs)
  • Optimized: Single-pass min_chars merging and Vec pre-allocation
  • Python (chonkie-core): Added split_offsets()
  • JavaScript (@chonkiejs/chunk): Added split() and split_offsets()
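
One plausible way to merge short segments in a single pass is to keep appending to the last segment until it reaches min_chars. The sketch below is an assumption about the policy, not the crate's code: merge_short is a hypothetical name, and folding a short trailing segment into its predecessor is a choice made here for illustration.

```python
def merge_short(segments: list[str], min_chars: int) -> list[str]:
    """Merge segments forward in one pass until each has at least min_chars characters."""
    merged: list[str] = []
    for seg in segments:
        if merged and len(merged[-1]) < min_chars:
            merged[-1] += seg        # previous segment is still too short: extend it
        else:
            merged.append(seg)
    # Assumed policy: a short trailing segment is folded into the one before it.
    if len(merged) > 1 and len(merged[-1]) < min_chars:
        tail = merged.pop()
        merged[-1] += tail
    return merged


print(merge_short(["Hi.", " Ok.", " This is long.", " End."], min_chars=6))
```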