
feat: expose CondensedTree via cluster_with_tree() + opt-in serde derives#46

Open
0Xjuanca wants to merge 1 commit into tom-whitehead:master from 0Xjuanca:feat/expose-condensed-tree-via-cluster-with-tree


@0Xjuanca 0Xjuanca commented May 9, 2026

Summary

Small additive change that exposes the condensed cluster tree as a public output of clustering, alongside opt-in serde derives. Default behaviour is unchanged; existing public API is preserved exactly.

Use case

I'm using hdbscan in a BERTopic-style topic-modeling pipeline that needs to assign new (previously-unseen) document embeddings to existing clusters without re-running the full algorithm. This is the same problem Python's hdbscan.approximate_predict solves: it walks the condensed tree from a new point's nearest training neighbour to find the smallest enclosing cluster.

For a Rust implementation, I need:

  1. The condensed tree from a fit (currently condense_tree is private and CondensedNode<T> is pub(crate)).
  2. Serde-serializability so the tree can be persisted as part of the fit artifact.
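To make the use case concrete, here is a minimal sketch of the tree walk described above, loosely following the logic of Python's `approximate_predict`. It is not the crate's code: the `Node` struct, its field names, `row_for`, `enclosing_cluster`, and `demo_tree` are all hypothetical stand-ins, and the sketch assumes the new point's nearest training neighbour and its lambda value (1 / distance) have already been computed elsewhere.

```rust
/// Hypothetical stand-in for the crate's `CondensedNode<f64>`.
/// Field names are assumptions made for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub node_id: usize,        // a point id (size == 1) or a cluster id (size > 1)
    pub parent_node_id: usize, // the cluster this node fell out of
    pub lambda_birth: f64,     // 1 / distance at which it left its parent
    pub size: usize,
}

/// Finds the tree row whose child is `id`. The root cluster has no row.
fn row_for(tree: &[Node], id: usize) -> Option<&Node> {
    tree.iter().find(|n| n.node_id == id)
}

/// Walks up from the nearest training neighbour to the smallest cluster
/// that still exists at density `lambda`. Returns `None` if the neighbour
/// is not in the tree. Mapping the returned cluster id to a label (or to
/// noise) is a separate step that is not shown here.
pub fn enclosing_cluster(tree: &[Node], neighbour: usize, lambda: f64) -> Option<usize> {
    let leaf = row_for(tree, neighbour)?;
    let mut cluster = leaf.parent_node_id;
    // The neighbour left its cluster at or below the new point's lambda,
    // so the new point is placed in that same cluster.
    if leaf.lambda_birth <= lambda {
        return Some(cluster);
    }
    // Otherwise climb while the current cluster was born at a density
    // the new point does not reach. The loop stops at the root.
    while let Some(row) = row_for(tree, cluster) {
        if row.lambda_birth < lambda {
            break;
        }
        cluster = row.parent_node_id;
    }
    Some(cluster)
}

/// Toy tree: root cluster 5 splits into clusters 6 and 7 at lambda 0.5.
pub fn demo_tree() -> Vec<Node> {
    let n = |node_id, parent_node_id, lambda_birth, size| Node {
        node_id,
        parent_node_id,
        lambda_birth,
        size,
    };
    vec![
        n(6, 5, 0.5, 2),
        n(7, 5, 0.5, 3),
        n(0, 6, 2.0, 1),
        n(1, 6, 2.0, 1),
        n(2, 7, 1.0, 1),
        n(3, 7, 1.0, 1),
        n(4, 7, 1.0, 1),
    ]
}
```

With the toy tree, a new point whose nearest neighbour is point 0 stays in cluster 6 for any lambda above 0.5 and falls back to the root cluster 5 once its lambda drops to 0.5 or below. A linear scan per lookup keeps the sketch short; a real implementation would likely index rows by `node_id`.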

Changes

  • New public API: Hdbscan::cluster_with_tree() -> Result<(Vec<i32>, CondensedTree<T>), HdbscanError> returns both labels and the condensed tree from a single fit pass. cluster() is refactored to delegate to a private cluster_internal() helper shared with cluster_with_tree(); public cluster() signature is preserved exactly — no behaviour change for existing callers.
  • Visibility lift: CondensedNode<T> (in data_wrappers.rs) and the CondensedTree<T> type alias (in hdbscan.rs) elevated from pub(crate) to pub. CondensedNode<T> marked #[non_exhaustive] so additional fields can be added in future revisions without breaking external consumers.
  • New serde feature gate: opt-in serde derives on CondensedNode<T>. Off by default — no new dependencies pulled in. Downstream users opt in via features = ["serde"].
  • Test coverage: #[cfg(all(test, feature = "serde"))] bincode roundtrip on Vec<CondensedNode<f64>> verifies the generic-bound serde derives work for T: Float + Serialize.
  • Lib re-exports: CondensedNode and CondensedTree now exposed from the crate root.
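The shape of the changes listed above can be sketched roughly as follows. This is an illustration, not the diff itself: `Clusterer`, `ClusterError`, the node field names, and the placeholder clustering logic are hypothetical, and the sketch assumes `CondensedTree<T>` is an alias for `Vec<CondensedNode<T>>`. Only the delegation pattern, the `#[non_exhaustive]` marker, and the feature-gated derive correspond to what the PR describes.

```rust
/// Hypothetical error type standing in for the crate's `HdbscanError`.
#[derive(Debug, PartialEq)]
pub enum ClusterError {
    EmptyDataset,
}

/// Toy stand-in for the node type; field names are assumptions.
/// `#[non_exhaustive]` lets fields be added later without breaking
/// downstream crates. The serde derives only exist when the (opt-in)
/// `serde` feature is enabled, so the default build needs no serde.
#[derive(Debug, Clone, PartialEq)]
#[non_exhaustive]
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
pub struct CondensedNode<T> {
    pub node_id: usize,
    pub parent_node_id: usize,
    pub lambda_birth: T,
    pub size: usize,
}

/// Assumed alias: the tree is a flat list of nodes.
pub type CondensedTree<T> = Vec<CondensedNode<T>>;

/// Hypothetical clusterer used only to show the delegation pattern.
pub struct Clusterer<'a> {
    data: &'a [Vec<f64>],
}

impl<'a> Clusterer<'a> {
    pub fn new(data: &'a [Vec<f64>]) -> Self {
        Clusterer { data }
    }

    /// Existing entry point: same signature as before, now a thin wrapper
    /// that discards the tree.
    pub fn cluster(&self) -> Result<Vec<i32>, ClusterError> {
        self.cluster_internal().map(|(labels, _tree)| labels)
    }

    /// New entry point: returns the labels and the tree from the same pass.
    pub fn cluster_with_tree(&self) -> Result<(Vec<i32>, CondensedTree<f64>), ClusterError> {
        self.cluster_internal()
    }

    /// Shared private helper. Placeholder logic: every point goes into a
    /// single cluster whose id is `n`; a real fit would run HDBSCAN here.
    fn cluster_internal(&self) -> Result<(Vec<i32>, CondensedTree<f64>), ClusterError> {
        if self.data.is_empty() {
            return Err(ClusterError::EmptyDataset);
        }
        let n = self.data.len();
        let labels = vec![0; n];
        let tree = (0..n)
            .map(|i| CondensedNode {
                node_id: i,
                parent_node_id: n,
                lambda_birth: 1.0,
                size: 1,
            })
            .collect();
        Ok((labels, tree))
    }
}
```

Because both public methods go through the same helper, the labels returned by `cluster()` and `cluster_with_tree()` cannot drift apart, which is what keeps the change non-breaking for existing callers.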

Compatibility

  • Default features: unchanged. No new dependencies pulled in unless serde is enabled.
  • Existing cluster() callers: unchanged. Signature preserved exactly; refactor is internal-only.
  • cluster_par() callers: untouched. A symmetric cluster_par_with_tree() would be straightforward to add, but it is left out of scope here to keep the PR small; happy to land it separately if you prefer.

LOC delta

~30 LOC across 4 files.

Test plan

  • cargo test — default features: 14 integration tests + 5 doc-tests, all pass
  • cargo test --features serde — serial + serde: existing tests + new bincode roundtrip, all pass
  • cargo build --features serde — clean compile, no warnings

Happy to iterate on naming (cluster_with_tree vs alternatives) or scope (e.g., add the parallel symmetry now if you'd prefer one PR).

…ives

Adds a public API for accessing the condensed cluster tree alongside an
opt-in `serde` feature gate, enabling custom inference on new points
(approximate_predict-style) in downstream applications.

* `Hdbscan::cluster_with_tree() -> Result<(Vec<i32>, CondensedTree<T>), HdbscanError>`
  is a new public method that returns both cluster labels and the
  condensed tree used internally to derive them. Mirrors the semantics
  of `cluster()` exactly; `cluster()` is refactored to delegate to a
  private internal helper, preserving its public signature unchanged.
* `CondensedNode<T>` and `CondensedTree<T>` alias elevated from
  `pub(crate)` to `pub`, with `CondensedNode<T>` marked
  `#[non_exhaustive]` for forward compatibility.
* New `serde` feature gate provides opt-in `Serialize` / `Deserialize`
  derives on `CondensedNode<T>` (off by default; no new deps pulled in
  unless enabled).
* `#[cfg(all(test, feature = "serde"))]` bincode roundtrip test on
  `Vec<CondensedNode<f64>>` verifies the generic-bound serde derives
  work for `T: Float + Serialize`.

Default features and `cluster()`/`cluster_par()` behaviour are unchanged.
@tom-whitehead
Owner

Hey @0Xjuanca, thanks for raising a PR. There's actually a PR open at the moment from another contributor that exposes the condensed tree, among other things. I'm waiting for that contributor to make the requested changes, but if that one gets merged this PR won't be necessary.

