Chalcedon

Fast, memory-efficient Butina clustering and train/validation/test splitting for chemical datasets. Splitting by cluster keeps structurally similar molecules in the same partition, which minimizes data leakage between splits and gives a more reliable estimate of how well your models generalize.

Installation

uv pip install chalcedon

Quick start

import chalcedon

smiles = [
    "CCO",
    "c1ccccc1",
    # ...your dataset
]

splits = chalcedon.butina_split(
    smiles,
    fractions={"train": 0.8, "val": 0.1, "test": 0.1},
    cutoff=0.65,
    dtype="float32",  # or np.float32
)

train_smiles = splits["train"]
val_smiles = splits["val"]
test_smiles = splits["test"]
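
For readers unfamiliar with Butina clustering, the idea can be sketched in a few lines of plain Python. This is only an illustration of the general Taylor-Butina procedure, not Chalcedon's implementation: the helper names, the set-of-on-bits fingerprint representation, and the `sim_threshold` parameter are all assumptions made for the example, and it does not claim to match how Chalcedon interprets `cutoff`.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina_sketch(fps: list[set[int]], sim_threshold: float) -> list[int]:
    """Return one cluster id per fingerprint (hypothetical helper)."""
    n = len(fps)
    # Neighbor lists: every pair at or above the similarity threshold.
    neighbors: list[set[int]] = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fps[i], fps[j]) >= sim_threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)

    cluster_ids = [-1] * n
    # Visit candidates from most to fewest neighbors; ties broken by index.
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    next_id = 0
    for centroid in order:
        if cluster_ids[centroid] != -1:
            continue  # already absorbed into an earlier cluster
        cluster_ids[centroid] = next_id
        for j in neighbors[centroid]:
            if cluster_ids[j] == -1:
                cluster_ids[j] = next_id
        next_id += 1
    return cluster_ids


# Toy fingerprints: two groups of similar molecules and one singleton.
fps = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 4}, {10, 11}, {10, 11, 12}, {20}]
cluster_ids = butina_sketch(fps, sim_threshold=0.6)
```

This sketch builds the full O(n²) neighbor structure, so it is only suitable for small toy inputs.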

Using custom descriptors

We recommend dtype="float64" for non-binary descriptors, where dot-product magnitudes can exceed the range in which float32 represents values exactly (integers only up to 2^24).

import chalcedon

descriptors = my_descriptor_generator(molecules)  # numpy.ndarray of shape (n, d)

cluster_ids = chalcedon.butina_cluster(descriptors, cutoff=0.65, dtype="float64")
splits = chalcedon.greedy_cluster_split(
    cluster_ids,
    fractions={"train": 0.8, "val": 0.1, "test": 0.1},
)

train_indices = splits["train"]  # numpy.ndarray of indices into `descriptors`
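
To make the splitting step concrete, here is a minimal sketch of one way to assign whole clusters to splits greedily. It is an assumption about the general technique, not a description of what `greedy_cluster_split` does internally: the function name `greedy_split_sketch`, the largest-cluster-first ordering, and the "largest remaining deficit" rule are all choices made for this example.

```python
from collections import defaultdict


def greedy_split_sketch(
    cluster_ids: list[int], fractions: dict[str, float]
) -> dict[str, list[int]]:
    """Assign every cluster, as a whole, to exactly one split (hypothetical helper)."""
    # Group item indices by cluster id.
    members: dict[int, list[int]] = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    n = len(cluster_ids)
    targets = {name: frac * n for name, frac in fractions.items()}
    splits: dict[str, list[int]] = {name: [] for name in fractions}

    # Place the largest clusters first; ties broken by cluster id.
    for cid in sorted(members, key=lambda c: (-len(members[c]), c)):
        # Choose the split that is furthest below its target size.
        name = max(splits, key=lambda s: targets[s] - len(splits[s]))
        splits[name].extend(members[cid])
    return splits


cluster_ids = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3]
splits = greedy_split_sketch(
    cluster_ids, {"train": 0.6, "val": 0.2, "test": 0.2}
)
```

Because clusters are never divided, the resulting split sizes only approximate the requested fractions; in this toy input the train split ends up with 7 of 10 items rather than 6.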

pairwise_tanimoto(fingerprints) is also exposed if you want just the similarity matrix.
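
As a rough illustration of why dot products and dtype matter here, the sketch below computes a pairwise Tanimoto matrix from dot products and shows where float32 stops representing integers exactly. The generalized formula a·b / (a·a + b·b − a·b) and the helper names are assumptions for this example; they are not taken from Chalcedon's source, and the real `pairwise_tanimoto` may differ.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def tanimoto(a: list[float], b: list[float]) -> float:
    """Generalized Tanimoto: a.b / (a.a + b.b - a.b)."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return ab / denom if denom else 0.0


def pairwise_tanimoto_sketch(rows: list[list[float]]) -> list[list[float]]:
    """Dense n x n similarity matrix (hypothetical helper)."""
    return [[tanimoto(a, b) for b in rows] for a in rows]


# Binary fingerprints: every dot product is a small bit count.
fps = [[1, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 1]]
sim = pairwise_tanimoto_sketch(fps)

# float32 has a 24-bit significand, so 2**24 + 1 cannot be stored exactly
# and rounds back to 2**24.
exact_limit = 2**24
rounded = to_float32(exact_limit + 1)
```

For 0/1 vectors the formula reduces to the familiar shared-bits over union-of-bits ratio, and the dot products stay far below 2^24. With real-valued descriptors the sums can grow much larger, which is where float64 becomes the safer choice.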

Benchmarks

Chalcedon can quickly create Butina clusters of large chemical datasets on consumer hardware, with near-linear memory scaling.

See benchmarks/report.md for a detailed analysis of algorithm performance and benchmarks/ to reproduce results.

Citation

If you use Chalcedon in your research, please cite:

@software{chalcedon,
  title = {Chalcedon: Clustering and dataset splitting for chemical data},
  year = {2026},
  url = {https://github.com/rowansci/chalcedon}
}

Acknowledgements

RDKit for cheminformatics infrastructure and the CrystalFF torsion library (Riniker & Landrum, J. Chem. Inf. Model. 56, 2016)
GEOM dataset for the benchmark SMILES (Axelrod & Gomez-Bombarelli, Sci Data 9, 185, 2022)

This package was created with Cookiecutter and the jevandezande/uv-cookiecutter project template.
