Yewang Chen, Junfeng Li, Shuyin Xia, Qinghong Lai, Xinbo Gao, Guoyin Wang, Dongdong Cheng, Yi Liu, Yi Wang
Granular-ball Skeleton Clustering (GBSK) is a scalable clustering algorithm designed for large-scale data. By constructing multi-grained granular-balls from sampled data, it approximates the underlying structure of data as a compact "skeleton," reducing computation while maintaining accuracy. With linear time complexity (O(n)), GBSK handles massive datasetsβup to 100M points in 256 dimension. The adaptive variant, AGBSK, simplifies parameter tuning for ease of use.
GBSK is currently implemented in MATLAB. The code requires MATLAB version R2021a or later.
-
Clone the repositoy:
git clone https://github.com/XFastDataLab/GBSK.git
-
Add the
GBSK/directory of the repository to your MATLAB path:addpath(genpath('path_to_GBSK'));
Run demo/demo1.m or demo/demo2.m. Run demo/ClusteringQualityEvaluation.py to get evaluation results.
-
Prepare your dataset in
.mator.txtformat, ensuring it contains a variable representing an$n \times d$ matrix (n instances, d feature). -
Place your dataset in the
datasets/directoy. -
Navigate to the
algorithms/directory in MATLAB. -
Run the main scrit:
main.mfor a dataset that does not exceed memory. Otherwise, edit and runmain_big.m.
AGBSK works well with default parameters, requiring only the cluster number k as input:
| Parameter | Description | Default Value |
|---|---|---|
| k | Number of clusters (root key balls representing centers) | User-specified |
| s | Number of sample sets | 30 |
| alpha | Sampling proportion | |
| M | Number of balls per sample set |
For expert users, GBSK offers four adjustable parameters with empirically validated ranges:
| Parameter | Recommended Range | Notes |
|---|---|---|
| s | 10-30 | Number of sample sets |
| alpha | Dataset-dependent (e.g. 0.01) | Ensure sufficient sampling coverage |
| k | Based on prior knowledge | Number of target clusters |
| M | k < M <= n | Number of balls per sample set |
Upon execution, results are stored in the experiment_records/ directory, organized by dataset name and parameter settings. Each run includes:
labels.txt: Cluster labels assigned to each data pont.aggRepBallCenters.txt: Centers of representative balls.keyBallCenters.txt: Centers of final key balls.log.txt: Detailed log of the run, including timing and parameter settigs.
GBSK is designed for efficiency and scalabiity:
-
Time Complexity:
$O(n)$ , significantly reduced compared to traditional clustering methods on large dataets. - Scalability: Capable of handling datasets with millions of instances.
- Robustness: Effective in the presence of noise and outliers due to multi-sampling and density-based techniques.
GBSK has been tested on diverse synthetic and real-world datasets:
| Dataset | Instances | Dims | Classes | Type | Size | Usage | Label | Description |
|---|---|---|---|---|---|---|---|---|
| Chainlink | 1,000 | 3 | 2 | S | 20KB | Q | Y | Two interlocking 3D rings |
| Twenty | 1,000 | 2 | 20 | S | 17KB | Q | Y | 20 evenly distributed clusters |
| SYN1 | 2,000 | 2 | 5 | S | 30KB | Q | Y | Varying cluster densities |
| Segmentation | 2,310 | 18 | 7 | R | 246KB | Q | Y | Image segmentation data |
| SYN3 | 3,000 | 2 | 2 | S | 4KB | Q | Y | Mixed density clusters |
| EngyTime | 4,096 | 2 | 2 | S | 76KB | Q | Y | Gaussian distributions with overlap |
| Waveform | 5,000 | 21 | 3 | R | 520KB | Q | Y | Physics waveform data |
| S3 | 5,000 | 2 | 15 | S | 136KB | Q | Y | 15 Gaussian clusters |
| SYN2 | 5,800 | 2 | 7 | S | 7KB | Q | Y | Non-convex shapes |
| Pendigits | 10,992 | 16 | 10 | R | 685KB | Q,S | Y | Handwritten digits |
| DryBean | 13,611 | 16 | 7 | R | 3.6MB | Q,S | Y | Bean classification |
| CIFAR-10 | 60,000 | 3,072 | 10 | R | 174MB | Q,S | Y | Image classification |
| MNIST | 70,000 | 784 | 10 | R | 121MB | Q,S | Y | Handwritten digits |
| MoCap | 78,095 | 36 | 5 | R | 43MB | Q,S | Y | Motion capture data |
| CoverType | 581K | 54 | 7 | R | 71MB | Q,S | Y | Forest cover types |
| 3M2D5 | 3M | 2 | 5 | S | 54MB | Q,S | Y | Large Gaussian mixtures |
| N-BaIoT | 7M | 115 | 9 | R | 7.6GB | S | N | IoT malware traffic |
| MNIST8M | 8.1M | 784 | 10 | R | 5.9GB | Q,S | Y | Infinite MNIST |
| AGC100M | 100M | 256 | 17 | S | 95GB | Q,S | Y | Ultra-large-scale benchmark |
- Type: S = Synthetic, R = Real-world
- Usage: Q = Quality evaluation, S = Speed testing
- Label: Y = Labeled, N = Unlabeled
This project is licensed under the GNU General Public License v3.0.
Developed and maintained by XFastDataLab. For questions or collaborations, please contact MarveenLee.
Cite our work if you use GBSK or the AGC100M dataset!
@article{chen2025GBSK,
title={GBSK: Skeleton Clustering via Granular-Ball Computing and Multi-Sampling for Large-Scale Data},
author={Yewang Chen and Junfeng Li and Shuyin Xia and Qinghong Lai and Xinbo Gao and Guoyin Wang and Dongdong Cheng and Yi Liu and Yi Wang},
year={2025},
eprint={2509.23742},
archivePrefix={arXiv},
primaryClass={cs.LG}
}For more information and updates, visit the GBSK GitHub repository.



