Skip to content

GBSK: Skeleton Clustering via Granular-ball Computing and Multi-Sampling for Large-Scale Complex Data

License

Notifications You must be signed in to change notification settings

XFastDataLab/GBSK

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

42 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

GBSK: Skeleton Clustering via
Granular-Ball Computing and Multi-Sampling
for Large-Scale Data

arXiv Demo Datasets BibTeX

Yewang Chen, Junfeng Li, Shuyin Xia, Qinghong Lai, Xinbo Gao, Guoyin Wang, Dongdong Cheng, Yi Liu, Yi Wang

Granular-ball Skeleton Clustering (GBSK) is a scalable clustering algorithm designed for large-scale data. By constructing multi-grained granular-balls from sampled data, it approximates the underlying structure of data as a compact "skeleton," reducing computation while maintaining accuracy. With linear time complexity (O(n)), GBSK handles massive datasetsβ€”up to 100M points in 256 dimension. The adaptive variant, AGBSK, simplifies parameter tuning for ease of use.

e.g. SYN2 e.g. ChainLink

GBSK framework

πŸš€ Getting Started

Prerequisites

GBSK is currently implemented in MATLAB. The code requires MATLAB version R2021a or later.

Installation

  1. Clone the repositoy:

    git clone https://github.com/XFastDataLab/GBSK.git
  2. Add the GBSK/ directory of the repository to your MATLAB path:

    addpath(genpath('path_to_GBSK'));

Demo

Run demo/demo1.m or demo/demo2.m. Run demo/ClusteringQualityEvaluation.py to get evaluation results.

βš™οΈ Usage

Basic Workflow

  1. Prepare your dataset in .mat or .txt format, ensuring it contains a variable representing an $n \times d$ matrix (n instances, d feature).

  2. Place your dataset in the datasets/ directoy.

  3. Navigate to the algorithms/ directory in MATLAB.

  4. Run the main scrit: main.m for a dataset that does not exceed memory. Otherwise, edit and run main_big.m.

AGBSK: Recommended Configuration

AGBSK works well with default parameters, requiring only the cluster number k as input:

Parameter Description Default Value
k Number of clusters (root key balls representing centers) User-specified
s Number of sample sets 30
alpha Sampling proportion $\frac{1}{\sqrt{n}}$
M Number of balls per sample set $10 \times k$

GBSK: Advanced Configuration

For expert users, GBSK offers four adjustable parameters with empirically validated ranges:

Parameter Recommended Range Notes
s 10-30 Number of sample sets
alpha Dataset-dependent (e.g. 0.01) Ensure sufficient sampling coverage
k Based on prior knowledge Number of target clusters
M k < M <= n Number of balls per sample set

πŸ“Š Output

Upon execution, results are stored in the experiment_records/ directory, organized by dataset name and parameter settings. Each run includes:

  • labels.txt: Cluster labels assigned to each data pont.
  • aggRepBallCenters.txt: Centers of representative balls.
  • keyBallCenters.txt: Centers of final key balls.
  • log.txt: Detailed log of the run, including timing and parameter settigs.

πŸ“ˆ Performace

GBSK is designed for efficiency and scalabiity:

  • Time Complexity: $O(n)$, significantly reduced compared to traditional clustering methods on large dataets.
  • Scalability: Capable of handling datasets with millions of instances.
  • Robustness: Effective in the presence of noise and outliers due to multi-sampling and density-based techniques.

Time scaling comparison of algorithms

πŸ“ Datasets

GBSK has been tested on diverse synthetic and real-world datasets:

Dataset Instances Dims Classes Type Size Usage Label Description
Chainlink 1,000 3 2 S 20KB Q Y Two interlocking 3D rings
Twenty 1,000 2 20 S 17KB Q Y 20 evenly distributed clusters
SYN1 2,000 2 5 S 30KB Q Y Varying cluster densities
Segmentation 2,310 18 7 R 246KB Q Y Image segmentation data
SYN3 3,000 2 2 S 4KB Q Y Mixed density clusters
EngyTime 4,096 2 2 S 76KB Q Y Gaussian distributions with overlap
Waveform 5,000 21 3 R 520KB Q Y Physics waveform data
S3 5,000 2 15 S 136KB Q Y 15 Gaussian clusters
SYN2 5,800 2 7 S 7KB Q Y Non-convex shapes
Pendigits 10,992 16 10 R 685KB Q,S Y Handwritten digits
DryBean 13,611 16 7 R 3.6MB Q,S Y Bean classification
CIFAR-10 60,000 3,072 10 R 174MB Q,S Y Image classification
MNIST 70,000 784 10 R 121MB Q,S Y Handwritten digits
MoCap 78,095 36 5 R 43MB Q,S Y Motion capture data
CoverType 581K 54 7 R 71MB Q,S Y Forest cover types
3M2D5 3M 2 5 S 54MB Q,S Y Large Gaussian mixtures
N-BaIoT 7M 115 9 R 7.6GB S N IoT malware traffic
MNIST8M 8.1M 784 10 R 5.9GB Q,S Y Infinite MNIST
AGC100M 100M 256 17 S 95GB Q,S Y Ultra-large-scale benchmark
  • Type: S = Synthetic, R = Real-world
  • Usage: Q = Quality evaluation, S = Speed testing
  • Label: Y = Labeled, N = Unlabeled

πŸ“„ Licnse

This project is licensed under the GNU General Public License v3.0.

🀝 Acknowledgents

Developed and maintained by XFastDataLab. For questions or collaborations, please contact MarveenLee.

πŸ™ Citing GBSK

Cite our work if you use GBSK or the AGC100M dataset!

@article{chen2025GBSK,
  title={GBSK: Skeleton Clustering via Granular-Ball Computing and Multi-Sampling for Large-Scale Data},
  author={Yewang Chen and Junfeng Li and Shuyin Xia and Qinghong Lai and Xinbo Gao and Guoyin Wang and Dongdong Cheng and Yi Liu and Yi Wang},
  year={2025},
  eprint={2509.23742},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

For more information and updates, visit the GBSK GitHub repository.


About

GBSK: Skeleton Clustering via Granular-ball Computing and Multi-Sampling for Large-Scale Complex Data

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors