coreset-sc

This repository provides an approximate spectral clustering algorithm that can scale far beyond the original algorithm, while still producing similar results.

This repo contains a minimal implementation of the Coreset Spectral Clustering (CSC) algorithm given in this paper.

The presented method repeatedly jumps across the equivalence between the normalised cut and weighted kernel k-means problems to apply coreset methods to spectral clustering.
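To make that equivalence concrete, here is a small numerical check. It is a sketch, not the code in this repository. It assumes the standard formulation in which each node is weighted by its degree and the kernel is K = D^-1 A D^-1; the function names and the toy graph are illustrative only. Under those assumptions, and for a graph without self-loops, the weighted kernel k-means cost differs from the normalised cut by the constant k, so both objectives rank partitions identically.

```python
def cluster_members(labels, c):
    """Indices of the nodes assigned to cluster c."""
    return [i for i, label in enumerate(labels) if label == c]


def normalised_cut(A, labels, k):
    """Sum over clusters of (weight leaving the cluster) / (cluster volume)."""
    deg = [sum(row) for row in A]
    total = 0.0
    for c in range(k):
        members = cluster_members(labels, c)
        volume = sum(deg[i] for i in members)
        cut = sum(A[i][j] for i in members
                  for j in range(len(A)) if labels[j] != c)
        total += cut / volume
    return total


def weighted_kernel_kmeans_cost(K, w, labels, k):
    """Weighted kernel k-means cost written purely in terms of kernel entries."""
    cost = sum(w[i] * K[i][i] for i in range(len(K)))
    for c in range(k):
        members = cluster_members(labels, c)
        weight = sum(w[i] for i in members)
        within = sum(w[i] * w[j] * K[i][j] for i in members for j in members)
        cost -= within / weight
    return cost


# Toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = [[0.0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = A[v][u] = 1.0

deg = [sum(row) for row in A]
# Assumed kernel: K = D^-1 A D^-1, with node weights equal to the degrees.
K = [[A[i][j] / (deg[i] * deg[j]) for j in range(n)] for i in range(n)]

good = [0, 0, 0, 1, 1, 1]  # one triangle per cluster
bad = [0, 1, 0, 1, 0, 1]   # splits both triangles
for labels in (good, bad):
    ncut = normalised_cut(A, labels, 2)
    cost = weighted_kernel_kmeans_cost(K, deg, labels, 2)
    # The two printed numbers agree: ncut == cost + k.
    print(labels, round(ncut, 6), round(cost + 2, 6))
```

Because the two objectives differ only by a constant, a coreset that approximately preserves the kernel k-means cost also approximately preserves the normalised cut.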

Combined with recent work on fast spectral clustering, this gives us a method for clustering very large graphs (millions of nodes) in seconds by only running (fast) spectral clustering on a much smaller induced subgraph. It can do so in even the most extreme case, where the number of clusters is linear in the number of nodes. See the experiments section of the paper.

Installation: pip install coreset-sc

Basic usage:

from coreset_sc import CoresetSpectralClustering, gen_sbm
from sklearn.metrics.cluster import adjusted_rand_score

# Generate a graph from the stochastic block model
n = 1000            # number of nodes per cluster
k = 50              # number of clusters
p = 0.5             # probability of an intra-cluster edge
q = (1.0 / n) / k   # probability of an inter-cluster edge


# A is a sparse scipy CSR matrix of a symmetric adjacency graph
A, ground_truth_labels = gen_sbm(n, k, p, q)

coreset_ratio = 0.1 # fraction of the data to use for the coreset graph

csc = CoresetSpectralClustering(
    num_clusters=k, coreset_ratio=coreset_ratio
)
csc.fit(A) # sample, extract, and cluster the coreset graph
csc.label_full_graph() # label the rest of the graph given the coreset labels
pred_labels = csc.labels_ # get the full labels

# Alternatively, label the full graph in one line:
pred_labels = csc.fit_predict(A)
ari = adjusted_rand_score(ground_truth_labels, pred_labels)

# Now we show how to use a custom graph clustering algorithm for the coreset graph:
from sklearn.cluster import SpectralClustering
csc = CoresetSpectralClustering(
    num_clusters=k,  # required
    coreset_ratio=coreset_ratio,
    # Optional parameters:
    k_over_sampling_factor=2.0,
    shift=0.01,
)

coreset_graph = csc.get_coreset_graph(A)

sc = SpectralClustering(
    n_clusters=k,
    affinity='precomputed',
    random_state=42,
)
coreset_labels = sc.fit_predict(coreset_graph)
csc.set_coreset_graph_labels(coreset_labels)

# Now label the full graph using the coreset labels
csc.label_full_graph()
pred_labels = csc.labels_
ari = adjusted_rand_score(ground_truth_labels, pred_labels)
print(ari)
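
For a sense of scale, the parameters in the example above imply the rough quantities computed below. This is back-of-the-envelope arithmetic under the usual stochastic block model definition (each possible edge sampled independently), not output from gen_sbm. The derived variable names are illustrative, and the coreset size assumes that coreset_ratio is applied to the total number of nodes.

```python
n = 1000            # number of nodes per cluster
k = 50              # number of clusters
p = 0.5             # probability of an intra-cluster edge
q = (1.0 / n) / k   # probability of an inter-cluster edge
coreset_ratio = 0.1

total_nodes = n * k
# Each node has n - 1 potential neighbours inside its own cluster.
expected_intra_degree = p * (n - 1)
# Each node has n * (k - 1) potential neighbours in the other clusters.
expected_inter_degree = q * n * (k - 1)
# Assumption: the coreset graph holds about this many nodes.
approx_coreset_nodes = round(coreset_ratio * total_nodes)

print(f"total nodes:            {total_nodes}")
print(f"expected intra degree:  {expected_intra_degree:.1f}")
print(f"expected inter degree:  {expected_inter_degree:.2f}")
print(f"approx. coreset nodes:  {approx_coreset_nodes}")
```

With these settings a node expects roughly 500 neighbours inside its cluster and about one outside it, so the planted clusters are well separated.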

Python Docs

BenJourdan/coreset-sc
