K-means Clustering using Parallel Programming

This project aims to perform K-means clustering computation using different parallel programming paradigms.

Introduction

K-means clustering is an unsupervised learning algorithm that groups data points into K clusters so that points in the same cluster are "similar." In this project, we define similarity using Euclidean distance.

K-means is often described as embarrassingly parallel because the dominant work is calculating the distance between points, and that work is independent for each point.

Computation and Time Complexity

Let $N$ be the number of input points, $K$ be the number of centroids, and $D$ be the dimensions (features) of each point.

In this project, we primarily consider the case where $N$ is the dominant factor.

Assumptions about the order of magnitude:

$N$: $10^7$
$K$: $10^1$
$D$: $10^1$

Therefore, we can effectively treat $K$ and $D$ as "constant" in the time complexity analysis.

1. Initialization

At initialization, the algorithm chooses $K$ random centroids from the $N$ given points.

2. Points Assignment

For each point, we compute its distance to each centroid and assign it to the nearest one. This requires $K$ distance calculations per point, where each distance calculation performs $O(D)$ arithmetic operations. Therefore, the total computation is $O(NKD)$.

However, because the distance computation is independent across points, we can easily scale the program by distributing $N/M$ points to $M$ threads to work concurrently. In theory, if we have infinite threads, we can reduce the time down to effectively $O(KD)$.

3. Accumulate

After a point is assigned to a cluster, we accumulate the sum (across all dimensions) and count for that cluster, which is later used to compute the mean. This requires $O(D)$ arithmetic operations for each point, so the total time is $O(ND)$. The accumulation requires some synchronization on the sum and count but can be performed concurrently as points are assigned.

4. Recompute Centroid

Then we update the centroid of each cluster by computing the mean of all points in that cluster. This requires a total of $O(KD)$ operations, which is dominated by distance computation.

Because this part depends on the full results of all points from the previous step, and because $K$ and $D$ are assumed to be small constants, we do not parallelize this part.

5. Iteration

After we compute the new centroids, we loop back to Step 2: Points Assignment and reassign points based on the new centroids. We stop the algorithm only if:

Maximum iterations are reached (default: 100) or
The centroids converge

Parallel Programming Design

There are 4 implementations in total. The Work-stealing refinement is the final version for this project, which is built on top of the MapReduce version.

Sequential

A sequential version is implemented as the baseline for benchmarking.

BSP (with Lock)

We implemented a preliminary BSP version initially. However, because we used a locked data structure, the performance was very poor. We then improved the system with a MapReduce pattern.

MapReduce

The MapReduce pattern is a natural fit for K-means clustering computation.

The implementation consists of:

a main thread
M map worker threads
a reducer thread

The workflow of the program:

Initialize (main thread): N points to process, K centroids (randomly generated at initialization).
Distribute (main thread): Define job = (start, end): a chunk of points to be processed. Each worker receives a chunk of exactly ceil(N/M) points (except the last one, which gets the leftovers), which is close to perfectly balanced workload distribution.
Map (map worker thread): Each worker maps points to the nearest centroid and accumulates the sum and count for each centroid for average calculation.
Shuffle & Reduce (reducer thread): The reducer groups and reduces sum and count by centroid ID, and computes the average as the new centroid.
Iterate (main thread): Use the new centroid to re-compute (go back to Distribute step). Stop upon convergence or reaching max iteration.

Design Justification

The MapReduce framework parallelizes the computational hotspot in the Map step, where each chunk of points can be calculated independently without communication with other threads during processing. The only required communication is input (i.e., jobs) and output (i.e., results). This design decouples thread dependencies and minimizes communication overhead and the likelihood of errors.

The main challenge is synchronization between the Map and Reduce stages. The reducer must receive all results to compute means for the new centroids, and the mappers must wait for the new centroids to start (or stop) work for the next iteration.

We utilize channels for inter-thread communication and synchronization. The main thread distributes work via a job channel; mappers use a result channel to send results to the reducer; the reducer uses two channels, done and converge, to signal the main thread when it has finished reducing for the current iteration or reached convergence, respectively; and the main thread closes the job and result channels upon termination to prevent goroutine leaks.

The reasons for this design are:

Lock-free: Using channels eliminates the need for mutex locks on shared data structures. This significantly improves performance compared with the BSP lock approach.
Wait-free: As soon as a piece of data is sent to a channel, a thread listening on the other end (if available) can immediately pick it up and continue without waiting. For example, when a mapper finishes its work, it can immediately send results to the reducer. In a BSP-like pattern, we would need to wait for all mappers to finish before the reducer can start, or implement more complex synchronization.
Race-free: Using channels avoids data races on shared data structures.
Lightweight: We implement channels to pass pointers to structs so that the data being sent is minimal and lightweight.

Fan-out / Fan-in

The design also naturally exhibits a fan-out / fan-in communication pattern. The main thread fans out by distributing jobs to all worker goroutines through a shared job channel (one sender, many receivers). The workers then fan in by sending their partial results back through a shared result channel (many senders, one receiver — the reducer). This structure keeps coordination minimal: each goroutine only needs to interact with a channel, not with other goroutines directly.

Work-stealing Refinement

We extend the MapReduce design with a work-stealing deque implementation (wsdeque).

WSDeque

The wsdeque is a lock-free, double-ended queue that stores jobs (i.e., start and end indices). The top index is an atomic.Uint64 where the upper 32 bits are used for stamps and the lower 32 bits are used for indexing.

To facilitate the work-stealing design, we further reduce workload granularity. We use a constant to define the grain size of a single job. Based on the assumption that the input size is on the order of $N=10^7$, we adopt a grain size of 1000. This is a balanced choice between work-distribution overhead and the effectiveness of work stealing.

Barrier

We also implemented a reusable two-phase reverse-sense barrier for work-stealing mode. The main thread calls BroadcastReady() to start an iteration, each worker blocks on WaitForReady(expect) until that signal arrives, and then calls SignalDone() after finishing all available tasks (including steals). The main thread waits with WaitForDone() until all workers report completion, which gives a clean per-iteration boundary before the reducer's output is used for the next round. Internally, the barrier uses sync.Mutex + sync.Cond, a toggled ready flag (0/1) to avoid missed wakeups across iterations, and a Teardown() broadcast so workers can exit safely when convergence is reached.

Communication

One major challenge of the work-stealing refinement is that, instead of each worker handling one chunk of data, we must now handle results from a potentially large number of jobs.

Therefore, we adapted workers to send one result per job through the fan-in channel so the reducer can start working as soon as any job is completed by any worker. The reducer tracks the count of job results and stops once all expected results are received.

This communication pattern allows some concurrency in the Reduce stage, but recomputation of the new centroids still depends on the full result and thus remains sequential.

Benchmarking

We benchmarked the four parallel implementations on UChicago's Linux cluster. For each parallel mode, we benchmarked with threads = {1,2,4,8,12}. Each trial was run 5 times, and the average time was computed.

Instructions

Data Generation

From the project root:

(Optional) Create a virtual environment:

python -m venv venv
source venv/bin/activate

Install libraries and run the data generator.

pip install -r requirements.txt
# or manually install: numpy sklearn
cd proj3/data
python generator.py

This will generate data used in the benchmarking in data/input and data/answer directories.

The data is generated using the make_blobs from sklearn.datasets library.

See sample plot: data/generator.ipynb.

Testing

For K-means clustering output, stay in data/ directory:

go test -v

Note: Due to randomized initial centroids, some results may differ slightly. Therefore, the test includes a configurable parameter, minAccuracy, with a default value of 0.95. A test is considered passed if the result correctly clusters at least 95% of points.

For the wsdeque implementation:

# From project root
cd proj3/wsdeque
go test -v

Benchmark

From the project root:

cd proj3/benchmark
chmod +x benchmark.sh
./benchmark.sh

This submits all jobs on Slurm. Each job has a given mode and thread count. The sequential mode runs only once. Note that the BSP (with lock) mode is consistently slower than the sequential version and times out for larger inputs.

Run

To simply run the program from the project root:

cd proj3/
# go run main/main.go <input_csv_file> <mode> <k> <dimension> <threads> [max-iteration=100]
# For example:
go run main/main.go data/input/medium_c2.csv w 2 2 8

Currently the only output is the duration of K-means clustering computation. The result can be easily printed by uncomment the two lines in main.go's main function.

Mode definition:

Mode	Meaning
s	Sequential
b	BSP with lock
m	MapReduce
w	Work-stealing

Performance Analysis

For full sets of output graphs and tables, please see benchmark/analyze.md or benchmark/analyze.ipynb.

Graph for the average speedup of different modes:

Speedup

Speedup definition:

$$\text{speedup} = \frac{T_{sequential}}{T_{parallel}}$$

Speedup = 1 → same as sequential
Speedup > 1 → faster than sequential
Speedup < 1 → slower than sequential

The BSP implementation performs consistently and significantly worse than the sequential baseline, and performance degrades further as the number of threads increases. This is caused by heavy lock contention on shared data structures, which introduces substantial synchronization overhead.

The MapReduce implementation performs substantially better. With one or two threads, the overhead of goroutines and channel communication offsets most of the parallel benefit. However, as the number of threads increases, the implementation achieves steady speedup. At 12 threads, the mean speedup reaches approximately 3.2×, indicating effective but sub-linear scaling.

The work-stealing refinement further improves performance. By splitting work into smaller granularity and allowing idle workers to steal tasks from others, the implementation achieves better load balancing. As a result, work-stealing consistently outperforms the standard MapReduce version, reaching a mean speedup of approximately 4.4× at 12 threads.

By applying Amdahl's Law with $n=12$, we can calculate that $p=0.75$ for MapReduce and $p=0.84$ for work-stealing. This finding shows that the MapReduce design successfully parallelize the majority of the program.

Parallel Overhead

It is also worth noting that at threads = 1, the MapReduce implementation achieves an average speedup very close to 1, indicating that the overhead introduced by the parallel framework is minimal, costing less than about 1% of performance on average. In contrast, the work-stealing refinement achieves an average speedup of approximately 0.8×, indicating noticeably higher overhead compared to the standard MapReduce implementation.

This difference primarily arises from the additional synchronization mechanisms required by the work-stealing design. The work-stealing deque relies on atomic operations to coordinate concurrent access, and the barrier implementation uses condition variables that may trigger goroutine context switches during synchronization.

While these mechanisms improve robustness and enable better load balancing at higher thread counts, they introduce additional runtime overhead. By comparison, the standard MapReduce implementation relies mostly on channels for inter-goroutine communication and coordination, which provides a relatively lightweight synchronization mechanism and results in lower overhead in the single-thread configuration.

Computation Bottleneck

The most significant bottleneck in the program is distance computation over $N$ points, which is parallelized in these implementations.

There are sequential bottlenecks in the Distribute and Reduce steps, but the amount of work is relatively small compared with the hotspot in the Map step. This conclusion is supported by timing different section of the program. Here are a sample output for work-stealing, the proportion remain consistent across multiple runs:

Set up: 3.275424ms
Reach covergence at iter: 3
Map duration: 91.449725ms
Difference between Map and Reduce duration: 17.194µs
Total duration of MapReduce: 91.693455ms
Construct clusters duration: 1.337µs
112 ms

We observe that more than 80% of the duration is spent on the Map stage.

Notably, the gap between the Map and Reduce, i.e. the wait time for reducer, is insignificant compared to the total duration, indicating a big success in the using the result channel to essentially "stream" the results from multiple worker to the reducer.

Memory Bottleneck

Even though the computation is parallelizable, performance can also be limited by memory rather than CPU, because each thread repeatedly reads the coordinates of millions of points to perform computation.

Let's assume we have $10^7$ points, each with 10 features (dimensions), each feature is a float64, which is 8 bytes. Then the total size of the point array is

$$10^7 * 10 * 8 = 8 * 10^8\text{bytes}=800\text{ MB}$$

This working set is much larger than the typical CPU cache capacity, so the program must frequently access main memory. As the number of threads increases, aggregate memory traffic also increases. Once the available memory bandwidth is saturated, adding more threads no longer yields proportional speedup. This is one likely reason that the MapReduce and work-stealing implementations exhibit clear but sub-linear scaling.

In addition, large shared-memory machines may introduce further penalties due to NUMA effects, cache-coherence traffic, and cache-capacity misses. These effects can increase memory latency and reduce effective throughput per thread, especially when multiple threads access large data structures concurrently.

One mitigation used in this implementation is to assign contiguous chunks of the point array to each worker. This improves spatial locality and may reduce cache misses and unnecessary memory traffic. However, because the dataset remains much larger than cache, the benefit is limited.

Future Work

GPU Acceleration

The distance computation consists of very simple arithmetic operations repeated millions of times, which is a natural fit for GPU optimization. For $N=10^7$ points, we can launch $10^7$ threads on a GPU grid, and each thread only needs to compute one point. The main challenges are reduction and memory access.

Parallelize Centroid Recomputation

In the current implementation, centroids are recomputed sequentially because $K$ and $D$ are relatively small in the target setting. Future work can target high-dimensional data or a large number of clusters by parallelizing this part to reduce the sequential bottleneck.

Improve Memory Locality

Because K-means over large datasets can become memory-bound, future work can explore NUMA-aware placement and memory-locality optimizations. Examples include pinning worker threads to cores, allocating data closer to the threads that access it, and partitioning the dataset in a way that reduces remote memory access.

Grain Size

The work-stealing implementation currently uses a fixed grain size of 1000 points per task. Future work can study adaptive grain-size tuning based on dataset size, thread count, or observed runtime behavior. A better grain size may reduce scheduling overhead while still preserving the load-balancing benefit of stealing.

Benchmarking

This project mainly targets the case where $N$ is very large while $K$ and $D$ remain relatively small. Future work can evaluate performance under other regimes, such as larger $K$, larger $D$, skewed cluster distributions, or datasets with uneven computational cost per point, to better understand when each parallel design is most effective.

Accuracy

The current implementation selects random initial centroids. Future work can incorporate smarter initialization methods such as K-means++, which may reduce the number of iterations required for convergence and improve both clustering accuracy and runtime. Multiple restarts can also be incorporated to minimize error and improve accuracy.

Conclusion

This project implemented and compared four versions of K-means clustering: sequential, BSP with locks, MapReduce, and work-stealing.

The results show that parallel performance depends heavily on synchronization design. The BSP version performed worst because lock contention introduced high overhead, while the MapReduce version achieved clear speedup by reducing shared-state contention.

The work-stealing refinement gave the best overall performance by improving load balance across workers. At 12 threads, it achieved the highest average speedup, about 4.4× over the sequential baseline. Overall, this project shows that K-means can benefit significantly from parallelism, but scalability is still limited by synchronization overhead and memory access costs.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
proj3		proj3
.gitignore		.gitignore
README.md		README.md
diagram.png		diagram.png
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

K-means Clustering using Parallel Programming

Introduction

Computation and Time Complexity

1. Initialization

2. Points Assignment

3. Accumulate

4. Recompute Centroid

5. Iteration

Parallel Programming Design

Sequential

BSP (with Lock)

MapReduce

Design Justification

Fan-out / Fan-in

Work-stealing Refinement

WSDeque

Barrier

Communication

Benchmarking

Instructions

Data Generation

Testing

Benchmark

Run

Performance Analysis

Speedup

Parallel Overhead

Computation Bottleneck

Memory Bottleneck

Future Work

GPU Acceleration

Parallelize Centroid Recomputation

Improve Memory Locality

Grain Size

Benchmarking

Accuracy

Conclusion

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages