This project aims to perform K-means clustering computation using different parallel programming paradigms.
K-means clustering is an unsupervised learning algorithm that groups data points into K clusters so that points in the same cluster are "similar." In this project, we define similarity using Euclidean distance.
K-means is often described as embarrassingly parallel because the dominant work is calculating the distance between points, and that work is independent for each point.
Let
In this project, we primarily consider the case where
Assumptions about the order of magnitude:
-
$N$ :$10^7$ -
$K$ :$10^1$ -
$D$ :$10^1$
Therefore, we can effectively treat
At initialization, the algorithm chooses
For each point, we compute its distance to each centroid and assign it to the nearest one. This requires
However, because the distance computation is independent across points, we can easily scale the program by distributing
After a point is assigned to a cluster, we accumulate the sum (across all dimensions) and count for that cluster, which is later used to compute the mean. This requires
Then we update the centroid of each cluster by computing the mean of all points in that cluster. This requires a total of
Because this part depends on the full results of all points from the previous step, and because
After we compute the new centroids, we loop back to Step 2: Points Assignment and reassign points based on the new centroids. We stop the algorithm only if:
- Maximum iterations are reached (default: 100) or
- The centroids converge
There are 4 implementations in total. The Work-stealing refinement is the final version for this project, which is built on top of the MapReduce version.
A sequential version is implemented as the baseline for benchmarking.
We implemented a preliminary BSP version initially. However, because we used a locked data structure, the performance was very poor. We then improved the system with a MapReduce pattern.
The MapReduce pattern is a natural fit for K-means clustering computation.
The implementation consists of:
- a main thread
- M map worker threads
- a reducer thread
The workflow of the program:
- Initialize (main thread): N points to process, K centroids (randomly generated at initialization).
- Distribute (main thread): Define
job = (start, end): a chunk of points to be processed. Each worker receives a chunk of exactlyceil(N/M)points (except the last one, which gets the leftovers), which is close to perfectly balanced workload distribution. - Map (map worker thread): Each worker maps points to the nearest centroid and accumulates the sum and count for each centroid for average calculation.
- Shuffle & Reduce (reducer thread): The reducer groups and reduces sum and count by centroid ID, and computes the average as the new centroid.
- Iterate (main thread): Use the new centroid to re-compute (go back to Distribute step). Stop upon convergence or reaching max iteration.
The MapReduce framework parallelizes the computational hotspot in the Map step, where each chunk of points can be calculated independently without communication with other threads during processing. The only required communication is input (i.e., jobs) and output (i.e., results). This design decouples thread dependencies and minimizes communication overhead and the likelihood of errors.
The main challenge is synchronization between the Map and Reduce stages. The reducer must receive all results to compute means for the new centroids, and the mappers must wait for the new centroids to start (or stop) work for the next iteration.
We utilize channels for inter-thread communication and synchronization. The main thread distributes work via a job channel; mappers use a result channel to send results to the reducer; the reducer uses two channels, done and converge, to signal the main thread when it has finished reducing for the current iteration or reached convergence, respectively; and the main thread closes the job and result channels upon termination to prevent goroutine leaks.
The reasons for this design are:
- Lock-free: Using channels eliminates the need for mutex locks on shared data structures. This significantly improves performance compared with the BSP lock approach.
- Wait-free: As soon as a piece of data is sent to a channel, a thread listening on the other end (if available) can immediately pick it up and continue without waiting. For example, when a mapper finishes its work, it can immediately send results to the reducer. In a BSP-like pattern, we would need to wait for all mappers to finish before the reducer can start, or implement more complex synchronization.
- Race-free: Using channels avoids data races on shared data structures.
- Lightweight: We implement channels to pass pointers to structs so that the data being sent is minimal and lightweight.
The design also naturally exhibits a fan-out / fan-in communication pattern. The main thread fans out by distributing jobs to all worker goroutines through a shared job channel (one sender, many receivers). The workers then fan in by sending their partial results back through a shared result channel (many senders, one receiver — the reducer). This structure keeps coordination minimal: each goroutine only needs to interact with a channel, not with other goroutines directly.
We extend the MapReduce design with a work-stealing deque implementation (wsdeque).
The wsdeque is a lock-free, double-ended queue that stores jobs (i.e., start and end indices).
The top index is an atomic.Uint64 where the upper 32 bits are used for stamps and the lower 32 bits are used for indexing.
To facilitate the work-stealing design, we further reduce workload granularity. We use a constant to define the grain size of a single job. Based on the assumption that the input size is on the order of
We also implemented a reusable two-phase reverse-sense barrier for work-stealing mode. The main thread calls BroadcastReady() to start an iteration, each worker blocks on WaitForReady(expect) until that signal arrives, and then calls SignalDone() after finishing all available tasks (including steals). The main thread waits with WaitForDone() until all workers report completion, which gives a clean per-iteration boundary before the reducer's output is used for the next round. Internally, the barrier uses sync.Mutex + sync.Cond, a toggled ready flag (0/1) to avoid missed wakeups across iterations, and a Teardown() broadcast so workers can exit safely when convergence is reached.
One major challenge of the work-stealing refinement is that, instead of each worker handling one chunk of data, we must now handle results from a potentially large number of jobs.
Therefore, we adapted workers to send one result per job through the fan-in channel so the reducer can start working as soon as any job is completed by any worker. The reducer tracks the count of job results and stops once all expected results are received.
This communication pattern allows some concurrency in the Reduce stage, but recomputation of the new centroids still depends on the full result and thus remains sequential.
We benchmarked the four parallel implementations on UChicago's Linux cluster. For each parallel mode, we benchmarked with threads = {1,2,4,8,12}. Each trial was run 5 times, and the average time was computed.
From the project root:
(Optional) Create a virtual environment:
python -m venv venv
source venv/bin/activate
Install libraries and run the data generator.
pip install -r requirements.txt
# or manually install: numpy sklearn
cd proj3/data
python generator.py
This will generate data used in the benchmarking in data/input and data/answer directories.
The data is generated using the make_blobs from sklearn.datasets library.
See sample plot: data/generator.ipynb.
For K-means clustering output, stay in data/ directory:
go test -v
Note: Due to randomized initial centroids, some results may differ slightly. Therefore, the test includes a configurable parameter, minAccuracy, with a default value of 0.95. A test is considered passed if the result correctly clusters at least 95% of points.
For the wsdeque implementation:
# From project root
cd proj3/wsdeque
go test -v
From the project root:
cd proj3/benchmark
chmod +x benchmark.sh
./benchmark.sh
This submits all jobs on Slurm. Each job has a given mode and thread count. The sequential mode runs only once. Note that the BSP (with lock) mode is consistently slower than the sequential version and times out for larger inputs.
To simply run the program from the project root:
cd proj3/
# go run main/main.go <input_csv_file> <mode> <k> <dimension> <threads> [max-iteration=100]
# For example:
go run main/main.go data/input/medium_c2.csv w 2 2 8
Currently the only output is the duration of K-means clustering computation. The result can be easily printed by uncomment the two lines in main.go's main function.
Mode definition:
| Mode | Meaning |
|---|---|
| s | Sequential |
| b | BSP with lock |
| m | MapReduce |
| w | Work-stealing |
For full sets of output graphs and tables, please see benchmark/analyze.md or benchmark/analyze.ipynb.
Graph for the average speedup of different modes:
Speedup definition:
- Speedup = 1 → same as sequential
- Speedup > 1 → faster than sequential
- Speedup < 1 → slower than sequential
The BSP implementation performs consistently and significantly worse than the sequential baseline, and performance degrades further as the number of threads increases. This is caused by heavy lock contention on shared data structures, which introduces substantial synchronization overhead.
The MapReduce implementation performs substantially better. With one or two threads, the overhead of goroutines and channel communication offsets most of the parallel benefit. However, as the number of threads increases, the implementation achieves steady speedup. At 12 threads, the mean speedup reaches approximately 3.2×, indicating effective but sub-linear scaling.
The work-stealing refinement further improves performance. By splitting work into smaller granularity and allowing idle workers to steal tasks from others, the implementation achieves better load balancing. As a result, work-stealing consistently outperforms the standard MapReduce version, reaching a mean speedup of approximately 4.4× at 12 threads.
By applying Amdahl's Law with
It is also worth noting that at threads = 1, the MapReduce implementation achieves an average speedup very close to 1, indicating that the overhead introduced by the parallel framework is minimal, costing less than about 1% of performance on average. In contrast, the work-stealing refinement achieves an average speedup of approximately 0.8×, indicating noticeably higher overhead compared to the standard MapReduce implementation.
This difference primarily arises from the additional synchronization mechanisms required by the work-stealing design. The work-stealing deque relies on atomic operations to coordinate concurrent access, and the barrier implementation uses condition variables that may trigger goroutine context switches during synchronization.
While these mechanisms improve robustness and enable better load balancing at higher thread counts, they introduce additional runtime overhead. By comparison, the standard MapReduce implementation relies mostly on channels for inter-goroutine communication and coordination, which provides a relatively lightweight synchronization mechanism and results in lower overhead in the single-thread configuration.
The most significant bottleneck in the program is distance computation over
There are sequential bottlenecks in the Distribute and Reduce steps, but the amount of work is relatively small compared with the hotspot in the Map step. This conclusion is supported by timing different section of the program. Here are a sample output for work-stealing, the proportion remain consistent across multiple runs:
Set up: 3.275424ms
Reach covergence at iter: 3
Map duration: 91.449725ms
Difference between Map and Reduce duration: 17.194µs
Total duration of MapReduce: 91.693455ms
Construct clusters duration: 1.337µs
112 ms
We observe that more than 80% of the duration is spent on the Map stage.
Notably, the gap between the Map and Reduce, i.e. the wait time for reducer, is insignificant compared to the total duration, indicating a big success in the using the result channel to essentially "stream" the results from multiple worker to the reducer.
Even though the computation is parallelizable, performance can also be limited by memory rather than CPU, because each thread repeatedly reads the coordinates of millions of points to perform computation.
Let's assume we have float64, which is 8 bytes. Then the total size of the point array is
This working set is much larger than the typical CPU cache capacity, so the program must frequently access main memory. As the number of threads increases, aggregate memory traffic also increases. Once the available memory bandwidth is saturated, adding more threads no longer yields proportional speedup. This is one likely reason that the MapReduce and work-stealing implementations exhibit clear but sub-linear scaling.
In addition, large shared-memory machines may introduce further penalties due to NUMA effects, cache-coherence traffic, and cache-capacity misses. These effects can increase memory latency and reduce effective throughput per thread, especially when multiple threads access large data structures concurrently.
One mitigation used in this implementation is to assign contiguous chunks of the point array to each worker. This improves spatial locality and may reduce cache misses and unnecessary memory traffic. However, because the dataset remains much larger than cache, the benefit is limited.
The distance computation consists of very simple arithmetic operations repeated millions of times, which is a natural fit for GPU optimization. For
In the current implementation, centroids are recomputed sequentially because
Because K-means over large datasets can become memory-bound, future work can explore NUMA-aware placement and memory-locality optimizations. Examples include pinning worker threads to cores, allocating data closer to the threads that access it, and partitioning the dataset in a way that reduces remote memory access.
The work-stealing implementation currently uses a fixed grain size of 1000 points per task. Future work can study adaptive grain-size tuning based on dataset size, thread count, or observed runtime behavior. A better grain size may reduce scheduling overhead while still preserving the load-balancing benefit of stealing.
This project mainly targets the case where
The current implementation selects random initial centroids. Future work can incorporate smarter initialization methods such as K-means++, which may reduce the number of iterations required for convergence and improve both clustering accuracy and runtime. Multiple restarts can also be incorporated to minimize error and improve accuracy.
This project implemented and compared four versions of K-means clustering: sequential, BSP with locks, MapReduce, and work-stealing.
The results show that parallel performance depends heavily on synchronization design. The BSP version performed worst because lock contention introduced high overhead, while the MapReduce version achieved clear speedup by reducing shared-state contention.
The work-stealing refinement gave the best overall performance by improving load balance across workers. At 12 threads, it achieved the highest average speedup, about 4.4× over the sequential baseline. Overall, this project shows that K-means can benefit significantly from parallelism, but scalability is still limited by synchronization overhead and memory access costs.

