This project implements a distributed integer values sorting system using a MapReduce-like paradigm, with a master-worker architecture and gRPC for communication.
-
Master:
- Reads a configuration file (
config.yaml) which includes the list of worker addresses and the number of mappers/reducers. - Assigns mapper and reducer roles to workers.
- Assigns integer ranges to reducers and notifies them to mappers.
- Distributes chunks of input data to the mappers.
- Reads a configuration file (
-
Workers:
- Mappers receive chunks of unsorted integers, sort the data and split it into sub-chunks, basing division on reducers' assigned interval ranges, send the data to reducers, and then notify reducers when done.
- Reducers wait for all mappers to finish sending their data, then sort the collected data and write the output to a local file.
.
├── main.go
├── go.mod
├── master
│ └── master.go
├── worker
│ └── worker.go
├── proto
│ ├── mapreduce.proto
│ ├── mapreduce.pb.go
│ └── mapreduce_grpc.pb.go
├── generate_random_input.sh
├── config.yaml
└── input
config.yaml should list the workers and specify how many of them are mappers. For example:
workers:
- "localhost:50051"
- "localhost:50052"
- "localhost:50053"
- "localhost:50054"
- "localhost:50055"
- "localhost:50056"
- "localhost:50057"
- "localhost:50058"
mappers: 4The input file should contain one integer per line, for example:
10
3
5
1
2
9
8
7
4
6
-
Start the Workers
Open multiple terminals, one for each worker defined in
config.yaml:./mapreduce --mode=worker --port=:50051 ./mapreduce --mode=worker --port=:50052 ./mapreduce --mode=worker --port=:50053 ./mapreduce --mode=worker --port=:50054
Each worker will print a message indicating which port its listening on and will wait for role assignment.
-
Run the Master
In a separate terminal:
./mapreduce --mode=master --config=config.yaml --input=input
-
Processing Steps
The master:
- Reads the config and input file.
- Computes data ranges for the reducers.
- Assigns mappers and reducers roles, while advertising reducer ranges to mappers, and mappers total count to reducers.
- Distributes input data chunks to the mappers.
- Once done, the master exits.
The mappers:
- Receive input data chunks.
- Sort the data.
- Send sub-chunks to reducers based on the reducers’ assigned intervals.
- Notify every reducer when they've done.
The reducers:
- Wait for all mappers to finish sending data.
- Merge the received sub-chunks.
- Sort the merged data.
- Write data to output files.
Each reducer produces its own sorted output file, marking it with its port number. For example:
reducer__XXXXX_output.txt
These files contain the sorted integers that the reducer processed.
To stop the workers, press Ctrl+C in their respective terminals.
- The reducers do not produce a single merged file, each reducer’s output file contains a portion of the input data.
- Adjust
config.yamlandinputfile as necessary for your use case. - There is a generate_random_input.sh script that can be used to generate a large input file with one million random integers.
- Ensure all workers are running before starting the master.
- For subsequent runs, the workers can remain running, but the master must be restarted each time.