
BDP - Assignment 1

Overview

In this assignment we simulate the ingestion of the 2018 Yellow Taxi Trip Data dataset into a Cassandra cluster (mysimbdp-coredms). The application serves several users at a time (tested with up to 500 concurrent clients) with no data loss, thanks to the RabbitMQ cluster it employs.
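
The exact message format used by client.py is not shown in this README; as an illustration only, a producer might serialise each CSV row of the dataset to JSON before publishing it to RabbitMQ (e.g. with pika's basic_publish). A minimal, queue-free sketch of that serialisation step, using hypothetical column names:

```python
import csv
import io
import json

def rows_to_messages(csv_text):
    """Parse CSV text and serialise each row as a JSON message body,
    one message per trip record, ready to be published to the queue."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(row) for row in reader]

# Hypothetical two-row sample in the shape of the taxi dataset:
sample = (
    "vendor_id,trip_distance,fare_amount\n"
    "1,2.5,12.0\n"
    "2,0.9,5.5\n"
)
messages = rows_to_messages(sample)
```

Serialising on the client side keeps the queue payload self-describing, so consumers can insert records without sharing a schema file with producers.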

Here is the platform design:


Project Structure

code
├── client.py
├── db
│   ├── docker-compose.yml
│   └── setuppers
│       ├── Dockerfile
│       └── setup_db.py
├── install_dependencies.sh
├── queue
│   ├── consumer.py
│   ├── docker-compose.yml
│   └── Dockerfile
├── README.md
├── run.sh
├── start_db.sh
├── start_queue.sh
└── utils
    ├── Dockerfile
    └── split_data.py

The deployment uses docker-compose to start two cassandra-db and two bitnami/rabbitmq containerised nodes. The other two components use containers as well:

Both are pre-built on top of a python3 Docker image with cassandra-python as a dependency (see code/utils/Dockerfile), which I pushed to my Docker repository so that it is downloaded automatically when composing.
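
The consumers use the cassandra-python driver to write dequeued records into mysimbdp-coredms. The keyspace and table names below (taxi, trips) are assumptions for illustration; a sketch of building a parameterised CQL statement of the kind one would pass to session.prepare():

```python
def build_insert_cql(keyspace, table, columns):
    """Build a parameterised CQL INSERT for the given columns, suitable
    for preparing once and executing per dequeued message."""
    cols = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    return f"INSERT INTO {keyspace}.{table} ({cols}) VALUES ({placeholders})"

stmt = build_insert_cql("taxi", "trips",
                        ["vendor_id", "trip_distance", "fare_amount"])
```

Preparing the statement once per consumer process (rather than sending raw CQL strings per message) avoids re-parsing on the Cassandra side during a high-throughput ingestion run.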

The main program is started by run.sh, which triggers the composition of the docker-compose files (code/db/docker-compose.yml and code/queue/docker-compose.yml) and then launches the number of clients (client.py) specified by the user.

Logs can be found in logs/, with each file following the format <log_type>_<client_number>.log, where log_type is one of db, queue, or client, and client_number is the number of concurrent clients run during the ingestion.
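
As an illustration of that naming scheme, a small helper (hypothetical, not part of the repo) that splits a log file name back into its two components:

```python
import re

def parse_log_name(filename):
    """Split a log file name of the form <log_type>_<client_number>.log
    into (log_type, client_count); return None if it does not match."""
    m = re.fullmatch(r"(db|queue|client)_(\d+)\.log", filename)
    if m is None:
        return None
    return m.group(1), int(m.group(2))
```

This is handy when aggregating results across runs, e.g. grouping log files by the number of concurrent clients used.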