
BDP - Assignment 1

Overview

In this assignment we simulate the ingestion of the 2018 Yellow Taxi Trip Data dataset into a Cassandra cluster (mysimbdp-coredms). The application serves several users at a time (tested with up to 500 concurrent clients) with no data loss, thanks to the RabbitMQ cluster it employs.
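
The exact message format used by client.py is not shown in this README; as an illustration only, a producer might serialise each CSV row of the dataset to JSON before publishing it to RabbitMQ (e.g. with pika's basic_publish). A minimal, queue-free sketch of that serialisation step, using hypothetical column names:

```python
import csv
import io
import json

def rows_to_messages(csv_text):
    """Parse CSV text and serialise each row as a JSON message body,
    one message per trip record, ready to be published to the queue."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(row) for row in reader]

# Hypothetical two-row sample in the shape of the taxi dataset:
sample = (
    "vendor_id,trip_distance,fare_amount\n"
    "1,2.5,12.0\n"
    "2,0.9,5.5\n"
)
messages = rows_to_messages(sample)
```

Serialising on the client side keeps the queue payload self-describing, so consumers can insert records without sharing a schema file with producers.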

Here is the platform design:


Project Structure

code
├── client.py
├── db
│   ├── docker-compose.yml
│   └── setuppers
│       ├── Dockerfile
│       └── setup_db.py
├── install_dependencies.sh
├── queue
│   ├── consumer.py
│   ├── docker-compose.yml
│   └── Dockerfile
├── README.md
├── run.sh
├── start_db.sh
├── start_queue.sh
└── utils
    ├── Dockerfile
    └── split_data.py

The deployment uses docker-compose to start two cassandra-db and two bitnami/rabbitmq containerised nodes. The other two components use containers as well:

Both are pre-built on top of a python3 Docker image with cassandra-python as a dependency (see code/utils/Dockerfile), which I pushed to my Docker repository so that it is downloaded automatically when composing.
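
The consumers use the cassandra-python driver to write dequeued records into mysimbdp-coredms. The keyspace and table names below (taxi, trips) are assumptions for illustration; a sketch of building a parameterised CQL statement of the kind one would pass to session.prepare():

```python
def build_insert_cql(keyspace, table, columns):
    """Build a parameterised CQL INSERT for the given columns, suitable
    for preparing once and executing per dequeued message."""
    cols = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    return f"INSERT INTO {keyspace}.{table} ({cols}) VALUES ({placeholders})"

stmt = build_insert_cql("taxi", "trips",
                        ["vendor_id", "trip_distance", "fare_amount"])
```

Preparing the statement once per consumer process (rather than sending raw CQL strings per message) avoids re-parsing on the Cassandra side during a high-throughput ingestion run.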

The main program is started by run.sh, which triggers the composition of the docker-compose files (code/db/docker-compose.yml and code/queue/docker-compose.yml) and then launches the number of clients (client.py) specified by the user.

Logs can be found in logs/, with each file following the format <log_type>_<client_number>.log, where log_type is one of db, queue, or client, and client_number is the number of concurrent clients run during the ingestion.
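
As an illustration of that naming scheme, a small helper (hypothetical, not part of the repo) that splits a log file name back into its two components:

```python
import re

def parse_log_name(filename):
    """Split a log file name of the form <log_type>_<client_number>.log
    into (log_type, client_count); return None if it does not match."""
    m = re.fullmatch(r"(db|queue|client)_(\d+)\.log", filename)
    if m is None:
        return None
    return m.group(1), int(m.group(2))
```

This is handy when aggregating results across runs, e.g. grouping log files by the number of concurrent clients used.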