# Data Engineering Project on NYC Parking Violations Data (~50M records) Using Docker, Airflow, AWS, Snowflake, dbt, Tableau
Notes on the Project: Data Engineering NYC Parking Violations
This data engineering project involves processing and analyzing NYC parking violations data, which consists of approximately 50 million records. Below is a brief overview of the workflow and the steps involved.
- Introduction
- Data Source
- Data Ingestion and Storage
- Data Warehousing
- Data Transformation
- Data Visualization and Reporting
- Programming Languages and Tools
- Project Setup
- Usage
- Video Link
## Introduction

This project aims to process and analyze NYC parking violations data (~50M records) to derive insights. The workflow covers data ingestion, storage, transformation, and visualization using a range of tools and technologies.
## Data Source

The data originates from NYC OpenData, which provides a large public dataset of parking violations issued in New York City.
## Data Ingestion and Storage

- Apache Airflow orchestrates the data pipeline, automating the extraction of the parking violations data from the NYC OpenData portal (a minimal DAG sketch follows this list).
- The extracted data is then staged as raw files in Amazon S3 (Simple Storage Service).
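A minimal sketch of what the extract-and-stage DAG could look like, assuming Airflow 2.4+ with the Amazon provider installed; the dataset endpoint, bucket name, connection id, and DAG id below are illustrative assumptions, not the project's actual values:

```python
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# Assumed Socrata-style endpoint for the NYC OpenData parking violations
# dataset; replace EXAMPLE with the real dataset id.
DATA_URL = "https://data.cityofnewyork.us/resource/EXAMPLE.csv"
S3_BUCKET = "nyc-parking-violations-raw"  # assumed name of the staging bucket


def extract_to_s3(**context):
    """Download one batch of violations and stage it as a raw CSV in S3."""
    response = requests.get(DATA_URL, params={"$limit": 50000}, timeout=300)
    response.raise_for_status()
    S3Hook(aws_conn_id="aws_default").load_string(
        response.text,
        key=f"raw/violations_{context['ds']}.csv",
        bucket_name=S3_BUCKET,
        replace=True,
    )


with DAG(
    dag_id="nyc_parking_violations_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; use schedule_interval on older versions
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3)
```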
## Data Warehousing

From Amazon S3, the data is loaded into Snowflake, a cloud-based data warehousing solution. This step shapes the raw data into structured tables suitable for analysis.
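Under the hood, this load is typically a `COPY INTO` from an external stage that points at the S3 bucket. A minimal sketch using the Snowflake Python connector; the stage, table, and credential values are placeholders, not the project's actual configuration:

```python
import snowflake.connector

# Placeholder credentials; in practice these would come from environment
# variables or a secrets manager.
conn = snowflake.connector.connect(
    account="YOUR_ACCOUNT",
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    warehouse="COMPUTE_WH",
    database="PARKING",
    schema="RAW",
)

# COPY INTO reads the staged CSVs from an external stage that points at the
# S3 bucket; the stage and table names here are assumed.
COPY_SQL = """
COPY INTO RAW.PARKING_VIOLATIONS
FROM @RAW.S3_PARKING_STAGE/raw/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
ON_ERROR = 'CONTINUE'
"""

with conn.cursor() as cur:
    cur.execute(COPY_SQL)
    print(cur.fetchall())  # one row of load results per staged file
conn.close()
```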
## Data Transformation

dbt (data build tool) transforms the data within Snowflake, refining the raw tables into analysis-ready models.
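One way to run the dbt step from Python (for example, inside an Airflow task) is to shell out to the dbt CLI. A minimal sketch, assuming the dbt project and profiles live in a `dbt/` directory, which is an assumption about this repository's layout:

```python
import subprocess


def run_dbt(command: str) -> None:
    """Run a dbt CLI command against the Snowflake profile; fail loudly on error."""
    result = subprocess.run(
        ["dbt", command, "--project-dir", "dbt", "--profiles-dir", "dbt"],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    result.check_returncode()  # raises CalledProcessError on a non-zero exit


run_dbt("run")   # build the transformed models in Snowflake
run_dbt("test")  # validate them with dbt tests
```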
## Data Visualization and Reporting

Once the transformed data is available in Snowflake, Tableau is used to create visualizations, dashboards, and reports, enabling stakeholders to derive insights from the parking violations data.
## Programming Languages and Tools

- Python and SQL are the primary languages used throughout the project: Python for scripting and automation, and SQL for querying and managing the data within Snowflake (a small query sketch follows this list).
- The entire workflow runs inside Docker containers, ensuring a consistent environment for every component of the project.
- The project runs on Linux, providing a robust and scalable platform for the data pipeline.
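For example, a quick sanity check on the transformed data might look like this in Python; the `FCT_PARKING_VIOLATIONS` model and its columns are assumed names for the dbt output, not taken from the actual project:

```python
import snowflake.connector

# Placeholder credentials, as in the load sketch above.
conn = snowflake.connector.connect(
    account="YOUR_ACCOUNT", user="YOUR_USER", password="YOUR_PASSWORD",
    warehouse="COMPUTE_WH", database="PARKING", schema="ANALYTICS",
)
with conn.cursor() as cur:
    # Top ten violation codes by count in the (assumed) transformed fact table.
    cur.execute(
        """
        SELECT violation_code, COUNT(*) AS n_violations
        FROM FCT_PARKING_VIOLATIONS
        GROUP BY violation_code
        ORDER BY n_violations DESC
        LIMIT 10
        """
    )
    for code, n in cur.fetchall():
        print(code, n)
conn.close()
```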
## Project Setup

To set up the project, follow these steps:
1. Clone the repository:

   ```bash
   git clone https://github.com/BadreeshShetty/Data-Engineering-ETL-Airflow-DBT-Parking.git
   cd Data-Engineering-ETL-Airflow-DBT-Parking
   ```

2. Build and start the Docker containers:

   ```bash
   docker-compose up --build -d
   ```

3. Access the Airflow web UI to monitor the data pipeline.
## Usage

- Trigger the Airflow DAG to start the data ingestion process (see the REST API sketch after this list).
- Monitor the data extraction and loading into Amazon S3.
- Check the data transformation process in Snowflake using dbt.
- Access Tableau to create visualizations and reports based on the transformed data.
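As an alternative to clicking through the web UI, a DAG can be triggered programmatically via Airflow's stable REST API. A minimal sketch, assuming the default docker-compose credentials and the hypothetical DAG id from the ingestion sketch above:

```python
import requests

# Trigger a new DAG run through Airflow's stable REST API (Airflow 2.x).
response = requests.post(
    "http://localhost:8080/api/v1/dags/nyc_parking_violations_ingest/dagRuns",
    auth=("airflow", "airflow"),  # default docker-compose credentials; change in production
    json={"conf": {}},
    timeout=30,
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```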
## Video Link

https://youtu.be/PNe9POTPx4I
