"Building the data foundation that powers smarter transit solutions."
This project simulates a production-grade ELT (Extract, Load, Transform) pipeline for a ride-sharing company similar to RideCo.
The goal was not just to visualize data, but to build a robust Analytics Engineering architecture that ensures data accuracy, modularity, and performance. I ingested millions of rows of NYC Taxi data, modeled it using a Star Schema, and built an interactive dashboard for stakeholders.
```mermaid
graph LR
    A[NYC TLC API] -->|"Extract (Python/Pandas)"| B(Airflow DAG)
    B -->|"Load (Append)"| C[(PostgreSQL Raw)]
    C -->|"Transform (dbt)"| D{dbt Staging View}
    D -->|Materialize| E[dbt Fact Table]
    E -->|Viz| F[Streamlit Dashboard]
    subgraph QG["Quality Gates"]
        D -- Tests --> G[dbt test]
    end
```
### Robust Data Ingestion (Airflow & Python)
Instead of manual CSV uploads, I built an automated pipeline using Apache Airflow.
- **Idempotency:** The pipeline accepts a `logical_date` parameter. If I need to backfill data for January 2024, I can trigger that specific run date, and the script dynamically generates the correct URL.
- **Memory Management:** Used chunked reads to process large Parquet files without exhausting memory (see the sketch after this list).
- **Security:** All database credentials are managed via environment variables (`.env`), ensuring no secrets are hardcoded.
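Below is a minimal sketch of this extract-and-load logic. The URL pattern, table name, and function name are illustrative assumptions rather than the exact contents of `dags/rideco_pipeline.py`, and the chunked read goes through pyarrow record batches converted to pandas frames (plain `pandas.read_parquet` would load the whole file at once):

```python
# Illustrative sketch only; URL, table, and function names are hypothetical.
import io
import os

import pyarrow.parquet as pq
import requests
from sqlalchemy import create_engine


def ingest_month(logical_date: str) -> None:
    """Download one month of trip data and append it to the raw table."""
    # A logical_date like "2024-01" selects the monthly file, so re-triggering
    # a past Airflow run backfills exactly that month.
    url = f"https://example-data-host/trip-data/yellow_tripdata_{logical_date}.parquet"
    buf = io.BytesIO(requests.get(url, timeout=300).content)

    # Credentials come from the environment (.env), never from source code.
    engine = create_engine(
        f"postgresql://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
        f"@{os.environ['DB_HOST']}:{os.environ['DB_PORT']}/{os.environ['DB_NAME']}"
    )

    # Stream the file in record batches so the decompressed data never has
    # to fit in memory all at once.
    for batch in pq.ParquetFile(buf).iter_batches(batch_size=100_000):
        batch.to_pandas().to_sql("raw_trips", engine, if_exists="append", index=False)
```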
### Analytics Engineering (dbt)
Transformation logic is decoupled from the visualization layer using dbt (data build tool).
- **Staging Layer (Views):** Cleans raw column names (e.g., `tpep_pickup_datetime` → `pickup_at`) and handles type casting.
- **Marts Layer (Tables):** Materialized as physical tables for high-performance querying in the dashboard.
- **Business Logic:** Engineered metrics like `duration_min` and filtered out data anomalies such as negative fares and 0-mile trips (see the model sketch after this list).
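A condensed staging model in this style might look like the following. The source and column names here are assumptions based on the NYC TLC schema, not the exact models shipped in `dbt_rideco/models/`:

```sql
-- models/staging/stg_trips.sql (illustrative sketch, not the exact model)
with source as (
    select * from {{ source('raw', 'raw_trips') }}
)

select
    -- Rename vendor columns and make types explicit.
    tpep_pickup_datetime::timestamp   as pickup_at,
    tpep_dropoff_datetime::timestamp  as dropoff_at,
    passenger_count::int              as passenger_count,
    trip_distance::numeric            as trip_distance,
    fare_amount::numeric              as fare_amount,

    -- Engineered metric: trip duration in minutes.
    extract(epoch from (tpep_dropoff_datetime::timestamp
                        - tpep_pickup_datetime::timestamp)) / 60.0 as duration_min
from source
-- Drop obvious anomalies before they reach the marts layer.
where fare_amount >= 0
  and trip_distance > 0
```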
### Data Quality & Governance
Trust is the currency of data. I implemented automated testing to catch bugs before they reach the dashboard:
- **Schema Tests:** Enforced `unique` and `not_null` constraints on primary keys.
- **Business Tests:** Validated that `dropoff_at > pickup_at` and `passenger_count > 0` (a singular-test sketch follows this list).
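In dbt, a business rule like this fits naturally as a singular test: a query placed in the `tests/` directory that must return zero rows to pass. A minimal sketch, assuming the renamed column names from the staging layer:

```sql
-- tests/assert_valid_trips.sql (illustrative; the test fails if any rows return)
select *
from {{ ref('fact_trips') }}
where dropoff_at <= pickup_at
   or passenger_count <= 0
```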
### Code-First Visualization (Streamlit)
Instead of a static report, I built a dynamic Python web app.
- **Insight:** Analyzes peak rush hours to optimize fleet allocation.
- **Performance:** Queries the pre-aggregated `fact_trips` table for sub-second load times (sketched below).
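The core pattern in `dashboard.py` is to cache a single query against the mart and render from the cached frame. A minimal sketch, assuming `fact_trips` carries the `pickup_at` column and that hourly trip counts are the aggregation of interest:

```python
# Minimal sketch of the dashboard's query-and-render pattern (hypothetical columns).
import os

import pandas as pd
import streamlit as st
from sqlalchemy import create_engine


@st.cache_data(ttl=600)  # Cache the result so UI reruns don't hit Postgres.
def load_trips_by_hour() -> pd.DataFrame:
    engine = create_engine(
        f"postgresql://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
        f"@{os.environ['DB_HOST']}:{os.environ['DB_PORT']}/{os.environ['DB_NAME']}"
    )
    # fact_trips is already cleaned and materialized by dbt, so this stays fast.
    query = (
        "select extract(hour from pickup_at) as hour, count(*) as trips "
        "from fact_trips group by 1 order by 1"
    )
    return pd.read_sql(query, engine)


st.title("RideCo Trip Analytics")
trips = load_trips_by_hour()
st.bar_chart(trips.set_index("hour")["trips"])
```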
## Project Structure
```
rideco-analytics-portfolio/
├── dags/                   # Airflow DAGs for orchestration
│   └── rideco_pipeline.py  # The Extract & Load logic
├── dbt_rideco/             # The transformation layer
│   ├── models/
│   │   ├── staging/        # Cleaning views
│   │   └── marts/          # Final fact tables
│   └── tests/              # Data quality checks
├── scripts/                # Helper Python scripts
│   └── ingest_data.py      # Standalone ingestion script
├── dashboard.py            # Streamlit visualization app
└── requirements.txt        # Python dependencies
```
## 🛠️ How to Run This Project

### Prerequisites
- Ubuntu / Linux / macOS (or WSL for Windows)
- Python 3.8+
- PostgreSQL
### 1. Setup Environment
Clone the repo and install dependencies:
```bash
git clone https://github.com/Bijay555/rideco-analytics-portfolio.git
cd rideco-analytics-portfolio
pip install -r requirements.txt
```
### 2. Configure Credentials
Create a `.env` file in the root directory:
```
DB_USER=username
DB_PASSWORD=your_password
DB_HOST=localhost
DB_PORT=5432
DB_NAME=database_name
```
### 3. Run the Pipeline
You can run the ingestion script manually or trigger the Airflow DAG:
```bash
python3 scripts/ingest_data.py
```
### 4. Transform Data
Run the dbt models and tests:
```bash
cd dbt_rideco
dbt run   # Builds the tables
dbt test  # Validates data quality
```
### 5. Launch Dashboard
Start the local web server:
```bash
cd ..
streamlit run dashboard.py
```
## Key Insights
Through this analysis, I identified:
- **Peak Congestion:** Average trip duration spikes by 40% between 5 PM and 6 PM on weekdays.
- **Revenue Drivers:** Trips to/from JFK Airport yield the highest average fare but have high variability in duration.
