Welcome to the repository for Azure Data Engineering - Basic to Advanced. This project covers everything from foundational SQL and Big Data concepts to cloud-based data orchestration, real-time streaming, and production-scale data pipelines built on the Azure ecosystem.
| Category | Technologies |
|---|---|
| Cloud Platforms | Azure (Blob, ADLS Gen2, ADF, Synapse, Event Hub, Cosmos DB, Logic Apps), GCP (Dataproc, BigQuery, Pub/Sub) |
| Data Processing | Apache Spark (PySpark), Hive, Trino |
| Table Formats | Delta Lake, Apache Iceberg, Apache Hudi |
| Databases | MySQL, MongoDB, Cassandra, Snowflake |
| Streaming | Confluent Kafka, GCP Pub/Sub, Azure Event Hubs |
| Orchestration | Apache Airflow (Cloud Composer) |
| AI & Automation | n8n, OpenAI/ChatGPT, Agentic AI |
- Foundations: Database vs. DBMS/RDBMS, ACID Properties.
- MySQL: Setup Workbench, Integrity Constraints.
- Commands: DDL, DML, DQL, DCL (CREATE, INSERT, ALTER, DROP, TRUNCATE, DELETE).
- Advanced Querying: Joins, Subqueries, Window Functions, CTEs (Recursive & Non-Recursive), CASE Statements.
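
As a small illustration of the recursive CTE and window function topics above, here is a minimal sketch that uses Python's built-in `sqlite3` module rather than MySQL, so it runs without a server. The `employees` table and its rows are made up for the example, and window functions assume the bundled SQLite is version 3.25 or newer.

```python
import sqlite3

# In-memory database with a hypothetical employees table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT,
    manager_id INTEGER,
    salary INTEGER
);
INSERT INTO employees VALUES
    (1, 'Asha', NULL, 300),
    (2, 'Ben',  1,    200),
    (3, 'Chen', 1,    250),
    (4, 'Dev',  2,    100);
""")

# Recursive CTE: start at the employee with no manager and walk down
# the reporting chain, tracking the depth of each level.
hierarchy = conn.execute("""
WITH RECURSIVE chain(id, name, depth) AS (
    SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, c.depth + 1
    FROM employees e
    JOIN chain c ON e.manager_id = c.id
)
SELECT name, depth FROM chain ORDER BY depth, name
""").fetchall()

# Window function: rank every employee by salary without collapsing rows.
ranked = conn.execute("""
SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS salary_rank
FROM employees
ORDER BY salary_rank
""").fetchall()

print(hierarchy)
print(ranked)
```

The same two queries should carry over to MySQL 8.0 or later with little or no change, since that release also supports `WITH RECURSIVE` and window functions.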
- Concepts: Distributed Computation/Storage, Hadoop Architecture (HDFS, YARN, MapReduce).
- Hive: Architecture, Data Types, SerDe (CSV, JSON, Parquet, ORC), Partitioning (Static/Dynamic), Bucketing, and Joins.
- Architecture: Brokers, Topics, Partitions, Replicas, Offset Management.
- Programming: Producer/Consumer API (Sync/Async), Serialisation/Deserialisation (JSON, CSV).
- Cloud Integration: GCP Pub/Sub Setup and Producer/Consumer implementation.
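
The serialisation and deserialisation step listed above can be sketched with the standard library alone. Kafka and Pub/Sub move raw bytes, so a producer typically encodes each record before sending and a consumer decodes it after receiving. The function names and the sample event below are hypothetical, and no broker or client library is involved.

```python
import json


def serialize_json(record: dict) -> bytes:
    """Encode a record as compact UTF-8 JSON bytes (producer side)."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def deserialize_json(payload: bytes) -> dict:
    """Decode UTF-8 JSON bytes back into a record (consumer side)."""
    return json.loads(payload.decode("utf-8"))


# Hypothetical event, round-tripped through both functions.
event = {"order_id": 42, "status": "CREATED"}
payload = serialize_json(event)
restored = deserialize_json(payload)

print(payload)
print(restored == event)
```

Functions of this shape can usually be handed to a client library as its serializer and deserializer hooks, though the exact parameter names depend on the client you use.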
- MongoDB: Atlas Setup, MongoDB Compass, Queries via Python, ksqlDB Integration.
- Cassandra: CAP Theorem, Architecture (SSTables, Memtables), Gossip Protocol, Consistency Levels, and CQL queries.
- Core Spark: RDDs, DataFrames, Spark Ecosystem, Narrow vs. Wide Transformations, Lazy Evaluation, DAGs.
- Optimizations: Caching/Persisting, Skewness Handling, Salting, Repartition vs. Coalesce, Memory Management.
- Streaming: Structured Streaming, Checkpointing, Watermarking, Windowed Aggregations (Tumbling/Sliding).
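
To make the salting idea from the optimizations above concrete, here is a plain-Python sketch of the two-stage aggregation it relies on. It is not Spark code: it only mimics the logic, in which a salt is appended to each key so that a hot key is split across several sub-keys, and the partial results are then merged after the salt is stripped. `NUM_SALTS`, `salted_counts`, and the sample keys are assumptions made for illustration, and the salt is assigned round-robin here so the result is reproducible, whereas Spark jobs commonly use a random salt.

```python
from collections import Counter

NUM_SALTS = 4  # hypothetical number of sub-keys per original key


def salted_counts(keys, num_salts=NUM_SALTS):
    """Count keys in two stages, spreading each key over `num_salts` sub-keys."""
    # Stage 1: aggregate on (key, salt) so that no single group holds
    # every row of a hot key.
    partial = Counter()
    for i, key in enumerate(keys):
        salt = i % num_salts  # round-robin salt keeps the example deterministic
        partial[(key, salt)] += 1

    # Stage 2: drop the salt and combine the partial counts per original key.
    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return partial, final


# Skewed input: one hot key and one rare key.
keys = ["hot"] * 8 + ["cold"] * 2
partial, final = salted_counts(keys)

print(dict(partial))
print(dict(final))
```

In this run the eight `hot` rows are split into four groups of two during the first stage, which is the effect salting aims for on a skewed join or aggregation key.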
- Core Concepts: DAGs, Operators (Bash, Python), Task Dependencies.
- Cloud Composer: Running DAGs on GCP, parallel tasks, and backfilling.
- Unity Catalog: Governance, Managed vs. External Tables, Volumes.
- Delta Lake: ACID transactions, Time Travel, Schema Evolution, Delta Sharing.
- DLT: Delta Live Tables Medallion Architecture.
- Modeling: Star Schema, Snowflake Schema, Galaxy Schema.
- Concepts: OLAP vs. OLTP, Fact & Dimension tables, SCD Types (SCD1, SCD2).
- Case Studies: Expedia & Swiggy Data Modeling.
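
The SCD2 concept listed above can be sketched in plain Python. A Type 2 dimension keeps history: when a tracked attribute changes, the current row is closed out and a new current row is added, whereas SCD1 would overwrite the value in place. The column names (`valid_from`, `valid_to`, `is_current`), the `apply_scd2` helper, and the sample data are assumptions chosen for illustration.

```python
from datetime import date


def apply_scd2(dimension, incoming, effective):
    """Return a new dimension after applying SCD Type 2 changes.

    dimension: list of dicts with id, city, valid_from, valid_to, is_current
    incoming:  dict mapping id -> latest city from the source system
    effective: date on which the incoming values take effect
    """
    result = []
    current_ids = set()
    for row in dimension:
        row = dict(row)  # copy so the caller's rows are not mutated
        if row["is_current"]:
            current_ids.add(row["id"])
            new_city = incoming.get(row["id"])
            if new_city is not None and new_city != row["city"]:
                # Close the old version and open a new current version.
                row["valid_to"] = effective
                row["is_current"] = False
                result.append(row)
                result.append({"id": row["id"], "city": new_city,
                               "valid_from": effective, "valid_to": None,
                               "is_current": True})
                continue
        result.append(row)

    # Ids with no current row in the dimension are inserted as new members.
    for key, city in incoming.items():
        if key not in current_ids:
            result.append({"id": key, "city": city,
                           "valid_from": effective, "valid_to": None,
                           "is_current": True})
    return result


dim = [{"id": 1, "city": "Pune", "valid_from": date(2023, 1, 1),
        "valid_to": None, "is_current": True}]
updated = apply_scd2(dim, {1: "Delhi", 2: "Goa"}, date(2024, 6, 1))

for row in updated:
    print(row)
```

In a warehouse the same logic is usually expressed as a `MERGE` statement, but the row-level behaviour (expire, insert, leave unchanged) is the same.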
- Snowflake: Snowpipe for event-driven ingestion, Storage Integration.
- BigQuery: Architecture (Capacitor, Colossus, Dremel), External vs. Managed Tables, ML/AI features (Gemini integration).
- Apache Iceberg: Metadata Layer, Manifests, CoW vs. MoR, Compaction.
- Apache Hudi: Incremental Pipelines, Multi-modal Indexes, Storage Layout.
- Distributed SQL: Architecture (Coordinator/Workers), Physical/Logical Plans, Connector ecosystem (MongoDB, GCS, etc.).
- Storage: Blob, ADLS Gen2.
- Serverless: Azure Functions (HTTP/Blob/Service Bus triggers), Logic Apps.
- Messaging: Service Bus (Queues/Topics), Event Hubs.
- Data Engineering: Data Factory (ADF), Data Flows, Synapse Analytics (SQL Pools, Spark Pools).
- Global Database: Cosmos DB.
- Governance: Key Vault, RBAC.
| # | Project Name | Tech Stack |
|---|---|---|
| 1 | Flight Booking Data Pipeline | Airflow, PySpark, Dataproc, BigQuery, CI/CD |
| 2 | E-commerce Event-Driven Pipeline | Databricks, Delta Lake, Workflows, GitHub |
| 3 | Travel Booking SCD2 Warehouse | Databricks, Unity Catalog, PyDeequ, Volumes |
| 4 | Healthcare DLT Medallion Pipeline | Databricks DLT, Delta Lake, SQL, Unity Catalog |
| 5 | UPI Transactions CDC Streaming | PySpark Structured Streaming, Delta CDF, Unity Catalog |
| 6 | News Data Analysis | Airflow, GCS, Python, Snowflake |
| 7 | Car Rental Batch Ingestion | Python, PySpark, GCP Dataproc, Airflow, Snowflake |
| 8 | Movie Booking CDC Aggregation | Snowflake Dynamic Tables, Streams, Tasks, Streamlit |
| 9 | Weather Forecast Processing | OpenWeather API, Airflow, PySpark, BigQuery, GCS |
| 10 | Unified Stock Trading Platform | Trino, MongoDB, BigQuery, Airflow, Python |
| 11 | Fintech SQL Data Migration | Azure Synapse, SQL DB, ADLS, PySpark, Logic Apps |
| 12 | AirBnB CDC Ingestion Pipeline | Azure ADF, Cosmos DB, Synapse, n8n, Agentic AI |
| 13 | BookMyShow Real-Time Pipeline | Azure Event Hub, Stream Analytics, Synapse, n8n |
| 14 | Airlines Incremental Processing | Azure DevOps, ADF, ADLS, GitHub |
| 15 | Ride Analytics Pipeline (Uber/Ola) | FastAPI, Docker, Cloud Run, Artifact Registry, GH Actions |
- Master in-demand data engineering skills from a working Lead Data Engineer and get job-ready for high-paying roles.