This project demonstrates a complete, enterprise-style data engineering solution built on Microsoft Azure. The objective is to ingest, process, transform, and publish COVID‑19 data using multiple Azure services, following modern data lake and analytics architecture patterns.
The solution integrates public health data from the European Centre for Disease Prevention and Control (ECDC) with population reference data to deliver accurate, scalable, and analytics-ready datasets. The final outputs are consumed through interactive Power BI dashboards for reporting and analysis.
The architecture follows a layered approach consisting of data sources, ingestion, staging, transformation, serving, and publishing layers. Each layer uses Azure-native services chosen for scalability, performance, and maintainability.
The solution architecture diagram illustrates how data flows from external and internal sources, through transformation engines, into a relational serving layer, and finally into business intelligence tools.
The project pursues the following objectives:
- Build an end-to-end data engineering pipeline using Microsoft Azure services
- Ingest structured and semi-structured data from multiple sources
- Store raw and processed data in a scalable data lake
- Perform both light and heavy data transformations using the right tools
- Deliver clean, analytics-ready data to a relational database
- Enable business users to analyze COVID‑19 trends using Power BI
- Demonstrate best practices in cloud-based data architecture
The primary dataset is sourced from the European Centre for Disease Prevention and Control (ECDC). This data is accessed through a public REST API using an HTTP connector.
The dataset includes COVID‑19 case counts, deaths, dates, and country-level reporting information. Because the source is external and regularly updated, it is ingested dynamically through Azure Data Factory.
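As a sketch of what the HTTP ingestion retrieves, the following Python snippet downloads the raw CSV payload and parses it into records. The URL and column names here are illustrative only; in the pipeline itself the real ECDC endpoint is configured in the ADF HTTP linked service, not in code.

```python
import csv
import io
from urllib.request import urlopen

# Hypothetical endpoint; the actual ECDC URL is configured in ADF.
ECDC_URL = "https://example.com/ecdc/covid19/casedistribution/csv"

def fetch_raw_csv(url: str) -> str:
    """Download the raw CSV payload exactly as served (no transformation)."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")

def parse_records(raw_csv: str) -> list[dict]:
    """Parse the CSV into row dictionaries keyed by the source's column headers."""
    return list(csv.DictReader(io.StringIO(raw_csv)))
```

Keeping the fetch untransformed mirrors the raw-fidelity principle: parsing happens downstream, never at ingestion.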
Population reference data is stored in Azure Blob Storage, an object storage service optimized for unstructured and semi-structured data.
This population dataset is used to calculate ratios and metrics such as cases per capita and death rates by population.
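The population-adjusted metrics reduce to simple ratios, sketched below in Python:

```python
def cases_per_100k(cases: int, population: int) -> float:
    """Population-adjusted incidence: reported cases per 100,000 inhabitants."""
    return cases / population * 100_000

def death_rate(deaths: int, population: int) -> float:
    """Deaths as a share of the population."""
    return deaths / population
```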
Azure Data Factory acts as the central orchestration and ingestion service for the entire pipeline. ADF pipelines schedule, automate, and monitor data movement activities.
The following ingestion mechanisms are implemented:
- HTTP Linked Service to connect to the ECDC public REST API
- Blob Storage Linked Service to access population datasets
- Copy Activities to move raw data into the data lake
ADF ensures that data ingestion is reliable, repeatable, and auditable. All raw data is ingested without transformation to preserve source fidelity.
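As a sketch, an ADF HTTP linked service is defined in JSON along these lines; the name and base URL below are illustrative:

```json
{
  "name": "ls_ecdc_http",
  "properties": {
    "type": "HttpServer",
    "typeProperties": {
      "url": "https://opendata.ecdc.europa.eu/",
      "authenticationType": "Anonymous"
    }
  }
}
```

Datasets and Copy Activities then reference this linked service by name, so the connection details live in one place.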
All ingested data is stored in Azure Data Lake Storage Gen2. This layer represents the raw or bronze layer of the data architecture.
ADLS Gen2 provides:
- Massively scalable storage
- Hierarchical namespace for folder-based organization
- Optimized performance for big data analytics
- Integration with Spark, Hive, and SQL-based services
Data in this layer is stored in its original format and structure, enabling reprocessing and historical analysis if needed.
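One common convention for the raw layer, sketched below in Python, is to organize files into date-partitioned folders under the hierarchical namespace; the container and folder names here are hypothetical:

```python
from datetime import date

RAW_CONTAINER = "raw"  # hypothetical container name for the raw/bronze layer

def raw_path(source: str, dataset: str, ingest_date: date) -> str:
    """Build a date-partitioned folder path, e.g. raw/ecdc/cases_deaths/2020/12/14/.
    Date partitioning makes reprocessing a single ingestion run straightforward."""
    return f"{RAW_CONTAINER}/{source}/{dataset}/{ingest_date:%Y/%m/%d}/"
```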
The transformation layer uses multiple services, each selected based on the complexity and nature of the transformation tasks.
ADF is reused in this layer for lightweight, flow-based transformations. Mapping Data Flows and Copy Activities are used to perform:
- Schema mapping
- Column selection and renaming
- Simple aggregations
- Data cleansing and filtering
These transformations are ideal for structured data and low-to-medium complexity workloads.
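The select/rename/filter logic that a Mapping Data Flow applies can be sketched in plain Python; the column mapping below is illustrative, loosely following the ECDC CSV headers:

```python
import csv
import io

# Hypothetical column mapping, mirroring a Mapping Data Flow "select" transformation.
COLUMN_MAP = {
    "dateRep": "report_date",
    "countriesAndTerritories": "country",
    "cases": "cases",
    "deaths": "deaths",
}

def cleanse(raw_csv: str) -> list[dict]:
    """Select and rename columns, drop rows with missing counts, cast numerics."""
    out = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        if not row.get("cases") or not row.get("deaths"):
            continue  # filter out incomplete rows
        rec = {new: row[old] for old, new in COLUMN_MAP.items()}
        rec["cases"] = int(rec["cases"])
        rec["deaths"] = int(rec["deaths"])
        out.append(rec)
    return out
```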
Azure Databricks is used for advanced, compute-intensive transformations. It is based on Apache Spark and supports Python and PySpark notebooks.
Databricks is responsible for:
- Large-scale data joins
- Complex aggregations across multiple datasets
- Enrichment of COVID‑19 data with population data
- Performance-optimized transformations on large volumes of data
Notebook-driven development enables version control, experimentation, and collaboration across teams.
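The enrichment step runs as a PySpark join in a Databricks notebook; the same join-and-derive logic can be sketched in plain Python (field names are illustrative):

```python
def enrich_with_population(cases: list[dict], population: dict[str, int]) -> list[dict]:
    """Join daily case records to population reference data on country
    and derive a per-100k incidence column (inner-join semantics)."""
    enriched = []
    for row in cases:
        pop = population.get(row["country"])
        if pop is None:
            continue  # drop countries without reference data, as an inner join would
        enriched.append({**row,
                         "population": pop,
                         "cases_per_100k": row["cases"] / pop * 100_000})
    return enriched
```

At scale, Spark distributes the same join across the cluster; the Python version only illustrates the row-level semantics.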
Azure HDInsight provides managed Hadoop and Hive clusters. It is used for SQL-style transformations directly on files stored in the data lake.
HDInsight enables:
- Hive SQL queries on data lake files
- Batch processing workloads
- Compatibility with traditional big data ecosystems
This allows data engineers familiar with SQL-based tooling to work efficiently within the data lake environment.
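A HiveQL sketch of this pattern: an external table laid over files in the data lake, followed by a SQL-style aggregation. Table, path, and column names are hypothetical, and `<storage_account>` is a placeholder:

```sql
-- Hypothetical external table over raw files in ADLS Gen2.
CREATE EXTERNAL TABLE IF NOT EXISTS covid_cases (
  report_date DATE,
  country     STRING,
  cases       INT,
  deaths      INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'abfs://raw@<storage_account>.dfs.core.windows.net/ecdc/cases_deaths/';

-- SQL-style transformation: monthly totals per country.
SELECT country,
       date_format(report_date, 'yyyy-MM') AS report_month,
       SUM(cases)  AS total_cases,
       SUM(deaths) AS total_deaths
FROM covid_cases
GROUP BY country, date_format(report_date, 'yyyy-MM');
```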
After transformation, clean and curated datasets are loaded into Azure SQL Database. This database acts as the serving or gold layer of the architecture.
Key characteristics of this layer include:
- Relational schema design
- Optimized indexes and tables
- High availability and reliability
- Fast query performance for analytics
This structured layer is designed specifically for downstream consumption by analytics and reporting tools.
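One possible shape for such a serving table in T-SQL, with an index aligned to the typical dashboard query; the schema and names are illustrative, not the project's actual DDL:

```sql
-- Hypothetical serving-layer table for daily, population-enriched figures.
CREATE TABLE dbo.covid_daily (
    report_date     DATE          NOT NULL,
    country         VARCHAR(100)  NOT NULL,
    cases           INT           NOT NULL,
    deaths          INT           NOT NULL,
    population      BIGINT        NOT NULL,
    cases_per_100k  DECIMAL(10,2) NULL,
    CONSTRAINT pk_covid_daily PRIMARY KEY (report_date, country)
);

-- Index supporting the common "trend by country over time" dashboard query.
CREATE NONCLUSTERED INDEX ix_covid_daily_country
    ON dbo.covid_daily (country, report_date)
    INCLUDE (cases, deaths);
```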
Power BI connects directly to Azure SQL Database to build interactive dashboards. These dashboards allow users to explore and analyze COVID‑19 data visually.
Example insights provided by the dashboards include:
- Daily and cumulative case counts
- Trends over time by country and region
- Deaths and recovery statistics
- Population-adjusted metrics such as cases per 100,000 people
Power BI provides self-service analytics capabilities for stakeholders without requiring direct access to the data engineering infrastructure.
In summary, the solution delivers:
- End-to-end Azure data pipeline
- Multiple ingestion sources with ADF
- Raw data storage in ADLS Gen2
- Multi-engine transformation strategy (ADF, Databricks, HDInsight)
- Relational serving layer using Azure SQL Database
- Interactive Power BI dashboards
- Scalable and modular architecture
Key learnings from the project include:
- Choosing the right Azure service for each processing requirement
- Designing layered data lake architectures
- Balancing cost, performance, and complexity
- Building pipelines that are reusable and extensible
- Integrating big data platforms with traditional BI tools
Potential future enhancements:
- Implement data quality checks and validation rules
- Add incremental data loading and change data capture
- Enhance monitoring and logging
- Integrate machine learning models for forecasting
- Automate infrastructure using Infrastructure as Code
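As one sketch of the incremental-loading enhancement, a high-water-mark filter selects only rows newer than the last successful load, a common pattern when the source lacks change data capture; the field name is hypothetical:

```python
from datetime import date

def incremental_window(last_watermark: date, rows: list[dict]) -> list[dict]:
    """Return only rows newer than the stored high-water mark.
    After a successful load, the watermark advances to the max date seen."""
    return [r for r in rows if r["report_date"] > last_watermark]
```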
This project showcases a real-world, production-style data engineering solution on Azure. It demonstrates how multiple Azure services can work together to ingest, transform, and deliver meaningful insights from raw data.
The architecture is scalable, modular, and extensible, making it a strong foundation for advanced analytics and data science workloads.