michelhilg/sql-data-exploration

COVID-19 Exploratory Data Analysis (EDA) SQL Code

Goal

The primary goal of this repository is to conduct exploratory data analysis (EDA) on COVID-19 data using SQL on Snowflake. The code extracts insights from the dataset to help users understand trends, patterns, and key metrics of the pandemic, from its start in 2020 through February 2024.

Snowflake Usage

This SQL code is designed to run on Snowflake, a cloud-based data warehousing platform. Snowflake provides scalable, flexible data storage and processing, which makes it suitable for handling large datasets efficiently.

Data Source

The COVID-19 data used in this analysis is sourced from Our World in Data. The dataset includes comprehensive information on COVID-19 cases, deaths, vaccinations, and related metrics.

The source data comes as a single table. To demonstrate SQL data manipulation skills, it has been split into two main tables for this analysis. The modified schemas are documented in the schemas.md file.
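
The split described above can be sketched as follows. This is a minimal illustration rather than the repository's actual load script: SQLite (via Python's standard library) stands in for Snowflake so the example runs locally, the source table name `owid_covid` is hypothetical, and the column subset and sample rows are invented for illustration. The real column lists are the ones in schemas.md.

```python
import sqlite3

# SQLite stands in for Snowflake so this sketch runs without an account.
conn = sqlite3.connect(":memory:")

# Hypothetical subset of the single source table (assumed name: owid_covid).
conn.execute("""
    CREATE TABLE owid_covid (
        location TEXT, date TEXT, population INTEGER,
        new_cases INTEGER, new_deaths INTEGER,
        new_tests INTEGER, new_vaccinations INTEGER
    )
""")
# Invented sample rows, not real figures.
conn.executemany(
    "INSERT INTO owid_covid VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("Brazil", "2021-03-01", 1000, 35, 2, 60, 150),
        ("Brazil", "2021-03-02", 1000, 40, 3, 65, 200),
    ],
)

# Each target table keeps the join keys (location, date) plus its own metrics.
conn.execute("""
    CREATE TABLE coviddeaths AS
    SELECT location, date, population, new_cases, new_deaths
    FROM owid_covid
""")
conn.execute("""
    CREATE TABLE covidvaccinations AS
    SELECT location, date, new_tests, new_vaccinations
    FROM owid_covid
""")

# PRAGMA table_info returns one row per column; index 1 is the column name.
deaths_cols = [row[1] for row in conn.execute("PRAGMA table_info(coviddeaths)")]
vacc_cols = [row[1] for row in conn.execute("PRAGMA table_info(covidvaccinations)")]
print(deaths_cols)
print(vacc_cols)
```

Keeping `location` and `date` in both tables is what allows them to be joined back together in later queries.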

Data Organization

The COVID-19 data is organized into a Snowflake database with specific schemas and tables. Here is an overview of the data organization:

  • Database Name: COVID_DATABASE
  • Schemas:
    • information_schema: Views describing the contents of schemas in the database.
    • public: The default schema, which holds the two analysis tables listed below.

Tables

  1. public.coviddeaths

    • Contains information about COVID-19 cases and deaths.
  2. public.covidvaccinations

    • Stores data related to COVID-19 testing and vaccinations.
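
Because the metrics live in two tables, questions that combine them need a join. The sketch below shows one way to do that, assuming both tables share `location` and `date` columns (an assumption; check schemas.md) and using SQLite through Python in place of Snowflake. The rows are invented toy values, and the window function requires SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed minimal columns; toy values, not real figures.
    CREATE TABLE coviddeaths (location TEXT, date TEXT, population INTEGER);
    CREATE TABLE covidvaccinations (location TEXT, date TEXT, new_vaccinations INTEGER);
    INSERT INTO coviddeaths VALUES
        ('Brazil', '2021-03-01', 1000),
        ('Brazil', '2021-03-02', 1000),
        ('Brazil', '2021-03-03', 1000);
    INSERT INTO covidvaccinations VALUES
        ('Brazil', '2021-03-01', 10),
        ('Brazil', '2021-03-02', 20),
        ('Brazil', '2021-03-03', 30);
""")

# Join on the shared keys, then accumulate daily doses per location
# and express the running total per 100 people.
rows = conn.execute("""
    SELECT d.location,
           d.date,
           SUM(v.new_vaccinations) OVER (
               PARTITION BY d.location ORDER BY d.date
           ) AS running_vaccinations,
           SUM(v.new_vaccinations) OVER (
               PARTITION BY d.location ORDER BY d.date
           ) * 100.0 / d.population AS doses_per_100
    FROM coviddeaths AS d
    JOIN covidvaccinations AS v
      ON v.location = d.location AND v.date = d.date
    ORDER BY d.location, d.date
""").fetchall()

for row in rows:
    print(row)
```

The same `SUM(...) OVER (PARTITION BY ... ORDER BY ...)` pattern is standard SQL and is also supported by Snowflake.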

Code Explanation

The SQL code is divided into four main parts, each focusing on a specific aspect of the COVID-19 data. Here is a brief explanation of each part along with the key questions addressed:

  1. Data Overview (00_COVID_Data_Overview.sql):

    • Provides a general overview of the two tables in the COVID-19 dataset.
  2. Brazilian Data (01_Brazil_Data_Analysis.sql):

    • Analyzes temporal trends of COVID-19 cases, testing and vaccination in Brazil.
    • Key Questions:
      • Q1: How has the likelihood of contracting COVID in Brazil evolved?
      • Q2: After contracting COVID, how has the likelihood of death evolved in Brazil?
      • Q3: How has the number of vaccinations evolved in Brazil?
      • Q4: Does vaccination impact the number of deaths per 100 cases in Brazil?
      • Q5: Does vaccination impact the number of new cases reported in Brazil?
  3. World Data (03_World_Data_Analysis.sql):

    • Analyzes temporal trends of COVID-19 cases in the world.
    • Key Questions:
      • Q1: What are the countries with the highest infection rate by population?
      • Q2: What are the countries with the highest death count?
      • Q3: What are the countries with the highest death rate by population?
      • Q4: Which continent has the highest death count?
      • Q5: What percentage of infected cases worldwide resulted in death?
  4. Metric Comparison (03_Vaccination_Data_Analysis.sql):

    • Examines global patterns of COVID-19 vaccination.
    • Key Questions:
      • Q1: How are new vaccinations distributed across the world over the course of the pandemic?
      • Q2: How have vaccination numbers evolved in the world?
      • Q3: Which countries have the highest number of vaccine doses administered per capita?
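
To give a feel for the kind of query behind these questions, here is a hedged sketch of two of them: deaths per 100 reported cases over time in Brazil, and countries ranked by the share of the population infected. It is not the repository's actual code. SQLite (through Python) replaces Snowflake, the `total_cases` and `total_deaths` columns are assumed to be cumulative counts, `Exampleland` is a made-up country, and all numbers are toy values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed columns with cumulative counts; toy values only.
    CREATE TABLE coviddeaths (
        location TEXT, date TEXT, population INTEGER,
        total_cases INTEGER, total_deaths INTEGER
    );
    INSERT INTO coviddeaths VALUES
        ('Brazil',      '2021-01-01', 1000, 100, 2),
        ('Brazil',      '2021-02-01', 1000, 200, 5),
        ('Exampleland', '2021-01-01',  500,  20, 1),
        ('Exampleland', '2021-02-01',  500, 150, 3);
""")

# Deaths per 100 reported cases on each date, for one country.
# NULLIF avoids a division by zero on dates with no reported cases.
death_pct = conn.execute("""
    SELECT date,
           total_deaths * 100.0 / NULLIF(total_cases, 0) AS death_pct
    FROM coviddeaths
    WHERE location = 'Brazil'
    ORDER BY date
""").fetchall()

# Countries ranked by the peak percentage of the population infected.
infection_rank = conn.execute("""
    SELECT location,
           MAX(total_cases) * 100.0 / population AS pct_infected
    FROM coviddeaths
    GROUP BY location, population
    ORDER BY pct_infected DESC
""").fetchall()

print(death_pct)
print(infection_rank)
```

Multiplying by `100.0` rather than `100` forces floating-point division, which matters when the count columns are stored as integers.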

Analysis Results

The "queries-results.pdf" file provides a comprehensive overview of the findings from the COVID-19 data analysis. It answers the questions outlined above, using visualizations and charts generated with Snowflake queries and tools.

Usage with Snowflake

  1. Create a free trial Snowflake account if you don't have one.
  2. Download the COVID data from Our World in Data.
  3. Create the database inside Snowflake with two tables as described in the schemas.md file.
  4. Execute the SQL code for the desired analysis.
  5. Review the results and visualizations generated by the queries.

Feel free to customize the queries or adapt the code to meet specific analysis requirements.

Further Analysis

This dataset offers considerable scope for further analysis, for example:

  • Investigate the correlation between various socioeconomic factors and health outcomes related to COVID-19.

About

This repository explores two COVID-related datasets with SQL to address a series of questions about the pandemic.
