This repository was archived by the owner on May 20, 2023. It is now read-only.

Data Engineering Class

Data Engineering on Google Cloud - 5 Days

Module 1: Introduction to Data Engineering
Module 2: Building a Data Lake
Module 3: Building a Data Warehouse
Module 4: Introduction to Building Batch Data Pipelines
Module 5: Executing Spark on Cloud Dataproc
Module 6: Serverless Data Processing with Cloud Dataflow
Module 7: Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
Module 8: Introduction to Processing Streaming Data
Module 9: Serverless Messaging with Cloud Pub/Sub
Module 10: Cloud Dataflow Streaming Features
Module 11: High-Throughput BigQuery and Bigtable Streaming Features
Module 12: Advanced BigQuery Functionality and Performance

Links

Tutorials and demos

Admin Stuff

  • Class hours: 10-6 EDT, with the last hour reserved for Q&A discussion
  • Lunch 12-1
  • Two 15-minute breaks
  • Hands on
  • Materials - github repo

Introductions

  • Name you prefer
  • Area of expertise.
  • Background: data engineering, data science, Google Cloud and other clouds
  • What do you want the course to do for you?

Data Implementation Models

  • Hierarchical model
  • Network model
  • Relational Model
    • Data model - schema
    • Transactional process CRUD
    • Protect from corruption
    • OLTP - online transaction processing
    • Most queries tend to be the same
    • Row oriented data
  • Data warehouse
    • OLAP - online analytic processing
    • Dimensional model
    • Read only queries
    • Schema required
  • NoSQL databases
    • Not only SQL
    • Key-value store
    • Document store
      • Semi-structured data, e.g. email
    • Columnar formats
      • ORC, Parquet (Avro is often listed alongside them but is row-oriented)
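The difference between the row-oriented layout listed under the relational model and the columnar formats listed under NoSQL can be sketched in a few lines of Python. This is only an illustration: the `rows` records, their field names, and the helpers `to_columns` and `get_order` are hypothetical and do not come from the class materials.

```python
# Hypothetical sample records; the field names are illustrative only.
rows = [
    {"order_id": 1, "customer": "ana", "amount": 20.0},
    {"order_id": 2, "customer": "bo", "amount": 35.5},
    {"order_id": 3, "customer": "ana", "amount": 12.5},
]


def to_columns(records):
    """Pivot row-oriented records into a column-oriented dict of lists."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def get_order(records, order_id):
    """OLTP-style point lookup: return one whole row by its key."""
    for record in records:
        if record["order_id"] == order_id:
            return record
    return None


columns = to_columns(rows)

# OLAP-style aggregate: only the "amount" column has to be scanned.
total_amount = sum(columns["amount"])

print(get_order(rows, 2))
print(total_amount)
```

Row storage keeps every field of a record together, which suits transactional reads and writes of single records. Column storage keeps every value of a field together, which suits read-only analytic queries that touch a few columns across many records.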

Data Modeling

- What does the data look like, what does it mean, and how is it organized?
- https://www.dama.org/cpages/home

Data Prep

- Data management - data as a corporate asset
- Data compliance -

Data use

- Data Analyst - What can we learn from the data?
- Business Intelligence - How can we tune our business?
- Data Mining / ML - cluster analysis
- MLOps 

Cloud 101

  • DevOps, Infrastructure as code
  • Google Cloud
    • Colossus - internal storage
    • Jupiter - petabit network

Class Project

  • Global Warming

Data Quality

  • Fnu Katheswaranath
  • Joashua Rodson
  • Amber Roddottir


NoSQL

  1. Key value store
  2. Document database - semi-structured
    • MongoDB
  3. Columnar Storage
    • Sparse data set
index column  [ca: [....], cz: [.....]]
  • Data Drift
  • Concept drift
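As a rough illustration of data drift, the sketch below flags a feature whose mean has moved away from a reference sample. It assumes the common reading of the term (the input distribution changes over time). The function names, the threshold of 2.0, and the temperature readings are all hypothetical. Concept drift, where the relationship between inputs and the target changes, would need labelled outcomes and is not covered here.

```python
import statistics


def mean_shift(reference, current):
    """Distance between the two sample means, in reference standard deviations."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    cur_mean = statistics.mean(current)
    if ref_std == 0:
        # Degenerate reference sample: any difference counts as an unbounded shift.
        return 0.0 if cur_mean == ref_mean else float("inf")
    return abs(cur_mean - ref_mean) / ref_std


def has_data_drift(reference, current, threshold=2.0):
    """Flag drift when the mean shift exceeds the (assumed) threshold."""
    return mean_shift(reference, current) > threshold


# Hypothetical temperature readings, made up for illustration.
reference_temps = [20.1, 19.8, 20.4, 20.0, 19.9, 20.2]
similar_temps = [20.0, 20.3, 19.7, 20.1, 20.2, 19.9]
shifted_temps = [23.9, 24.2, 24.0, 23.8, 24.1, 24.3]

print(has_data_drift(reference_temps, similar_temps))
print(has_data_drift(reference_temps, shifted_temps))
```

A mean comparison is the simplest possible check. Production monitoring would more likely compare whole distributions, and the right threshold depends on the data.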

  • Nature based - emulating natural behaviours, emulating natural cognition
  • Rational AI - developing ML models that are useful - rational actors

  • Dataproc
    • Specify parameters for an HDFS cluster

Resources

Public Examples Repository

Coursera

Practice exam

References

The following are all the references pulled from the class materials.

Module 1-01 - Data Engineering

Module 1-02 - Building a Data Lake

Module 1-03 Building a data warehouse

Module 2-01 Introduction to Building Batch Data Pipelines

Module 2-02 Batch Processing of Data with Spark and Hadoop on GCP_M2 - Executing Spark on Cloud Dataproc

Module 2-03 Serverless Data Processing with Cloud Dataflow

Module 2-04 Manage Data Pipelines with Cloud Data Fusion and Cloud Composer

Module 3-01 Introduction to Processing Streaming Data

  • None

Module 3-02 Serverless Messaging with Cloud Pub/Sub

  • None

Module 3-03 Cloud Dataflow Streaming Features

Module 3-04 High-Throughput BigQuery and Bigtable Streaming Features

Module 3-05 Advanced BigQuery Functionality and Performance

Module 4-01 Introduction to Analytics and AI

Module 4-02 Prebuilt ML model APIs for Unstructured Data

Module 4-03 Big Data Analytics with Cloud AI Platform Notebooks

  • None

Module 4-04 Production ML Pipelines with Kubeflow

  • None

Module 4-05 Custom Model building with SQL in BigQuery ML

Module 4-06 Custom Model building with Cloud AutoML