Data Engineering Class

Data Engineering on Google Cloud - 5 Days

Module 1: Introduction to Data Engineering
Module 2: Building a Data Lake
Module 3: Building a Data Warehouse
Module 4: Introduction to Building Batch Data Pipelines
Module 5: Executing Spark on Cloud Dataproc
Module 6: Serverless Data Processing with Cloud Dataflow
Module 7: Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
Module 8: Introduction to Processing Streaming Data
Module 9: Serverless Messaging with Cloud Pub/Sub
Module 10: Cloud Dataflow Streaming Features
Module 11: High-Throughput BigQuery and Bigtable Streaming Features
Module 12: Advanced BigQuery Functionality and Performance

Admin Stuff

Class hours 10-6 EDT; last hour is Q&A and discussion
Lunch 12-1
Two 15-minute breaks
Hands on
Materials - github repo

Introductions

Name you prefer
Area of expertise.
Background: data engineering, data science, Google and other clouds
What do you want the course to do for you?

Data Implementation Models

Hierarchical model
Network model
Relational Model
- Data model - schema
- Transactional process CRUD
- Protect from corruption
- OLTP - online transaction processing
- Most queries tend to be the same
- Row oriented data
Data warehouse
- OLAP - online analytic processing
- Dimensional model
- Read only queries
- Schema required
NoSQL databases
- Not only SQL
- Key-value store
- Document store
  - Semi-structured data, e.g. email
- Columnar formats
  - ORC, Parquet (Avro is often listed alongside these but is row-oriented)
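
A minimal sketch of the row-oriented versus column-oriented distinction noted above. The records and field names are hypothetical, and plain Python lists and dicts stand in for real storage engines.

```python
# Hypothetical sample records; the field names are illustrative only.
rows = [
    {"id": 1, "region": "ca", "amount": 10.0},
    {"id": 2, "region": "cz", "amount": 4.5},
    {"id": 3, "region": "ca", "amount": 7.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every record has the same fields, so the lists stay aligned.
    """
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# OLTP-style access: fetch one whole record by key (row layout suits this).
record_2 = next(r for r in rows if r["id"] == 2)

# OLAP-style access: aggregate one field across all records
# (column layout only needs to read the "amount" list).
total_amount = sum(columns["amount"])
```

The row layout keeps each record together, which fits transactional reads and writes. The column layout keeps each field together, which fits read-only analytic scans.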

Data Modeling

- What does the data look like, what does it mean, how is it organized
- https://www.dama.org/cpages/home

Data Prep

- Data management - data as a corporate asset
- Data compliance -

Data use

- Data Analyst - What can we learn from the data
- Business intelligence - How can we tune our business
- Data Mining / ML - cluster analysis
- MLOps

Cloud 101

DevOps, Infrastructure as code
Google Cloud
- Colossus - internal storage
- Jupiter - petabit network

Class Project

Global Warming

Data Quality

Fnu Katheswaranath
Joashua Rodson
Amber Roddottir

NoSQL

Key value store
Document database - semi-structured
- MongoDB
Columnar Storage
- Sparse data sets

index column [ca: [....], cz: [.....]]
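
One possible reading of the index note above, as a rough sketch: an index on a column maps each value to the positions of the rows that hold it, and rows without the field take no space in the index. The records and the `build_column_index` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical sparse records: most fields are missing from most rows.
records = [
    {"id": 0, "country": "ca"},
    {"id": 1, "country": "cz", "phone": "555-0100"},
    {"id": 2},
    {"id": 3, "country": "ca"},
]


def build_column_index(rows, column):
    """Map each value of `column` to the list of row positions holding it.

    Rows that lack the column are skipped, so sparse data costs nothing here.
    """
    index = defaultdict(list)
    for position, row in enumerate(rows):
        if column in row:
            index[row[column]].append(position)
    return dict(index)


country_index = build_column_index(records, "country")
# country_index == {"ca": [0, 3], "cz": [1]}
```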

Data Drift
Concept drift
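
Data drift is a change in the distribution of the inputs; concept drift is a change in the relationship between inputs and the target. A minimal sketch of one simple data drift check follows, assuming a single numeric feature. The function names, sample values, and the threshold of 3 standard deviations are illustrative assumptions, not a method from the class materials.

```python
from statistics import mean, stdev


def mean_shift(baseline, recent):
    """Distance of the recent mean from the baseline mean,
    measured in baseline standard deviations."""
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variation")
    return abs(mean(recent) - mean(baseline)) / spread


def has_drifted(baseline, recent, threshold=3.0):
    """Flag drift when the mean shift exceeds the (assumed) threshold."""
    return mean_shift(baseline, recent) > threshold


# Hypothetical feature values from training time and from two later batches.
baseline = [10.0, 11.0, 9.0, 10.0, 10.0]
stable_batch = [10.0, 10.5, 9.5]
shifted_batch = [15.0, 16.0, 14.0]

stable_flag = has_drifted(baseline, stable_batch)    # False
shifted_flag = has_drifted(baseline, shifted_batch)  # True
```

A check like this only looks at inputs, so it cannot detect concept drift; that needs labeled outcomes to compare model error over time.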

Nature based - emulating natural behaviours - emulate natural cognition
Rational AI - developing ML models that are useful - Rational actors

Dataproc
- specify parameters for an HDFS cluster

Resources

Public Examples Repository

https://github.com/GoogleCloudPlatform/training-data-analyst
There are other repos in github.com/GoogleCloudPlatform that you might want to explore as well

Coursera

Google offers their certification training through Coursera as a series of courses. Info on that is here.
https://www.coursera.org/specializations/gcp-data-machine-learning

Practice exam

Free example exam questions for the certification are at:
https://gcp-examquestions.com/course/google-professional-cloud-data-engineer-practice-exam/
https://www.vmexam.com/google/google-gcp-pde-certification-exam-sample-questions

References

The following are all the references pulled from the class materials

Module 1-01 - Data Engineering

Module 1-02 - Building a Data Lake

Module 1-03 Building a data warehouse

Module 2-01 Introduction to Building Batch Data Pipelines

Demo: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/data-engineering/demos/simple_healthcheck.md

Module 2-02 Batch Processing of Data with Spark and Hadoop on GCP_M2 - Executing Spark on Cloud Dataproc

Module 2-03 Serverless Data Processing with Cloud Dataflow.

Module 2-04 Manage Data Pipelines with Cloud Data Fusion and Cloud Composer

Module 3-01 Introduction to Processing Streaming Data

None

Module 3-02 Serverless Messaging with Cloud Pub/Sub

None

Module 3-03 Cloud Dataflow Streaming Features

https://beam.apache.org/documentation/programming-guide/#windowing-basics
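
The windowing basics linked above can be illustrated with a small sketch of fixed (tumbling) windows. This is plain Python, not the Apache Beam API; the helper names and the events are hypothetical, and timestamps are assumed to be whole seconds.

```python
from collections import defaultdict


def fixed_window_start(timestamp, size):
    """Return the start of the fixed window of `size` seconds containing `timestamp`."""
    return timestamp - (timestamp % size)


def count_per_window(events, size):
    """Count (timestamp, value) events per fixed window, keyed by window start."""
    counts = defaultdict(int)
    for timestamp, _value in events:
        counts[fixed_window_start(timestamp, size)] += 1
    return dict(sorted(counts.items()))


# Hypothetical events as (event time in seconds, payload).
events = [(0, "a"), (12, "b"), (59, "c"), (60, "d"), (125, "e")]

window_counts = count_per_window(events, 60)
# window_counts == {0: 3, 60: 1, 120: 1}
```

Each event lands in exactly one window based on its own timestamp. Sliding and session windows, triggers, and late data handling are outside this sketch.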

Module 3-04 High-Throughput BigQuery and Bigtable Streaming Features

Module 3-05 Advanced BigQuery Functionality and Performance

Module 4-01 Introduction to Analytics and AI

Module 4-02 Prebuilt ML model APIs for Unstructured Data

Module 4-03 Big Data Analytics with Cloud AI Platform Notebooks

None

Module 4-04 Production ML Pipelines with Kubeflow

None

Module 4-05 Custom Model building with SQL in BigQuery ML

Demo: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/data-engineering/demos/predict_taxi_bigqueryml.md

Module 4-06 Custom Model building with Cloud AutoML

https://cloud.google.com/natural-language/automl/docs/evaluate
