Dataplex PoC

Dataplex is a data fabric from Google Cloud Platform (GCP) that “unifies distributed data and automates its management and governance.”

Dataplex can function as a data catalog. It automatically extracts and updates metadata and schemas from data products, making data searchable and easily accessible through a central console. Automated data quality checks ensure the reliability of the data.

In this proof-of-concept project, we explored using Dataplex to build a PHAC data mesh and data catalog using Google Cloud Storage buckets and BigQuery. We created lakes and zones, and attached synthetic data products (assets) to examine the IAM flow and various capabilities.

Dataplex is fully managed, scalable, and operates on a global plane. This allows data products to be added to the data mesh or catalog within their domain-specific GCP projects, granting domain owners complete control over access. Consistent metadata governance is achieved through tag templates that can be used across all products.

How it works:

In the surveillance-specific GCP project:

  • Data owners store their data products in Google Cloud Storage buckets (in formats like Parquet, Avro, CSV, line-delimited JSON, or ORC) or in BigQuery tables. These formats allow for automated schema discovery. Other formats can be used, but their schemas won't be added to Dataplex.
  • Projects will come with Dataplex enabled through Infrastructure as Code (IaC), featuring one lake and two zones (one for raw data and one for curated data).
  • Data owners can attach the products they want to share to a Dataplex zone as assets. Tags can be added to provide additional metadata, such as contact information for access requests. Note: This only shares the metadata and schema; users still need to request and be granted access to view the data.
  • All users in PHAC will be granted data catalog user roles at the GCP Org folder level.
  • Users can search the Dataplex catalog from within their own project for data sources in other projects in the PHAC organization.
  • Data owners grant the Dataplex project's service account the Dataplex Service Agent role on the bucket or BigQuery table (or project-wide). This lets Dataplex read the data to extract the schema and metadata, and manage IAM on the data from within the Dataplex project (access would still be granted by the data owner for that asset/zone).
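
The layout described above (one lake, a raw zone and a curated zone, with assets attached to a zone) can be sketched as plain Python data classes. This is only an illustrative model, not Dataplex client code: the project, location, lake, zone and asset IDs are hypothetical, and the resource name pattern is the one we understand Dataplex to use for assets.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Zone:
    zone_id: str
    zone_type: str  # "RAW" or "CURATED"
    assets: List[str] = field(default_factory=list)


@dataclass
class Lake:
    project: str
    location: str
    lake_id: str
    zones: Dict[str, Zone] = field(default_factory=dict)

    def attach(self, zone_id: str, asset_id: str) -> str:
        """Record an asset under a zone and return its full resource name."""
        self.zones[zone_id].assets.append(asset_id)
        return (
            f"projects/{self.project}/locations/{self.location}"
            f"/lakes/{self.lake_id}/zones/{zone_id}/assets/{asset_id}"
        )


# Hypothetical IDs, for illustration only
lake = Lake(
    project="surveillance-project",
    location="northamerica-northeast1",
    lake_id="surveillance-lake",
    zones={
        "raw-zone": Zone("raw-zone", "RAW"),
        "curated-zone": Zone("curated-zone", "CURATED"),
    },
)
asset_name = lake.attach("curated-zone", "synthetic-cases")
print(asset_name)
```

Attaching an asset here only records it in the model; in the real project the attachment shares metadata and schema, and access to the underlying data is still granted separately by the data owner.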

How to start

Store Data Product

  1. Generate data and store it in a GCP project.
  • Store data in a Google Cloud Storage bucket or BigQuery. (Note: there are connectors to store metadata from external sources, but this is out of scope for this proof of concept.)
  • Dataplex can auto-extract schemas from CSV, ............ formatted data products. Other formats can be stored, but their schemas will not be auto-extracted.
  • The tables need to be compatible with BigQuery's format requirements if schemas are to be auto-extracted (even if stored in Google Cloud Storage, as the schema is extracted into BigQuery).
    • Curated datasets require field name characters be in the set 0-9, _, a-z, A-Z; spaces, brackets and hyphens are not accepted.
    • Date values cannot have '/'; replace it with a hyphen.
  2. Enable service APIs for Dataplex
  • BigQuery API
  • Cloud Dataplex API (note: there may be one more)
  3. Add data assets to zones
    • If adding from a bucket, the bucket location needs to be the same as the lake/zone.
    • Data discovery will sync regularly to update Dataplex with the most recent schema/metadata. You can set the frequency of this sync.
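
The field name and date rules above are easy to apply before uploading a file. The following is a minimal sketch, assuming field names are cleaned by replacing disallowed characters with underscores; the helper names and sample header are hypothetical, and the date helper only swaps the separator (it does not reorder day, month and year).

```python
import re

# Any character outside 0-9, a-z, A-Z and _
_INVALID = re.compile(r"[^0-9A-Za-z_]")


def clean_field_name(name: str) -> str:
    """Replace disallowed characters with '_' and tidy the result."""
    cleaned = _INVALID.sub("_", name.strip())
    # Collapse runs of underscores and trim them from both ends
    return re.sub(r"_+", "_", cleaned).strip("_")


def clean_date(value: str) -> str:
    """Swap '/' for '-' in a date string; the order of the parts is unchanged."""
    return value.replace("/", "-")


header = ["Case ID", "report-date", "age (years)"]
cleaned_header = [clean_field_name(h) for h in header]
print(cleaned_header)
print(clean_date("2023/01/31"))
```

Two different source names can collapse to the same cleaned name (for example "age (years)" and "age-years"), so it is worth checking the cleaned header for duplicates before loading.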

Data Assets

  1. Table Entities

  2. Ensure the data was added without issue.

  3. Add tags to data asset entities

  • Use predefined tag templates to record the data owner, contact information for requesting access to the data, and other important details such as classification, branch, etc. These can be edited at the field level.
  • Or create a new tag template, making sure to specify the required fields.
  • More than one tag can be added per field.
  • Indicate any PII.
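
To illustrate how a template with required fields keeps tags consistent, here is a small sketch that checks a tag against a template. The template and its field names are hypothetical, and this models the idea in plain Python rather than calling the tag template API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TemplateField:
    field_id: str
    required: bool = False


# Hypothetical template; the field names are illustrative only
DATA_PRODUCT_TEMPLATE = [
    TemplateField("data_owner", required=True),
    TemplateField("access_contact", required=True),
    TemplateField("classification", required=True),
    TemplateField("branch"),
    TemplateField("contains_pii"),
]


def validate_tag(tag: Dict[str, object], template: List[TemplateField]) -> List[str]:
    """Return a list of problems; an empty list means the tag fits the template."""
    known = {f.field_id for f in template}
    problems = [
        f"missing required field: {f.field_id}"
        for f in template
        if f.required and f.field_id not in tag
    ]
    problems += [f"unknown field: {key}" for key in tag if key not in known]
    return problems


tag = {"data_owner": "Surveillance team", "contains_pii": False}
problems = validate_tag(tag, DATA_PRODUCT_TEMPLATE)
print(problems)
```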

Search for data

  1. Use the faceted search to search by tag (e.g. unclassified, surveillance program) or by field.
  • View the schema and metadata to see if this is what you need.
  2. Request access to the data.
  3. Use BigQuery for analysis.

Interesting Features

Data Quality and Profiling

Data Profiling

Generates statistics and the spread of values for each field, with an option to export the results to a BigQuery table.
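
As a rough idea of the kind of per-field statistics a profile contains, the sketch below computes the null fraction, distinct count and most common value for each field in a few synthetic rows. It is not how Dataplex computes its profile; the function name and sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def profile(rows: List[Dict[str, Optional[str]]]) -> Dict[str, Dict[str, object]]:
    """Compute null fraction, distinct count and most common value per field."""
    fields = {key for row in rows for key in row}
    report = {}
    for name in sorted(fields):
        values = [row.get(name) for row in rows]
        # Treat None and empty strings as missing
        present = [v for v in values if v not in (None, "")]
        counts = Counter(present)
        report[name] = {
            "null_fraction": 1 - len(present) / len(values),
            "distinct": len(counts),
            "top_value": counts.most_common(1)[0][0] if counts else None,
        }
    return report


# Synthetic rows, for illustration only
rows = [
    {"province": "ON", "age": "34"},
    {"province": "ON", "age": ""},
    {"province": "QC", "age": "51"},
    {"province": None, "age": "34"},
]
report = profile(rows)
print(report)
```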

Can add assertions to datasets for data quality checks.

Uses the Data Quality Engine for in-place validation on BigQuery tables and structured data in GCS.

  • Enable the Dataproc API.
  • Enable Private Google Access for the network/subnetwork.
  • You can create a specification YAML file and upload it to a GCS bucket.
  • Uses data profiling (statistical analysis of the data: nulls, classes, sensitive values, etc.) to suggest rules for duplicates, missing data and outliers.
  • Supports user-defined rules and SQL.
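
The kinds of checks listed above (missing data, out-of-range values, duplicates) can be illustrated with a few user-defined rules evaluated over rows. This is a conceptual sketch only; it does not use the Dataplex rule specification format, and the rule names, column names and rows are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List

Row = Dict[str, object]
Rule = Callable[[Row], bool]

# Hypothetical rules for a synthetic case table
RULES: Dict[str, Rule] = {
    "case_id is not null": lambda r: r.get("case_id") is not None,
    # A missing age passes this rule; only present values are range-checked
    "age between 0 and 120": lambda r: r.get("age") is None or 0 <= r["age"] <= 120,
}


def run_checks(rows: List[Row], rules: Dict[str, Rule]) -> Dict[str, float]:
    """Return the fraction of rows that pass each rule."""
    return {
        name: sum(1 for r in rows if rule(r)) / len(rows)
        for name, rule in rules.items()
    }


def duplicate_keys(rows: List[Row], key: str) -> List[object]:
    """Return the key values that appear on more than one row."""
    counts = Counter(r.get(key) for r in rows)
    return [value for value, n in counts.items() if n > 1]


rows = [
    {"case_id": 1, "age": 34},
    {"case_id": 2, "age": 150},
    {"case_id": 2, "age": None},
    {"case_id": None, "age": 40},
]
results = run_checks(rows, RULES)
print(results)
print(duplicate_keys(rows, "case_id"))
```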

Tag and Tag Templates

Tag templates are used to keep metadata consistent across all data products. Tags can be public (searchable) or private (only searchable/viewable with permissions).

Example Tag Template

Tags can be applied to the entire table (applied to every column on the table) or to individual columns. In the schema view, if you click on the tag, you can view the tag values.

Data Catalog

The Dataplex Metadata Role is needed. You can search using the catalog search syntax, or use the UI to filter on various tags (e.g. data owner/contact, PII, column name, data type).
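
A search string can be assembled from a few qualifiers. The sketch below assumes the `type=`, `column:` and `tag:key:value` qualifiers as we understand them from the Data Catalog search syntax reference, with space-separated terms combined as AND; please verify these against the current documentation. The helper name and the example values are hypothetical.

```python
from typing import Dict, Optional


def build_query(
    entry_type: Optional[str] = None,
    column: Optional[str] = None,
    tags: Optional[Dict[str, str]] = None,
) -> str:
    """Join search qualifiers with spaces (assumed to mean AND)."""
    parts = []
    if entry_type:
        parts.append(f"type={entry_type}")  # assumed qualifier: exact entry type
    if column:
        parts.append(f"column:{column}")  # assumed qualifier: column name match
    for key, value in (tags or {}).items():
        # Assumed qualifier: tag key and value match.
        # Values containing spaces are not quoted or escaped here.
        parts.append(f"tag:{key}:{value}")
    return " ".join(parts)


query = build_query(
    entry_type="table",
    column="postal_code",
    tags={"data_owner": "surveillance"},
)
print(query)
```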

Nuances

  • Curated datasets require field name characters be in the set 0-9, _, a-z, A-Z - no spaces, brackets or hyphens are accepted.
  • Date values cannot have '/' - replace with hyphen.
  • Asset IDs must contain only lowercase letters, numbers, and/or hyphens
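
The asset ID rule can be checked up front. This sketch only enforces the character set stated above (lowercase letters, numbers and hyphens); Dataplex may enforce further constraints, such as length, that are not checked here. The helper names are hypothetical.

```python
import re

# Lowercase letters, digits and hyphens only
_ASSET_ID = re.compile(r"[a-z0-9-]+")


def is_valid_asset_id(asset_id: str) -> bool:
    """True if the ID uses only lowercase letters, numbers and hyphens."""
    return _ASSET_ID.fullmatch(asset_id) is not None


def to_asset_id(name: str) -> str:
    """Lowercase a name and replace runs of other characters with one hyphen."""
    lowered = name.strip().lower()
    return re.sub(r"[^a-z0-9]+", "-", lowered).strip("-")


print(is_valid_asset_id("Synthetic_Cases"))
print(to_asset_id("Synthetic Cases_2023"))
```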

Working Resources