This repository contains a Terraform-based deployment bundle designed to showcase various Google Cloud Data Loss Prevention (DLP) use cases. It automates the provisioning of infrastructure to demonstrate DLP capabilities such as API inspection, automated storage classification, BigQuery masking/tokenization, and PDF redaction.
⚠️ Disclaimer: This code is intended for demonstration purposes only and is not meant for production workloads.
The Terraform script deploys a folder containing five distinct Google Cloud projects, each demonstrating a specific DLP capability:
| Project Name | Description | Key Tech Stack |
|---|---|---|
| 1. DLP API Calls | Deploys a Node.js app on a Compute Engine instance to demonstrate direct DLP API calls for string/file inspection and de-identification. | Compute Engine, Node.js, DLP API |
| 2. DLP Auto GCS Classification | Automates data classification. Files uploaded to a "Quarantine" bucket are scanned; sensitive files go to a secure bucket, non-sensitive to another. | Cloud Functions, Pub/Sub, GCS, DLP API |
| 3. DLP BigQuery UDF | Demonstrates Remote Functions (UDF) in BigQuery to de-identify, mask, and re-identify data dynamically using SQL queries. | BigQuery, Cloud Functions, KMS |
| 4. DLP BQ Findings Export | Scans a BigQuery dataset and exports findings to Security Command Center (SCC) and Dataplex. (Module disabled by default). | BigQuery, Eventarc, SCC, Dataplex |
| 5. DLP PDF Redaction | A serverless pipeline that automatically redacts sensitive information (PII) from uploaded PDF files. | Workflows, Cloud Run, Cloud Functions, GCS |
Each module deploys its own isolated environment. Below is a high-level summary of the workflows:
- API Inspection: Users SSH into a VM to run scripts that send payload data to the DLP API.
- GCS Classification: Upload -> Quarantine Bucket -> Trigger Function -> DLP Scan -> Move to Target Bucket.
- BigQuery UDF: SQL Query -> Remote Function -> DLP Processor -> Return Masked/Tokenized Data.
- PDF Redaction: Upload PDF -> Trigger Workflow -> Split Pages -> Redact Images (DLP) -> Merge PDF -> Save Output.
Before deploying, ensure you have the following:
- Google Cloud Project/Organization: Access to a Google Cloud Organization (or a demo environment like Argolis).
- IAM Roles: You must have the following roles assigned to your user:
  - Billing Account User
  - Folder Creator
  - Organization Role Viewer
  - Project Creator
  - Billing User
- Tools:
  - Google Cloud SDK (gcloud)
  - Terraform (or use Cloud Shell)
Open Cloud Shell or your terminal and clone this repository:
git clone https://github.com/mgaur10/dlp-demo-bundle.git
cd dlp-demo-bundle
Navigate to the bundle folder and edit the terraform.tfvars file. You must update the following values to match your environment:
organization_id = "YOUR_ORG_ID"
billing_account = "YOUR_BILLING_ACCOUNT_ID"
Run the following commands to provision the resources.
terraform init
terraform plan
terraform apply
Upon successful completion, Terraform will display a list of Outputs (green text). Copy these outputs; they contain the project IDs, bucket names, and commands you will need for the demos.
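If you would rather keep the outputs in a file than copy them from the terminal, `terraform output -json` prints them as JSON. The Python sketch below shows one way to flatten that JSON into name/value pairs. The output names and values in `SAMPLE` are hypothetical placeholders, not the bundle's actual outputs.

```python
import json


def load_outputs(raw_json: str) -> dict:
    """Flatten `terraform output -json` text into {name: value}."""
    parsed = json.loads(raw_json)
    # Each entry has the shape {"sensitive": ..., "type": ..., "value": ...}.
    return {name: entry["value"] for name, entry in parsed.items()}


# Hypothetical sample; real names and values come from your own deployment.
SAMPLE = """
{
  "example_project_id": {"sensitive": false, "type": "string", "value": "dlp-demo-example"},
  "example_bucket": {"sensitive": false, "type": "string", "value": "gs://example-bucket"}
}
"""

if __name__ == "__main__":
    for name, value in load_outputs(SAMPLE).items():
        print(f"{name} = {value}")
```

To use it with a real deployment, run `terraform output -json > outputs.json` in the bundle folder and pass the file's contents to `load_outputs`.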
Goal: Inspect and redact strings/files via command line.
Go to Compute Engine in the DLP API Calls project.
SSH into dlp-demo-server using the command provided in the Terraform output (_module_dlp_api_02_iap_ssh_tunnel...).
Run the sample scripts (found in Terraform outputs):
Inspect Text: node /tmp/nodejs-dlp/samples/inspectString.js ...
Inspect File: node /tmp/nodejs-dlp/samples/inspectFile.js ...
Masking: node /tmp/nodejs-dlp/samples/deidentifyWithMask.js ...
Redaction: node /tmp/nodejs-dlp/samples/redactText.js ...
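The sample scripts wrap DLP API calls such as content inspection. As a rough illustration of what an inspection request carries (this is not the scripts' actual code), the Python sketch below builds a JSON body for the DLP `content:inspect` REST method. The infoTypes, likelihood threshold, and sample string are assumptions chosen for the example.

```python
import json


def build_inspect_request(text, info_types, min_likelihood="POSSIBLE"):
    """Build a request body for the DLP `content:inspect` REST method."""
    return {
        "item": {"value": text},
        "inspectConfig": {
            # Each infoType is referenced by name, e.g. PHONE_NUMBER.
            "infoTypes": [{"name": name} for name in info_types],
            "minLikelihood": min_likelihood,
            # Ask the API to return the matched text with each finding.
            "includeQuote": True,
        },
    }


if __name__ == "__main__":
    # Hypothetical input string and infoTypes, for illustration only.
    body = build_inspect_request(
        "My phone number is (555) 253-0000",
        ["PHONE_NUMBER", "EMAIL_ADDRESS"],
    )
    print(json.dumps(body, indent=2))
```

The sketch only constructs the payload; sending it requires an authenticated call to the DLP API, which the Node.js samples on the VM already handle.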
Goal: Upload files and watch them get sorted based on sensitivity.
Locate the Quarantine Bucket (dlp-demo-qa-xxxx) from the outputs.
Upload the sample data using the gsutil command from the Terraform outputs. The bucket suffix (touk in the example below) will differ in your deployment:
gsutil -m cp sample_data/*sample* gs://dlp-demo-qa-touk
Check the Sensitive Bucket (dlp-demo-sens-xxxx) and Non-Sensitive Bucket (dlp-demo-nonsens-xxxx). The files will automatically move to the correct bucket based on their content.
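The sorting step can be pictured as a single decision made after the DLP scan. The Python sketch below is a simplified stand-in for that decision, not the deployed Cloud Function's code. The rule that one or more findings marks a file as sensitive is an assumption, and the bucket names are placeholders.

```python
def choose_target_bucket(
    finding_count: int,
    sensitive_bucket: str,
    nonsensitive_bucket: str,
    min_findings: int = 1,  # assumed threshold; the real function may differ
) -> str:
    """Return the bucket a scanned file should be moved to."""
    if finding_count < 0:
        raise ValueError("finding_count must not be negative")
    if finding_count >= min_findings:
        return sensitive_bucket
    return nonsensitive_bucket


if __name__ == "__main__":
    # Placeholder bucket names following the pattern used in this README.
    for count in (0, 3):
        target = choose_target_bucket(
            count, "dlp-demo-sens-xxxx", "dlp-demo-nonsens-xxxx"
        )
        print(f"{count} finding(s) -> {target}")
```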
Goal: Use SQL to mask and re-identify PII.
Navigate to BigQuery in the DLP BigQuery UDF project.
Open the clear-data table to see the raw data.
De-identify (Tokenize): Run the query provided in the output (_module_dlp_bigquery_udf_02...). It will replace SSNs with tokens.
Save the result as a new table named udf-deid.
Re-identify (Decrypt): Run the query from output _module_dlp_bigquery_udf_03... against the udf-deid table to retrieve the original SSNs.
Masking: Run the query from output _module_dlp_bigquery_udf_01... to mask Credit Card numbers with asterisks.
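To make the masking result concrete, the Python sketch below reproduces character masking locally, without calling DLP or BigQuery. It is an illustration of the effect only; the choice to leave hyphens unmasked and the sample card number are assumptions.

```python
def mask_value(value: str, masking_char: str = "*", ignore: str = "-") -> str:
    """Replace every character except those in `ignore` with `masking_char`."""
    return "".join(ch if ch in ignore else masking_char for ch in value)


if __name__ == "__main__":
    # Well-known test card number, used here purely as sample input.
    print(mask_value("4111-1111-1111-1111"))  # prints ****-****-****-****
```

Unlike the tokenization steps above, masking of this kind cannot be reversed, because the original characters are discarded.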
Note: Ensure this module is enabled in base.tf if you wish to use it.
This project scans a BQ dataset upon job completion.
View findings in Security Command Center under the "Data Loss Prevention" source.
View data lineage and tag counts in Dataplex.
Goal: Upload a PDF resume and get a redacted copy.
Locate the Input Bucket (pdf-input-bucket-xxxx) in the DLP PDF Redaction project.
Upload a sample PDF (e.g., test_file.pdf) using the provided gsutil command.
Wait for the Cloud Workflow to finish.
Check the Output Bucket (pdf-output-bucket-xxxx) for test_file-redacted.pdf.
Open the redacted PDF to see PII (such as names and phone numbers) blacked out.
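The pipeline follows a split, redact, merge pattern. The Python sketch below shows that control flow with toy stand-ins: plain strings instead of PDF pages, a form feed as the page separator, and a regex in place of the DLP image redaction. All helper names are hypothetical, and the naming helper only mirrors the `test_file-redacted.pdf` example above.

```python
import re
from pathlib import PurePosixPath
from typing import Callable, List


def redacted_name(input_name: str) -> str:
    """Derive an output name such as test_file-redacted.pdf."""
    path = PurePosixPath(input_name)
    return f"{path.stem}-redacted{path.suffix}"


def redact_document(
    document: str,
    split: Callable[[str], List[str]],
    redact_page: Callable[[str], str],
    merge: Callable[[List[str]], str],
) -> str:
    """Split a document into pages, redact each page, and merge the result."""
    pages = split(document)
    return merge([redact_page(page) for page in pages])


# Toy stand-ins (assumptions): the real pipeline works on PDF pages and images.
def toy_split(document: str) -> List[str]:
    return document.split("\f")


def toy_redact(page: str) -> str:
    return re.sub(r"\d", "#", page)  # hide every digit


def toy_merge(pages: List[str]) -> str:
    return "\f".join(pages)


if __name__ == "__main__":
    result = redact_document(
        "Call 555-0100\fNo digits here", toy_split, toy_redact, toy_merge
    )
    print(redacted_name("test_file.pdf"))
    print(result.split("\f"))
```

Passing the three stages in as functions keeps the control flow separate from the actual PDF and DLP handling, which in the deployed project is spread across Workflows, Cloud Run, and Cloud Functions.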
🧹 Clean Up
To avoid incurring ongoing charges, destroy the resources once you are done with the demo:
terraform destroy