33 changes: 33 additions & 0 deletions google_bigquery_syndicated_dataset/.terraform-docs.yml
@@ -0,0 +1,33 @@
content: |-
  {{ .Header }}

  ## Example

  ```hcl
  module "treeherder" {
    source = "github.com/mozilla/terraform-modules//google_bigquery_syndicated_dataset?ref=main"

    dataset_id            = "for_treeherder_1"
    syndicated_dataset_id = "treeherder_db"
    realm                 = var.realm

    access = [
      { role = "OWNER", special_group = "projectOwners" },
      # projectReaders/projectWriters usage is discouraged, see DSRE-1497
      { role = "READER", special_group = "projectReaders" },
      { role = "WRITER", special_group = "projectWriters" },
    ]
  }
  ```

  {{ .Requirements }}

  {{ .Providers }}

  {{ .Modules }}

  {{ .Resources }}

  {{ .Inputs }}

  {{ .Outputs }}
109 changes: 109 additions & 0 deletions google_bigquery_syndicated_dataset/README.md
@@ -0,0 +1,109 @@
<!-- BEGIN_TF_DOCS -->
# google\_bigquery\_syndicated\_dataset

Creates a BigQuery dataset configured for syndication to Mozilla Data Platform
infrastructure (mozdata and data-shared projects). This module is meant to
simplify the steps in [Importing Data from OLTP Databases to BigQuery via Federated Queries](https://mozilla-hub.atlassian.net/wiki/spaces/IP/pages/473727279/Importing+Data+from+OLTP+Databases+to+BigQuery+via+Federated+Queries).

This module abstracts away the syndication boilerplate:
- Resolves syndication service accounts via workgroup
- Looks up the org custom role for syndication
- Auto-discovers whether syndicated datasets exist in data platform projects
- Adds dataset authorizations only when targets exist

## Target Inference

The `syndicated_dataset_id` (defaults to `dataset_id`) determines targets:
- Does NOT end in `_syndicate` → user-facing → both mozdata and data-shared
- Ends in `_syndicate` → data-shared only
- Eventually the syndication datasets themselves will be inferred from bqetl metadata available to all MozCloud tenant infrastructure
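
The inference rules above can be illustrated with two module calls. This is a minimal sketch: the module labels and dataset IDs are placeholders invented for illustration, and only the inputs documented in this README are used.

```hcl
# Hypothetical example: module labels and dataset IDs are placeholders.

# syndicated_dataset_id is omitted, so it defaults to dataset_id.
# "example_derived" does not end in "_syndicate", so the dataset is treated
# as user-facing and both mozdata and data-shared are targeted.
module "example_user_facing" {
  source = "github.com/mozilla/terraform-modules//google_bigquery_syndicated_dataset?ref=main"

  dataset_id = "example_derived"
  realm      = var.realm
}

# syndicated_dataset_id ends in "_syndicate", so only data-shared is targeted.
module "example_data_shared_only" {
  source = "github.com/mozilla/terraform-modules//google_bigquery_syndicated_dataset?ref=main"

  dataset_id            = "example_source"
  syndicated_dataset_id = "example_syndicate"
  realm                 = var.realm
}
```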

## State propagation

While this module reduces the number of PRs required to set up syndication, it does not automatically
propagate those changes. You still need to follow the steps at
https://mozilla-hub.atlassian.net/wiki/spaces/SRE/pages/27924945/Atlantis+-+Terraform+Automation#Invoking-Atlantis-without-terraform-changes
to authorize datasets on the tenant infra side. Eventually, policy-as-code and drift
detection automation will make these manual steps unnecessary.
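
One way to see whether the authorization has taken effect after re-running Atlantis is to surface the module's `syndication_targets_active` output in the tenant configuration. The sketch below assumes a module instance named `treeherder`, as in the Example section; the output name is a placeholder.

```hcl
# Hypothetical output name; module.treeherder refers to the Example section.
output "treeherder_syndication_targets_active" {
  description = "Syndication targets mapped to whether authorized dataset access is active."
  value       = module.treeherder.syndication_targets_active
}
```

If a target is not yet reported as active, the syndicated dataset was likely not present in the target project's state at plan time, and the plan needs to be re-run once it exists.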

## Example

```hcl
module "treeherder" {
  source = "github.com/mozilla/terraform-modules//google_bigquery_syndicated_dataset?ref=main"

  dataset_id            = "for_treeherder_1"
  syndicated_dataset_id = "treeherder_db"
  realm                 = var.realm

  access = [
    { role = "OWNER", special_group = "projectOwners" },
    # projectReaders/projectWriters usage is discouraged, see DSRE-1497
    { role = "READER", special_group = "projectReaders" },
    { role = "WRITER", special_group = "projectWriters" },
  ]
}
```
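
The module can also manage syndication access on a dataset that is created elsewhere. The following is a sketch under stated assumptions: the module label and dataset ID are placeholders, and `"prod"` is assumed to be a valid realm value. With `create_dataset = false`, the module adds non-authoritative `google_bigquery_dataset_access` entries instead of creating the dataset.

```hcl
# Hypothetical example: the module label and dataset ID are placeholders.
module "existing_dataset_syndication" {
  source = "github.com/mozilla/terraform-modules//google_bigquery_syndicated_dataset?ref=main"

  dataset_id     = "existing_dataset"
  create_dataset = false # only manage syndication access; do not create the dataset

  realm        = var.realm
  target_realm = "prod" # override, e.g. a nonprod source syndicating to prod targets
}
```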

## Requirements

| Name | Version |
|------|---------|
| <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 1.3 |
| <a name="requirement_google"></a> [google](#requirement\_google) | >= 4.0 |

## Providers

| Name | Version |
|------|---------|
| <a name="provider_google"></a> [google](#provider\_google) | >= 4.0 |
| <a name="provider_terraform"></a> [terraform](#provider\_terraform) | n/a |

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_syndication_workgroup"></a> [syndication\_workgroup](#module\_syndication\_workgroup) | github.com/mozilla/terraform-modules//mozilla_workgroup | main |

## Resources

| Name | Type |
|------|------|
| [google_bigquery_dataset.dataset](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset) | resource |
| [google_bigquery_dataset_access.syndicated_authorization](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset_access) | resource |
| [google_bigquery_dataset_access.syndication_role](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset_access) | resource |
| [terraform_remote_state.org](https://registry.terraform.io/providers/hashicorp/terraform/latest/docs/data-sources/remote_state) | data source |
| [terraform_remote_state.syndication_target](https://registry.terraform.io/providers/hashicorp/terraform/latest/docs/data-sources/remote_state) | data source |

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_access"></a> [access](#input\_access) | Application-specific access blocks for this dataset. projectOwners OWNER access is included by default unless disable\_project\_owners\_access is set. | <pre>set(object({<br/> role = optional(string)<br/> user_by_email = optional(string)<br/> group_by_email = optional(string)<br/> special_group = optional(string)<br/> domain = optional(string)<br/> iam_member = optional(string)<br/> dataset = optional(object({<br/> dataset = object({<br/> project_id = string<br/> dataset_id = string<br/> })<br/> target_types = list(string)<br/> }))<br/> view = optional(object({<br/> project_id = string<br/> dataset_id = string<br/> table_id = string<br/> }))<br/> }))</pre> | `[]` | no |
| <a name="input_create_dataset"></a> [create\_dataset](#input\_create\_dataset) | Whether to create the BigQuery dataset. Set to false to only manage syndication access on an existing dataset. | `bool` | `true` | no |
| <a name="input_dataset_id"></a> [dataset\_id](#input\_dataset\_id) | A unique ID for this dataset, without the project name. | `string` | n/a | yes |
| <a name="input_default_partition_expiration_ms"></a> [default\_partition\_expiration\_ms](#input\_default\_partition\_expiration\_ms) | The default partition expiration for all partitioned tables, in milliseconds. | `number` | `null` | no |
| <a name="input_default_table_expiration_ms"></a> [default\_table\_expiration\_ms](#input\_default\_table\_expiration\_ms) | The default lifetime of all tables in the dataset, in milliseconds. | `number` | `null` | no |
| <a name="input_delete_contents_on_destroy"></a> [delete\_contents\_on\_destroy](#input\_delete\_contents\_on\_destroy) | If true, delete all tables in the dataset when destroying the resource. | `bool` | `false` | no |
| <a name="input_description"></a> [description](#input\_description) | A user-friendly description of the dataset. | `string` | `null` | no |
| <a name="input_disable_project_owners_access"></a> [disable\_project\_owners\_access](#input\_disable\_project\_owners\_access) | Disable the implied projectOwners OWNER access on this dataset. This should almost never be set. | `bool` | `false` | no |
| <a name="input_friendly_name"></a> [friendly\_name](#input\_friendly\_name) | A descriptive name for the dataset. | `string` | `null` | no |
| <a name="input_labels"></a> [labels](#input\_labels) | Labels to apply to the dataset. | `map(string)` | `{}` | no |
| <a name="input_location"></a> [location](#input\_location) | The geographic location where the dataset should reside. | `string` | `"US"` | no |
| <a name="input_max_time_travel_hours"></a> [max\_time\_travel\_hours](#input\_max\_time\_travel\_hours) | Defines the time travel window in hours. | `number` | `null` | no |
| <a name="input_realm"></a> [realm](#input\_realm) | Source infrastructure realm. | `string` | n/a | yes |
| <a name="input_syndicated_dataset_id"></a> [syndicated\_dataset\_id](#input\_syndicated\_dataset\_id) | Name of the dataset in target projects. Defaults to dataset\_id. If name ends in '\_syndicate', only data-shared is targeted (no mozdata). | `string` | `null` | no |
| <a name="input_syndication_workgroup_ids"></a> [syndication\_workgroup\_ids](#input\_syndication\_workgroup\_ids) | Workgroup identifiers for service accounts that perform syndication. | `list(string)` | <pre>[<br/> "workgroup:dataplatform/jenkins"<br/>]</pre> | no |
| <a name="input_target_realm"></a> [target\_realm](#input\_target\_realm) | Target realm for syndication. Defaults to realm. Set this to override the target, e.g. for a nonprod source syndicating to prod targets. | `string` | `null` | no |
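
The `access` input accepts role-based entries, authorized datasets, and authorized views in the same set. The sketch below shows one entry of each shape as it would appear inside a module call; every project, dataset, table, and group name is a placeholder, not a real resource.

```hcl
# Hypothetical values throughout; only the attribute names come from the
# access type documented in the Inputs table.
access = [
  # Role-based entry
  { role = "READER", group_by_email = "example-readers@example.com" },

  # Authorized view
  {
    view = {
      project_id = "example-project"
      dataset_id = "example_views"
      table_id   = "example_view"
    }
  },

  # Authorized dataset (non-syndication)
  {
    dataset = {
      dataset = {
        project_id = "example-project"
        dataset_id = "example_derived"
      }
      target_types = ["VIEWS"]
    }
  },
]
```

Syndication service account access and syndication authorizations do not need to be listed here; the module adds them itself.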

## Outputs

| Name | Description |
|------|-------------|
| <a name="output_dataset_id"></a> [dataset\_id](#output\_dataset\_id) | The dataset ID. |
| <a name="output_id"></a> [id](#output\_id) | The fully-qualified dataset ID (projects/PROJECT/datasets/DATASET). |
| <a name="output_self_link"></a> [self\_link](#output\_self\_link) | The URI of the created resource. |
| <a name="output_syndication_role_id"></a> [syndication\_role\_id](#output\_syndication\_role\_id) | The custom role ID used for syndication access. |
| <a name="output_syndication_service_accounts"></a> [syndication\_service\_accounts](#output\_syndication\_service\_accounts) | The service account emails used for syndication. |
| <a name="output_syndication_targets_active"></a> [syndication\_targets\_active](#output\_syndication\_targets\_active) | Map of syndication target names to whether authorized dataset access is active. |
<!-- END_TF_DOCS -->
218 changes: 218 additions & 0 deletions google_bigquery_syndicated_dataset/main.tf
@@ -0,0 +1,218 @@
/**
* # google_bigquery_syndicated_dataset
*
* Creates a BigQuery dataset configured for syndication to Mozilla Data Platform
* infrastructure (mozdata and data-shared projects). This module is meant to
* simplify the steps in [Importing Data from OLTP Databases to BigQuery via Federated Queries](https://mozilla-hub.atlassian.net/wiki/spaces/IP/pages/473727279/Importing+Data+from+OLTP+Databases+to+BigQuery+via+Federated+Queries).
*
* This module abstracts away the syndication boilerplate:
* - Resolves syndication service accounts via workgroup
* - Looks up the org custom role for syndication
* - Auto-discovers whether syndicated datasets exist in data platform projects
* - Adds dataset authorizations only when targets exist
*
* ## Target Inference
*
* The `syndicated_dataset_id` (defaults to `dataset_id`) determines targets:
* - Does NOT end in `_syndicate` → user-facing → both mozdata and data-shared
* - Ends in `_syndicate` → data-shared only
* - Eventually the syndication datasets themselves will be inferred from bqetl metadata available to all MozCloud tenant infrastructure
*
* ## State propagation
*
* While this module reduces the number of PRs required to set up syndication, it does not automatically
* propagate those changes. You still need to follow the steps at
* https://mozilla-hub.atlassian.net/wiki/spaces/SRE/pages/27924945/Atlantis+-+Terraform+Automation#Invoking-Atlantis-without-terraform-changes
* to authorize datasets on the tenant infra side. Eventually, policy-as-code and drift
* detection automation will make these manual steps unnecessary.
*
*/

locals {
  target_realm          = coalesce(var.target_realm, var.realm)
  syndicated_dataset_id = coalesce(var.syndicated_dataset_id, var.dataset_id)
  is_user_facing        = !endswith(local.syndicated_dataset_id, "_syndicate")

  target_env = local.target_realm == "prod" ? "prod" : "stage"

  # Syndication target configuration: data-shared always, mozdata only for user-facing datasets
  target_config = merge(
    {
      data-shared = {
        project_ids = { prod = "moz-fx-data-shared-prod", nonprod = "moz-fx-data-shar-nonprod-efed" }
        state_path  = "bigquery-new"
      }
    },
    local.is_user_facing ? {
      mozdata = {
        project_ids = { prod = "mozdata", nonprod = "mozdata-nonprod" }
        state_path  = "bigquery"
      }
    } : {}
  )

  targets = {
    for name, cfg in local.target_config :
    name => {
      project_id   = cfg.project_ids[local.target_realm]
      state_prefix = "projects/${name}/${local.target_realm}/envs/${local.target_env}/${cfg.state_path}"
    }
  }
}

# Remote state from syndication targets to check if datasets exist
data "terraform_remote_state" "syndication_target" {
  for_each = local.targets

  backend = "gcs"

  config = {
    bucket = "${each.value.project_id}-tf"
    prefix = each.value.state_prefix
  }
}

locals {
  # Authorized dataset access for targets where the syndicated dataset exists
  syndication_dataset_access = [
    for name, target in local.targets : {
      project_id = target.project_id
      dataset_id = local.syndicated_dataset_id
    }
    if contains(
      values(data.terraform_remote_state.syndication_target[name].outputs.syndicate_datasets),
      local.syndicated_dataset_id
    )
  ]
}

data "terraform_remote_state" "org" {
  backend = "gcs"

  config = {
    bucket = "moz-fx-platform-mgmt-global-tf"
    prefix = "projects/org"
  }
}

# Service accounts that perform syndication
# Currently Jenkins with plans to move to Airflow, see https://mozilla-hub.atlassian.net/browse/SVCSE-3005
module "syndication_workgroup" {
  source = "github.com/mozilla/terraform-modules//mozilla_workgroup?ref=main"
  ids    = var.syndication_workgroup_ids
  # TODO this config will need to be removed when SVCSE-4008 is complete
  terraform_remote_state_bucket = "moz-fx-data-terraform-state-global"
  terraform_remote_state_prefix = "projects/data-shared/global/access-groups"
}

resource "google_bigquery_dataset" "dataset" {
  count = var.create_dataset ? 1 : 0

  dataset_id                      = var.dataset_id
  location                        = var.location
  friendly_name                   = var.friendly_name
  description                     = var.description
  labels                          = var.labels
  default_table_expiration_ms     = var.default_table_expiration_ms
  default_partition_expiration_ms = var.default_partition_expiration_ms
  max_time_travel_hours           = var.max_time_travel_hours
  delete_contents_on_destroy      = var.delete_contents_on_destroy

  # projectOwners access is implied unless explicitly disabled
  dynamic "access" {
    for_each = var.disable_project_owners_access ? [] : [1]
    content {
      role          = "OWNER"
      special_group = "projectOwners"
    }
  }

  # App-specific IAM access
  dynamic "access" {
    for_each = [for a in var.access : a if a.role != null && a.dataset == null && a.view == null]
    content {
      role           = access.value.role
      user_by_email  = access.value.user_by_email
      group_by_email = access.value.group_by_email
      special_group  = access.value.special_group
      domain         = access.value.domain
      iam_member     = access.value.iam_member
    }
  }

  # App-specific non-syndicate authorized dataset access
  dynamic "access" {
    for_each = [for a in var.access : a if a.dataset != null]
    content {
      dataset {
        dataset {
          project_id = access.value.dataset.dataset.project_id
          dataset_id = access.value.dataset.dataset.dataset_id
        }
        target_types = access.value.dataset.target_types
      }
    }
  }

  # App-specific authorized views
  dynamic "access" {
    for_each = [for a in var.access : a if a.view != null]
    content {
      view {
        project_id = access.value.view.project_id
        dataset_id = access.value.view.dataset_id
        table_id   = access.value.view.table_id
      }
    }
  }

  # Syndication service account access
  dynamic "access" {
    for_each = module.syndication_workgroup.service_accounts
    content {
      role          = data.terraform_remote_state.org.outputs.bigquery_jobs_manage_syndicate_dataset_role_id
      user_by_email = access.value
    }
  }

  # Syndication authorized dataset access for syndicates
  dynamic "access" {
    for_each = local.syndication_dataset_access
    content {
      dataset {
        dataset {
          project_id = access.value.project_id
          dataset_id = access.value.dataset_id
        }
        target_types = ["VIEWS"]
      }
    }
  }
}

# Non-authoritative syndication access for externally-managed datasets
resource "google_bigquery_dataset_access" "syndication_role" {
  for_each = var.create_dataset ? {} : {
    for sa in module.syndication_workgroup.service_accounts : sa => sa
  }

  dataset_id    = var.dataset_id
  role          = data.terraform_remote_state.org.outputs.bigquery_jobs_manage_syndicate_dataset_role_id
  user_by_email = each.value
}

resource "google_bigquery_dataset_access" "syndicated_authorization" {
  for_each = var.create_dataset ? {} : {
    for entry in local.syndication_dataset_access : "${entry.project_id}/${entry.dataset_id}" => entry
  }

  dataset_id = var.dataset_id

  dataset {
    dataset {
      project_id = each.value.project_id
      dataset_id = each.value.dataset_id
    }
    target_types = ["VIEWS"]
  }
}