mozilla · whd · Mar 4, 2026 · Mar 5, 2026
diff --git a/google_bigquery_syndicated_dataset/.terraform-docs.yml b/google_bigquery_syndicated_dataset/.terraform-docs.yml
@@ -0,0 +1,33 @@
+content: |-
+  {{ .Header }}
+
+  ## Example
+
+  ```hcl
+  module "treeherder" {
+    source = "github.com/mozilla/terraform-modules//google_bigquery_syndicated_dataset?ref=main"
+
+    dataset_id            = "for_treeherder_1"
+    syndicated_dataset_id = "treeherder_db"
+    realm                 = var.realm
+
+    access = [
+      { role = "OWNER", special_group = "projectOwners" },
+      # projectReaders/projectWriters usage is discouraged, see DSRE-1497
+      { role = "READER", special_group = "projectReaders" },
+      { role = "WRITER", special_group = "projectWriters" },
+    ]
+  }
+  ```
+
+  {{ .Requirements }}
+
+  {{ .Providers }}
+
+  {{ .Modules }}
+
+  {{ .Resources }}
+
+  {{ .Inputs }}
+
+  {{ .Outputs }}
diff --git a/google_bigquery_syndicated_dataset/README.md b/google_bigquery_syndicated_dataset/README.md
@@ -0,0 +1,109 @@
+<!-- BEGIN_TF_DOCS -->
+# google\_bigquery\_syndicated\_dataset
+
+Creates a BigQuery dataset configured for syndication to Mozilla Data Platform
+infrastructure (mozdata and data-shared projects). This module is meant to
+simplify the steps in [Importing Data from OLTP Databases to BigQuery via Federated Queries](https://mozilla-hub.atlassian.net/wiki/spaces/IP/pages/473727279/Importing+Data+from+OLTP+Databases+to+BigQuery+via+Federated+Queries).
+
+This module abstracts away the syndication boilerplate:
+- Resolves syndication service accounts via workgroup
+- Looks up the org custom role for syndication
+- Auto-discovers whether syndicated datasets exist in data platform projects
+- Adds dataset authorizations only when targets exist
+
+## Target Inference
+
+The `syndicated_dataset_id` (defaults to `dataset_id`) determines targets:
+- Does NOT end in `_syndicate` → user-facing → both mozdata and data-shared
+- Ends in `_syndicate` → data-shared only
+- Eventually the syndication datasets themselves will be inferred from bqetl metadata available to all MozCloud tenant infrastructure
+
+## State propagation
+
+While this module reduces the number of PRs required to set up syndication, it does not automatically
+propagate those changes. You still need to follow the steps in
+https://mozilla-hub.atlassian.net/wiki/spaces/SRE/pages/27924945/Atlantis+-+Terraform+Automation#Invoking-Atlantis-without-terraform-changes
+to authorize datasets on the tenant infra side. Eventually, policy-as-code and drift
+detection automation will make these manual steps unnecessary.
+
+## Example
+
+```hcl
+module "treeherder" {
+  source = "github.com/mozilla/terraform-modules//google_bigquery_syndicated_dataset?ref=main"
+
+  dataset_id            = "for_treeherder_1"
+  syndicated_dataset_id = "treeherder_db"
+  realm                 = var.realm
+
+  access = [
+    { role = "OWNER", special_group = "projectOwners" },
+    # projectReaders/projectWriters usage is discouraged, see DSRE-1497
+    { role = "READER", special_group = "projectReaders" },
+    { role = "WRITER", special_group = "projectWriters" },
+  ]
+}
+```
+
+## Requirements
+
+| Name | Version |
+|------|---------|
+| <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 1.0 |
+| <a name="requirement_google"></a> [google](#requirement\_google) | >= 4.0 |
+
+## Providers
+
+| Name | Version |
+|------|---------|
+| <a name="provider_google"></a> [google](#provider\_google) | >= 4.0 |
+| <a name="provider_terraform"></a> [terraform](#provider\_terraform) | n/a |
+
+## Modules
+
+| Name | Source | Version |
+|------|--------|---------|
+| <a name="module_syndication_workgroup"></a> [syndication\_workgroup](#module\_syndication\_workgroup) | github.com/mozilla/terraform-modules//mozilla_workgroup | main |
+
+## Resources
+
+| Name | Type |
+|------|------|
+| [google_bigquery_dataset.dataset](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset) | resource |
+| [google_bigquery_dataset_access.syndicated_authorization](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset_access) | resource |
+| [google_bigquery_dataset_access.syndication_role](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset_access) | resource |
+| [terraform_remote_state.org](https://registry.terraform.io/providers/hashicorp/terraform/latest/docs/data-sources/remote_state) | data source |
+| [terraform_remote_state.syndication_target](https://registry.terraform.io/providers/hashicorp/terraform/latest/docs/data-sources/remote_state) | data source |
+
+## Inputs
+
+| Name | Description | Type | Default | Required |
+|------|-------------|------|---------|:--------:|
+| <a name="input_access"></a> [access](#input\_access) | Application-specific access blocks for this dataset. projectOwners OWNER access is included by default unless disable\_project\_owners\_access is set. | <pre>set(object({<br/>    role           = optional(string)<br/>    user_by_email  = optional(string)<br/>    group_by_email = optional(string)<br/>    special_group  = optional(string)<br/>    domain         = optional(string)<br/>    iam_member     = optional(string)<br/>    dataset = optional(object({<br/>      dataset = object({<br/>        project_id = string<br/>        dataset_id = string<br/>      })<br/>      target_types = list(string)<br/>    }))<br/>    view = optional(object({<br/>      project_id = string<br/>      dataset_id = string<br/>      table_id   = string<br/>    }))<br/>  }))</pre> | `[]` | no |
+| <a name="input_create_dataset"></a> [create\_dataset](#input\_create\_dataset) | Whether to create the BigQuery dataset. Set to false to only manage syndication access on an existing dataset. | `bool` | `true` | no |
+| <a name="input_dataset_id"></a> [dataset\_id](#input\_dataset\_id) | A unique ID for this dataset, without the project name. | `string` | n/a | yes |
+| <a name="input_default_partition_expiration_ms"></a> [default\_partition\_expiration\_ms](#input\_default\_partition\_expiration\_ms) | The default partition expiration for all partitioned tables, in milliseconds. | `number` | `null` | no |
+| <a name="input_default_table_expiration_ms"></a> [default\_table\_expiration\_ms](#input\_default\_table\_expiration\_ms) | The default lifetime of all tables in the dataset, in milliseconds. | `number` | `null` | no |
+| <a name="input_delete_contents_on_destroy"></a> [delete\_contents\_on\_destroy](#input\_delete\_contents\_on\_destroy) | If true, delete all tables in the dataset when destroying the resource. | `bool` | `false` | no |
+| <a name="input_description"></a> [description](#input\_description) | A user-friendly description of the dataset. | `string` | `null` | no |
+| <a name="input_disable_project_owners_access"></a> [disable\_project\_owners\_access](#input\_disable\_project\_owners\_access) | Disable the implied projectOwners OWNER access on this dataset. This should almost never be set. | `bool` | `false` | no |
+| <a name="input_friendly_name"></a> [friendly\_name](#input\_friendly\_name) | A descriptive name for the dataset. | `string` | `null` | no |
+| <a name="input_labels"></a> [labels](#input\_labels) | Labels to apply to the dataset. | `map(string)` | `{}` | no |
+| <a name="input_location"></a> [location](#input\_location) | The geographic location where the dataset should reside. | `string` | `"US"` | no |
+| <a name="input_max_time_travel_hours"></a> [max\_time\_travel\_hours](#input\_max\_time\_travel\_hours) | Defines the time travel window in hours. | `number` | `null` | no |
+| <a name="input_realm"></a> [realm](#input\_realm) | Source infrastructure realm. | `string` | n/a | yes |
+| <a name="input_syndicated_dataset_id"></a> [syndicated\_dataset\_id](#input\_syndicated\_dataset\_id) | Name of the dataset in target projects. Defaults to dataset\_id. If the name ends in '\_syndicate', only data-shared is targeted (no mozdata). | `string` | `null` | no |
+| <a name="input_syndication_workgroup_ids"></a> [syndication\_workgroup\_ids](#input\_syndication\_workgroup\_ids) | Workgroup identifiers for service accounts that perform syndication. | `list(string)` | <pre>[<br/>  "workgroup:dataplatform/jenkins"<br/>]</pre> | no |
+| <a name="input_target_realm"></a> [target\_realm](#input\_target\_realm) | Target realm for syndication. Defaults to realm. Set this to override the target, e.g. for a nonprod source syndicating to prod targets. | `string` | `null` | no |
+
+## Outputs
+
+| Name | Description |
+|------|-------------|
+| <a name="output_dataset_id"></a> [dataset\_id](#output\_dataset\_id) | The dataset ID. |
+| <a name="output_id"></a> [id](#output\_id) | The fully-qualified dataset ID (projects/PROJECT/datasets/DATASET). |
+| <a name="output_self_link"></a> [self\_link](#output\_self\_link) | The URI of the created resource. |
+| <a name="output_syndication_role_id"></a> [syndication\_role\_id](#output\_syndication\_role\_id) | The custom role ID used for syndication access. |
+| <a name="output_syndication_service_accounts"></a> [syndication\_service\_accounts](#output\_syndication\_service\_accounts) | The service account emails used for syndication. |
+| <a name="output_syndication_targets_active"></a> [syndication\_targets\_active](#output\_syndication\_targets\_active) | Map of syndication target names to whether authorized dataset access is active. |
+<!-- END_TF_DOCS -->
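Review note on the README above (not part of the patch): the Target Inference rules and the `create_dataset` and `target_realm` inputs may be easier to follow with a couple more module calls. This is a minimal sketch under stated assumptions: the module labels and dataset IDs (`example_syndicate`, `example_existing_v1`, `example_db`) are hypothetical, and `var.realm` is assumed to be declared by the caller as in the treeherder example.

```hcl
# Hypothetical internal-only dataset. syndicated_dataset_id is unset, so it
# defaults to dataset_id; the "_syndicate" suffix means only data-shared is
# targeted (no mozdata).
module "example_internal" {
  source = "github.com/mozilla/terraform-modules//google_bigquery_syndicated_dataset?ref=main"

  dataset_id = "example_syndicate"
  realm      = var.realm
}

# Hypothetical dataset that is created and managed elsewhere. With
# create_dataset = false the module only manages syndication access on it.
module "example_existing" {
  source = "github.com/mozilla/terraform-modules//google_bigquery_syndicated_dataset?ref=main"

  dataset_id            = "example_existing_v1"
  syndicated_dataset_id = "example_db"
  create_dataset        = false
  realm                 = var.realm

  # Override for a nonprod source syndicating to prod targets.
  target_realm = "prod"
}

# Surfaces which targets currently have authorized dataset access, using the
# module's documented syndication_targets_active output.
output "example_existing_syndication_targets_active" {
  value = module.example_existing.syndication_targets_active
}
```

Because authorizations are only added once the target datasets exist, checking this output after a re-plan is one way to confirm that the state propagation steps have taken effect.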
diff --git a/google_bigquery_syndicated_dataset/main.tf b/google_bigquery_syndicated_dataset/main.tf
@@ -0,0 +1,218 @@
+/**
+ * # google_bigquery_syndicated_dataset
+ *
+ * Creates a BigQuery dataset configured for syndication to Mozilla Data Platform
+ * infrastructure (mozdata and data-shared projects). This module is meant to
+ * simplify the steps in [Importing Data from OLTP Databases to BigQuery via Federated Queries](https://mozilla-hub.atlassian.net/wiki/spaces/IP/pages/473727279/Importing+Data+from+OLTP+Databases+to+BigQuery+via+Federated+Queries).
+ *
+ * This module abstracts away the syndication boilerplate:
+ * - Resolves syndication service accounts via workgroup
+ * - Looks up the org custom role for syndication
+ * - Auto-discovers whether syndicated datasets exist in data platform projects
+ * - Adds dataset authorizations only when targets exist
+ *
+ * ## Target Inference
+ *
+ * The `syndicated_dataset_id` (defaults to `dataset_id`) determines targets:
+ * - Does NOT end in `_syndicate` → user-facing → both mozdata and data-shared
+ * - Ends in `_syndicate` → data-shared only
+ * - Eventually the syndication datasets themselves will be inferred from bqetl metadata available to all MozCloud tenant infrastructure
+ *
+ * ## State propagation
+ *
+ * While this module reduces the number of PRs required to set up syndication, it does not automatically
+ * propagate those changes. You still need to follow the steps in
+ * https://mozilla-hub.atlassian.net/wiki/spaces/SRE/pages/27924945/Atlantis+-+Terraform+Automation#Invoking-Atlantis-without-terraform-changes
+ * to authorize datasets on the tenant infra side. Eventually, policy-as-code and drift
+ * detection automation will make these manual steps unnecessary.
+ *
+ */
+
+locals {
+  target_realm          = coalesce(var.target_realm, var.realm)
+  syndicated_dataset_id = coalesce(var.syndicated_dataset_id, var.dataset_id)
+  is_user_facing        = !endswith(local.syndicated_dataset_id, "_syndicate")
+
+  target_env = local.target_realm == "prod" ? "prod" : "stage"
+
+  # Syndication target configuration: data-shared always, mozdata only for user-facing datasets
+  target_config = merge(
+    {
+      data-shared = {
+        project_ids = { prod = "moz-fx-data-shared-prod", nonprod = "moz-fx-data-shar-nonprod-efed" }
+        state_path  = "bigquery-new"
+      }
+    },
+    local.is_user_facing ? {
+      mozdata = {
+        project_ids = { prod = "mozdata", nonprod = "mozdata-nonprod" }
+        state_path  = "bigquery"
+      }
+    } : {}
+  )
+
+  targets = {
+    for name, cfg in local.target_config :
+    name => {
+      project_id   = cfg.project_ids[local.target_realm]
+      state_prefix = "projects/${name}/${local.target_realm}/envs/${local.target_env}/${cfg.state_path}"
+    }
+  }
+}
+
+# Remote state from syndication targets to check if datasets exist
+data "terraform_remote_state" "syndication_target" {
+  for_each = local.targets
+
+  backend = "gcs"
+
+  config = {
+    bucket = "${each.value.project_id}-tf"
+    prefix = each.value.state_prefix
+  }
+}
+
+locals {
+  # Authorized dataset access for targets where the syndicated dataset exists
+  syndication_dataset_access = [
+    for name, target in local.targets : {
+      project_id = target.project_id
+      dataset_id = local.syndicated_dataset_id
+    }
+    if contains(
+      values(data.terraform_remote_state.syndication_target[name].outputs.syndicate_datasets),
+      local.syndicated_dataset_id
+    )
+  ]
+}
+
+data "terraform_remote_state" "org" {
+  backend = "gcs"
+
+  config = {
+    bucket = "moz-fx-platform-mgmt-global-tf"
+    prefix = "projects/org"
+  }
+}
+
+# Service accounts that perform syndication
+# Currently Jenkins, with plans to move to Airflow; see https://mozilla-hub.atlassian.net/browse/SVCSE-3005
+module "syndication_workgroup" {
+  source = "github.com/mozilla/terraform-modules//mozilla_workgroup?ref=main"
+  ids    = var.syndication_workgroup_ids
+  # TODO this config will need to be removed when SVCSE-4008 is complete
+  terraform_remote_state_bucket = "moz-fx-data-terraform-state-global"
+  terraform_remote_state_prefix = "projects/data-shared/global/access-groups"
+}
+
+resource "google_bigquery_dataset" "dataset" {
+  count = var.create_dataset ? 1 : 0
+
+  dataset_id                      = var.dataset_id
+  location                        = var.location
+  friendly_name                   = var.friendly_name
+  description                     = var.description
+  labels                          = var.labels
+  default_table_expiration_ms     = var.default_table_expiration_ms
+  default_partition_expiration_ms = var.default_partition_expiration_ms
+  max_time_travel_hours           = var.max_time_travel_hours
+  delete_contents_on_destroy      = var.delete_contents_on_destroy
+
+  # projectOwners access is implied unless explicitly disabled
+  dynamic "access" {
+    for_each = var.disable_project_owners_access ? [] : [1]
+    content {
+      role          = "OWNER"
+      special_group = "projectOwners"
+    }
+  }
+
+  # App-specific IAM access
+  dynamic "access" {
+    for_each = [for a in var.access : a if a.role != null && a.dataset == null && a.view == null]
+    content {
+      role           = access.value.role
+      user_by_email  = access.value.user_by_email
+      group_by_email = access.value.group_by_email
+      special_group  = access.value.special_group
+      domain         = access.value.domain
+      iam_member     = access.value.iam_member
+    }
+  }
+
+  # App-specific non-syndicate authorized dataset access
+  dynamic "access" {
+    for_each = [for a in var.access : a if a.dataset != null]
+    content {
+      dataset {
+        dataset {
+          project_id = access.value.dataset.dataset.project_id
+          dataset_id = access.value.dataset.dataset.dataset_id
+        }
+        target_types = access.value.dataset.target_types
+      }
+    }
+  }
+
+  # App-specific authorized views
+  dynamic "access" {
+    for_each = [for a in var.access : a if a.view != null]
+    content {
+      view {
+        project_id = access.value.view.project_id
+        dataset_id = access.value.view.dataset_id
+        table_id   = access.value.view.table_id
+      }
+    }
+  }
+
+  # Syndication service account access
+  dynamic "access" {
+    for_each = module.syndication_workgroup.service_accounts
+    content {
+      role          = data.terraform_remote_state.org.outputs.bigquery_jobs_manage_syndicate_dataset_role_id
+      user_by_email = access.value
+    }
+  }
+
+  # Authorized dataset access for syndication targets where the syndicated dataset exists
+  dynamic "access" {
+    for_each = local.syndication_dataset_access
+    content {
+      dataset {
+        dataset {
+          project_id = access.value.project_id
+          dataset_id = access.value.dataset_id
+        }
+        target_types = ["VIEWS"]
+      }
+    }
+  }
+}
+
+# Non-authoritative syndication access for externally-managed datasets
+resource "google_bigquery_dataset_access" "syndication_role" {
+  for_each = var.create_dataset ? {} : {
+    for sa in module.syndication_workgroup.service_accounts : sa => sa
+  }
+
+  dataset_id    = var.dataset_id
+  role          = data.terraform_remote_state.org.outputs.bigquery_jobs_manage_syndicate_dataset_role_id
+  user_by_email = each.value
+}
+
+resource "google_bigquery_dataset_access" "syndicated_authorization" {
+  for_each = var.create_dataset ? {} : {
+    for entry in local.syndication_dataset_access : "${entry.project_id}/${entry.dataset_id}" => entry
+  }
+
+  dataset_id = var.dataset_id
+
+  dataset {
+    dataset {
+      project_id = each.value.project_id
+      dataset_id = each.value.dataset_id
+    }
+    target_types = ["VIEWS"]
+  }
+}
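Review note on the locals in main.tf (not part of the patch): tracing `local.targets` by hand for one input may help readers check the remote state paths. Assuming `realm = "nonprod"`, no `target_realm` override, and a `syndicated_dataset_id` that does not end in `_syndicate`, the expressions in this diff evaluate to the values below. The locals block is illustrative only; `example_targets` and `example_buckets` are hypothetical names that do not exist in the module.

```hcl
locals {
  # target_realm = "nonprod"; target_env = "stage" (any realm other than "prod")
  example_targets = {
    data-shared = {
      project_id   = "moz-fx-data-shar-nonprod-efed"
      state_prefix = "projects/data-shared/nonprod/envs/stage/bigquery-new"
    }
    mozdata = {
      project_id   = "mozdata-nonprod"
      state_prefix = "projects/mozdata/nonprod/envs/stage/bigquery"
    }
  }

  # GCS buckets read by data.terraform_remote_state.syndication_target,
  # derived from "${each.value.project_id}-tf"
  example_buckets = {
    data-shared = "moz-fx-data-shar-nonprod-efed-tf"
    mozdata     = "mozdata-nonprod-tf"
  }
}
```

With a `_syndicate` suffix the `mozdata` entry drops out of `local.target_config`, so only the data-shared state is read.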