
fix(dataproc): ensure staging dags run dataproc with default service account#212

Merged: DSuveges merged 1 commit into dev from genetics-dev-sa on Jun 23, 2026

@project-defiant project-defiant commented Jun 23, 2026

Collaborator

Context

This is a follow-up to #205, which aligned the usage of Dataproc operators across UnifiedPipeline and staging DAGs. The core change there was to remove the duplicated code for creating clusters, deleting clusters and running cluster steps. That led to the following change:

generate_dataproc_task_chain, the function called by staging DAGs, now constructs a CustomClusterConfig directly from the caller's cluster_config dict. The CustomClusterConfig Pydantic model declares service_account with a module-level default:

GCP_SERVICE_ACCOUNT = "up-airflow-dev@open-targets-eu-dev.iam.gserviceaccount.com"

Issue

Staging DAGs that do not explicitly include service_account in their config dict (see yaml configs) were silently inheriting this Airflow dev service account. The staging DAGs are meant to run on the open-targets-genetics-dev project under the Dataproc default compute service account, not under the Airflow SA. This caused IAM-related failures at cluster creation time when the Airflow SA lacked the permissions expected in the staging context.

The same gap existed for internal_ip_only: staging DAGs omitting that key fell through to Pydantic's None default, whereas the intended baseline is False.
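The fall-through can be reproduced with a minimal sketch. It uses a standard-library dataclass as a stand-in for the Pydantic model, so ClusterConfigSketch and the num_workers field are hypothetical names; only the two field names and the default service account value come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

GCP_SERVICE_ACCOUNT = "up-airflow-dev@open-targets-eu-dev.iam.gserviceaccount.com"


@dataclass
class ClusterConfigSketch:
    """Hypothetical stand-in for CustomClusterConfig (the real model is Pydantic)."""

    num_workers: int = 2  # illustrative field, not taken from the real model
    service_account: Optional[str] = GCP_SERVICE_ACCOUNT
    internal_ip_only: Optional[bool] = None


# A staging-style config dict that omits both keys.
staging_cluster_config = {"num_workers": 4}
config = ClusterConfigSketch(**staging_cluster_config)

# The omitted keys fall through to the model defaults.
inherited_sa = config.service_account
inherited_ip_flag = config.internal_ip_only
```

Under these assumptions, inherited_sa ends up as the Airflow dev account and inherited_ip_flag as None, which matches the two gaps described above.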

Implementation

Before unpacking cluster_config into CustomClusterConfig, inject safe sentinel defaults for the two fields via dict.setdefault():

  kwargs["cluster_config"].setdefault("service_account", None)   # use Dataproc default compute SA
  kwargs["cluster_config"].setdefault("internal_ip_only", False)   # allow internet access by default

setdefault is a no-op when the key is already present, so production DAGs that explicitly name a service account are unaffected.
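A minimal sketch of that no-op behaviour, using plain dicts. The helper name apply_staging_defaults, the num_workers key and the example service account address are hypothetical and not taken from the codebase.

```python
def apply_staging_defaults(cluster_config: dict) -> dict:
    """Hypothetical helper mirroring the two setdefault calls above."""
    # Only fills the key when it is absent; an existing value is left alone.
    cluster_config.setdefault("service_account", None)
    cluster_config.setdefault("internal_ip_only", False)
    return cluster_config


# Staging-style config: both keys missing, so both defaults are injected.
staging = apply_staging_defaults({"num_workers": 4})

# Config with explicit values: setdefault leaves them untouched.
explicit = apply_staging_defaults(
    {
        "service_account": "custom-sa@example-project.iam.gserviceaccount.com",
        "internal_ip_only": True,
    }
)
```

Note that dict.setdefault mutates the dict in place, so the caller's cluster_config also carries the injected keys afterwards.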

@project-defiant project-defiant requested a review from DSuveges June 23, 2026 10:51
@DSuveges DSuveges merged commit e617ef7 into dev Jun 23, 2026
2 checks passed
@DSuveges DSuveges deleted the genetics-dev-sa branch June 23, 2026 14:06