Add signed batch machine run context / ADR#1772

Open
krowvin wants to merge 18 commits into develop from batch-dynamic-runtime/cda-job-context

Conversation

krowvin commented Jun 9, 2026
Collaborator

Summary

This is the parent PR for the CWMS Batch Events M2M auth and dynamic runtime work.

CDA now supports the preferred production shape when Keycloak can mint the batch machine context directly into the normal access token. CDA consumes validated OIDC claims such as machine_auth and run_as_office, while still rejecting unregistered machine principals instead of auto-creating them.
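The claim handling described above can be sketched roughly as follows. This is a minimal illustration, not the CDA implementation: the class name `MachineClaimsSketch`, the use of the `sub` claim as the principal identifier, and the in-memory set of registered principals are all assumptions made for the example. Only the claim names `machine_auth` and `run_as_office` come from this PR.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of consuming machine claims from an already validated token. */
final class MachineClaimsSketch {

    /** Thrown when a machine token cannot be used for a batch run. */
    static final class RejectedException extends RuntimeException {
        RejectedException(String message) {
            super(message);
        }
    }

    /**
     * Returns the normalized run-as office for a machine token,
     * or null when the token is not a machine token.
     * Assumes the token signature and expiry were verified beforehand.
     */
    static String resolveRunAsOffice(Map<String, Object> claims, Set<String> registeredPrincipals) {
        Object machineAuth = claims.get("machine_auth");
        boolean isMachine = Boolean.TRUE.equals(machineAuth) || "true".equals(machineAuth);
        if (!isMachine) {
            return null; // ordinary user token: no batch context applies
        }
        // Assumption: "sub" identifies the machine principal.
        Object subject = claims.get("sub");
        if (!(subject instanceof String) || !registeredPrincipals.contains(subject)) {
            // Reject instead of auto-creating the principal.
            throw new RejectedException("Machine principal is not registered");
        }
        Object office = claims.get("run_as_office");
        if (!(office instanceof String) || ((String) office).isBlank()) {
            throw new RejectedException("Machine token is missing run_as_office");
        }
        return ((String) office).trim().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Map<String, Object> claims = Map.of(
                "machine_auth", true,
                "sub", "example-runner",      // hypothetical service account id
                "run_as_office", "swt");
        System.out.println(resolveRunAsOffice(claims, Set.of("example-runner")));
    }
}
```

The point of the sketch is the ordering: the machine flag is read first, an unknown principal is a hard failure rather than a trigger for user creation, and the office is normalized only after both checks pass.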

The signed X-CWMS-Job-Context path remains in the design as a fallback for cases where Keycloak cannot provide the needed machine-run context without a custom extension. The current rollout avoids a Keycloak SPI by using office-scoped scheduler and runner service accounts.

Related Draft PRs

| Area | PR | Notes |
| --- | --- | --- |
| Batch Events registry/API/UI/dispatcher | USACE-WaterManagement/cwms-batch-events#184 | Script registry, command/file source, resource profiles, schedule timezones, local timeout enforcement, repo browser, runtime broker. |
| Shared runner image | USACE/cwbi-wm-images#51 | Runner fetches brokered env, mints CDA bearer token with office runner client, supports Python/Node/Java/shell. |
| cwms-python client | HydrologicEngineeringCenter/cwms-python#295 | Supports CDA_BEARER_TOKEN and BATCH_JOB_CONTEXT_TOKEN. |
| cwms-cli client | HydrologicEngineeringCenter/cwms-cli#226 | Routes CLI CDA sessions through batch-aware auth helper. |
| Airflow DAG config | USACE/airflow-config#429 | Registry-driven scheduled jobs, per-office scheduler clients, schedule timezone handling. |
| SWT proof jobs | USACE-WaterManagement/swt-wm-cwbi-jobs#23 | Office proof branch for bearer-token batch runtime. |

Blocked by repository permissions / fork policy:

| Area | Local branch | Status |
| --- | --- | --- |
| AWS Batch CDK infra | cwms-batch batch-dynamic-runtime/cdk-runtime-jobdefs | Local branch is ready, but upstream rejected push and GitHub policy blocks forking cwbi-dev-infrastructure/cwms-batch. |
| Airflow infra secrets | airflow batch-dynamic-runtime/airflow-batch-events-auth | Local branch is ready, but upstream rejected push and GitHub policy blocks forking cwbi-dev-infrastructure/airflow. |

Diagrams

System Overview

CWMS Batch M2M overview

Editable source: batch-m2m-overview.drawio

End User UI Flow

End user Batch Events UI flow

Editable source: batch-ui-job-flow.drawio

Airflow Scheduled Flow

Airflow scheduled Batch Events flow

Editable source: batch-airflow-scheduler-flow.drawio

Validation

CDA:

  • JAVA_HOME="C:\Program Files\Java\jdk-21" ./gradlew.bat :cwms-data-api:test --tests cwms.cda.security.BatchJobContextTest --no-daemon --stacktrace
  • JAVA_HOME="C:\Program Files\Java\jdk-21" ./gradlew.bat :cwms-data-api:integrationTests --tests cwms.cda.api.auth.OpenIdConnectTestIT --no-daemon --stacktrace
  • The integration test proves Keycloak service-account client_credentials tokens can carry machine_auth/run_as_office, CDA rejects unregistered machine principals, and CDA accepts registered office-scoped machine principals.

Cross-repo/local E2E evidence:

  • Batch Events local UI created and ran jobs using the new registry model.
  • A command-mode timeout job with timeoutMinutes=1 and a 90-second sleep failed locally with Local executor timeout after 60 seconds.
  • Airflow helper tests verify registry timezone evaluation, DST gap skip, and repeated local occurrence run-once behavior.
  • Batch Events UI build passed after Scripts Manager source/command/path browser/timezone changes.
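To make the "DST gap skip" and "repeated local occurrence run-once" behavior concrete, here is a minimal sketch using `java.time`. It is not the Airflow helper from the linked PR; the class and method names are hypothetical, and it assumes a schedule expressed as a local wall-clock time in a registry-supplied zone.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.Optional;

/** Hypothetical sketch of resolving a local schedule time to at most one instant per day. */
final class ScheduleZoneSketch {

    /**
     * Maps a local schedule time to the instant at which the job should fire.
     * Empty when the local time does not exist on that date (spring-forward gap).
     * When the local time occurs twice (fall-back overlap), only the earlier
     * occurrence is returned, so the job runs once.
     */
    static Optional<Instant> fireInstant(LocalDate date, LocalTime time, ZoneId zone) {
        LocalDateTime local = LocalDateTime.of(date, time);
        if (zone.getRules().getValidOffsets(local).isEmpty()) {
            return Optional.empty(); // gap: this wall-clock time is skipped
        }
        // With a null preferred offset, ofLocal picks the earlier offset in an overlap.
        return Optional.of(ZonedDateTime.ofLocal(local, zone, null).toInstant());
    }

    public static void main(String[] args) {
        ZoneId chicago = ZoneId.of("America/Chicago");
        System.out.println(fireInstant(LocalDate.of(2024, 3, 10), LocalTime.of(2, 30), chicago));
        System.out.println(fireInstant(LocalDate.of(2024, 11, 3), LocalTime.of(1, 30), chicago));
    }
}
```

Skipping the gap rather than shifting it forward is one possible policy; the bullet above suggests the helper skips, which is what the sketch mirrors.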

Checklist

  • AI tools used

@krowvin krowvin changed the title from "Add signed batch machine run context" to "Add signed batch machine run context / ADR" Jun 9, 2026
@krowvin krowvin marked this pull request as ready for review June 9, 2026 12:03
@krowvin krowvin requested a review from MikeNeilson June 9, 2026 12:03

@MikeNeilson MikeNeilson left a comment

Contributor

I'm being a bit pushy here, AND I may in fact be wrong about what we can accomplish with Keycloak and thus may end up having to go this route (there's nothing fundamentally wrong with the design, other than too much reliance on properties, though that's an initial design problem, not a new one). But I think if we can get Keycloak to handle the load, the transition is a lot simpler.

  final String email = claims.get(EMAIL_CLAIM, String.class);
- return dao.createUser(preferredUserName, oidcPrincipal, givenName, email);
+ DataApiPrincipal dataApiPrincipal = dao.createUser(preferredUserName, oidcPrincipal, givenName, email);
+ BatchJobContext.prepareContext(ctx, dataApiPrincipal);

Contributor

If the "batch" user is getting created randomly here, we have an issue. In the context of a batch process, this should be a failure.

Collaborator Author

Corrected in bc40f8b

public static final String REQUESTED_BY_ATTR = "BatchRequestedBy";
public static final String DISPATCH_SOURCE_ATTR = "BatchDispatchSource";

public static final String SECRET_PROPERTY = "cwms.dataapi.batch.jobContext.secret";

Contributor

Why is Batch becoming an issuer of secrets?

My understanding is that Keycloak would provide the JWT and CDA would just consume it.

if (username == null) {
return false;
}
String machineUsers = readSetting(MACHINE_USERS_PROPERTY, DEFAULT_MACHINE_USERS);

Contributor

CDA should not need to know anything about this; the JWT provided by Keycloak can embed a claim such as "machine-auth", and decisions can be made from that.

Collaborator Author

When I first did this, I was trying to figure out how to make it work with only one service account in Keycloak. I now have it working with a service account per office.

Batch is no longer the issuer of that context in the preferred path. Keycloak mints the normal access token for a per-office service account.

throw new CwmsAuthException("Batch job context missing run_as_office",
HttpServletResponse.SC_UNAUTHORIZED);
}
ctx.attribute(RUN_AS_OFFICE_ATTR, office.toUpperCase(Locale.ROOT));

Contributor

While appropriate to put into a future logging context, these shouldn't need to be part of the Request/Response attributes. Downstream components needing to know that is a definite code smell.

Collaborator Author

CDA now consumes this from the validated Keycloak/OIDC access token via claims like machine_auth=true and run_as_office=<office>, rather than relying on Batch to issue a separate signed context.

The remaining CDA-side knowledge is just the claim contract and normal user/principal lookup. It still rejects unregistered machine principals instead of auto-creating them. Locally I verified this with per-office Keycloak service accounts that mint the claims directly into the access token.


Provide CWMS Data API with a trusted batch run context for jobs that execute through a shared machine identity.

Batch runtimes will authenticate to CDA with a service account (via Keycloak). Each job will also provide a signed context token that identifies the authorized job launch context, including the office for which the scheduler or API approved the run.

Contributor

Setting up additional signing is going to be difficult to get right, and will involve yet more CDK changes. Granted, those won't be difficult.

I would like us to first determine whether we can set up Keycloak to receive the required information (such as the office and the specific job identification) and return it in the already-signed JWT access token.

Some of the other information can, and should, still be provided, but it doesn't really require signing; it's just informational. The office would be used to set the session context (or by the future authorization system) to appropriately limit operations.

E.g., office + job identification can be readily tied to a policy to appropriately limit operations.

Collaborator Author

Addressed this some here

#1772 (comment)

krowvin commented Jun 26, 2026

Collaborator Author

CDA now takes the validated OIDC claims machine_auth and run_as_office.

Stopped auto-creating machine principals.

After checking the Keycloak paths, I do not think we can get dynamic per-job values like run_as_office and job_id into the access token using normal realm/client configuration alone. Built-in protocol mappers work for static claims, service account/user attributes, roles, audiences, etc., but they do not give us a clean, supported way for Batch to send trusted per-job context during client_credentials and have Keycloak validate and mint it into the access token.

Keycloak has a way to do this via an SPI/provider, such as a custom protocol mapper or script mapper deployed into Keycloak. Keycloak SPI docs: https://www.keycloak.org/docs/latest/server_development/index.html#_providers

I do not think CWBI is likely to accept that operationally. If they did, we would need an owner and process for packaging, deploying, versioning, and maintaining that custom provider inside their Keycloak infrastructure.

Using a signed X-CWMS-Job-Context looks to be the way to go when Keycloak cannot create dynamic job context for us. CDA validates the normal Keycloak token for the machine principal, then separately validates the dispatcher signed job context for the run office.
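For illustration only, the second validation step (checking the dispatcher-signed job context) might look something like the sketch below. The token layout `base64url(payload).base64url(signature)`, the choice of HMAC-SHA256 with a shared secret, and the class name are assumptions for this example; the PR does not specify the actual X-CWMS-Job-Context format here.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical sketch of signing and verifying a job context with a shared secret. */
final class JobContextTokenSketch {
    private static final String HMAC = "HmacSHA256";

    /** Dispatcher side: produce payload.signature, both base64url without padding. */
    static String sign(String payload, byte[] secret) {
        byte[] body = payload.getBytes(StandardCharsets.UTF_8);
        Base64.Encoder enc = Base64.getUrlEncoder().withoutPadding();
        return enc.encodeToString(body) + "." + enc.encodeToString(hmac(body, secret));
    }

    /** CDA side: return the payload if the signature matches, else throw SecurityException. */
    static String verify(String token, byte[] secret) {
        String[] parts = token.split("\\.");
        if (parts.length != 2) {
            throw new SecurityException("Malformed job context");
        }
        byte[] body;
        byte[] presented;
        try {
            body = Base64.getUrlDecoder().decode(parts[0]);
            presented = Base64.getUrlDecoder().decode(parts[1]);
        } catch (IllegalArgumentException e) {
            throw new SecurityException("Malformed job context");
        }
        // MessageDigest.isEqual compares in constant time.
        if (!MessageDigest.isEqual(hmac(body, secret), presented)) {
            throw new SecurityException("Job context signature mismatch");
        }
        return new String(body, StandardCharsets.UTF_8);
    }

    private static byte[] hmac(byte[] data, byte[] secret) {
        try {
            Mac mac = Mac.getInstance(HMAC);
            mac.init(new SecretKeySpec(secret, HMAC));
            return mac.doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC-SHA256 unavailable", e);
        }
    }

    public static void main(String[] args) {
        byte[] secret = "example-secret".getBytes(StandardCharsets.UTF_8);
        String token = sign("{\"run_as_office\":\"SWT\"}", secret);
        System.out.println(verify(token, secret));
    }
}
```

A real context would presumably also carry an expiry and a job identifier that CDA checks after the signature, and the Keycloak bearer token would still be validated independently before any of this runs.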

I'm working now to verify that all of this works end to end on localhost.

@krowvin krowvin requested a review from MikeNeilson June 26, 2026 04:28

2 participants