Skip to content

CUMULUS-4553 Centralize schema generation#4295

Open
reweeden wants to merge 7 commits into
masterfrom
rew/CUMULUS-4553-centralize-schemas
Open

CUMULUS-4553 Centralize schema generation#4295
reweeden wants to merge 7 commits into
masterfrom
rew/CUMULUS-4553-centralize-schemas

Conversation

@reweeden

@reweeden reweeden commented Mar 9, 2026

Copy link
Copy Markdown
Contributor

Summary: Refactors all python task schemas to use pydantic to generate the jsonschemas. The functionality for doing the generation is encapsulated in the python-schemas package which is referenced as a dev dependency. This package also includes some reusable schemas that can be imported when needed across multiple tasks e.g. Granule or Cnm.

Task schemas can be generated for the whole repo using:

lerna run generate-task-schemas

This will regenerate the jsonschemas for both the python and typescript tasks.

Addresses CUMULUS-4553: Centralize Coreification input/output schemas

Changes

  • Add stub package for shared development dependencies
  • Add python-schemas package for converting pydantic models to jsonschema and housing reusable models needed by multiple tasks.
  • Add uv-export script to package.jsons so that requirements.txt can be updated with lerna run uv-export
  • Add uv-lock script to package.jsons to upgrade the dependencies in all lock files with lerna run uv-lock
  • Fix an issue where schema validation wasn't happening when the python tasks were deployed to lambda, allowing some of the integration tests to pass when they should have failed with validation errors.

PR Checklist

  • Update CHANGELOG
  • Unit tests
  • Ad-hoc testing - Deploy changes and test manually
  • Integration tests

📝 Note:
For most pull requests, please Squash and merge to maintain a clean and readable commit history.

@reweeden reweeden force-pushed the rew/CUMULUS-4553-centralize-schemas branch 18 times, most recently from a42b0d3 to 3fec667 Compare March 16, 2026 19:14
@reweeden reweeden force-pushed the rew/CUMULUS-4553-centralize-schemas branch from 3fec667 to 05d723f Compare March 19, 2026 15:12
@reweeden reweeden marked this pull request as ready for review March 19, 2026 16:42

@mikedorfman mikedorfman left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to get partway through this - wanted to post my comments before CoB for me. I'll try to get the rest of the way through it either tomorrow or Monday.

Comment thread packages/python-common-dev-dependencies/README.md

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just confirming my understanding - this is duplicating the definition here, right? I'm not sure where the authoritative definition/schema is for CNM and what references the podaac schema, but we might want processes pointed there to point here at some point so we have a single source of truth.

@reweeden reweeden Mar 19, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea true. I believe that @yenes56 originally generated the pydantic models from that schema you linked. I went through and cleaned them up a little by hand because the auto generation was a little crude and duplicated a bunch of stuff that I was able to combine.

But that is a good question, how do we ensure that updates to the CNM schema are reflected here? One thing I'm a bit concerned with is the version number enum. Let's say a new version of CNM is released, tasks that use this schema will reject that new version with a validation error until the schemas are updated, and a new version of cumulus is released even if structurally nothing has changed in the schema.

print(f"Wrote schema to {output_path}")

if encountered_error:
sys.exit(-1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could just do raise?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I didn't want it to spew a python exception trace to the screen.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why we're not generating/saving a uv lockfile here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, for libraries it doesn't make sense to commit the lock file since the library itself doesn't have just one possible environment. It has a range of compatible versions which will eventually be resolved to a particular environment in the application that is using the library. In this case every task that pulls this library in as a dev dependency has a lock file in which the particular environment for that task will be captured. It's possible that each task gets slightly different versions of the dependencies based on what is compatible with everything else in the dependency tree.

"""
LOGGER.setMetadata(event, context)
cumulus_task_return = run_cumulus_task(lambda_adapter, event, context, SCHEMAS)
cumulus_task_return = run_cumulus_task(lambda_adapter, event, context)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason you're using the implicit/path-based schema validation instead of passing the paths to run_cumulus_task? I really wish run_cumulus_task complained/failed if it didn't find the schemas (particularly when passing the paths explicitly).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I figured we should be sticking to the same default paths that the rest of the core tasks use and it would be less likely for someone to be tempted to do it differently if we just use the implicit defaults. Not exactly sure, it felt redundant to me to pass them in explicitly.

Yea, if it failed when they were passed in explicitly that might be a reason to do it.

@reweeden reweeden force-pushed the rew/CUMULUS-4553-centralize-schemas branch from 05d723f to bfe8686 Compare March 23, 2026 18:40
@reweeden reweeden force-pushed the rew/CUMULUS-4553-centralize-schemas branch from bfe8686 to e75fa31 Compare March 30, 2026 16:59
@reweeden reweeden force-pushed the rew/CUMULUS-4553-centralize-schemas branch from e75fa31 to 35ecc05 Compare March 30, 2026 17:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants