Skip to content

Add check_nan_every to CI configs#1916

Merged
hgpeterson merged 1 commit into
mainfrom
hp/check_nans
May 12, 2026
Merged

Add check_nan_every to CI configs#1916
hgpeterson merged 1 commit into
mainfrom
hp/check_nans

Conversation

@hgpeterson
Copy link
Copy Markdown
Member

Closes #1915

@hgpeterson
Copy link
Copy Markdown
Member Author

Confirmed that the setup that erroneously passed last time correctly failed under these changes.

@hgpeterson hgpeterson marked this pull request as ready for review May 8, 2026 20:44
Copy link
Copy Markdown
Member

@szy21 szy21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that check nan is very slow. It would be good to check if this significantly increases ci time. If so we can increase check_nan_every for the jobs that run longer.

@hgpeterson
Copy link
Copy Markdown
Member Author

The timings between the two cases linked above don't seem too different (although for some reason the sims that fail with NaNs take much longer?).

@szy21
Copy link
Copy Markdown
Member

szy21 commented May 8, 2026

The timings between the two cases linked above don't seem too different (although for some reason the sims that fail with NaNs take much longer?).

cpu jobs can be a bit random as it depends on which node we get. But if it's not too different it's fine. We can always change later if ci becomes too long.

Copy link
Copy Markdown
Member

@juliasloan25 juliasloan25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, thank you! Could you add some info to the "Debugging tips" section of the ClimaCoupler docs explaining that this option exists and what it does?

@juliasloan25
Copy link
Copy Markdown
Member

How does the run that you linked as failing now pass in this PR's CI? Is it because of the fix here?

@hgpeterson
Copy link
Copy Markdown
Member Author

How does the run that you linked as failing now pass in this PR's CI? Is it because of the fix here?

Yes, that run is fixed due to #1914. But I tested that this PR's changes make CI correctly fail pre-#1914 here.

@akshaysridhar
Copy link
Copy Markdown
Member

akshaysridhar commented May 11, 2026

Thanks @hgpeterson; following the above comment, if you could add a note to the "debugging" block that addresses this yaml config arg, we can squash and merge this PR. The CI timing seems to be at 1h24m now vs 1h11m prior to this callback frequency change.

@hgpeterson hgpeterson enabled auto-merge (squash) May 12, 2026 17:19
@hgpeterson hgpeterson merged commit fa286f8 into main May 12, 2026
12 checks passed
@hgpeterson hgpeterson deleted the hp/check_nans branch May 12, 2026 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ensure CI fails when NaNs occur

4 participants