@noinline average_bulk_microphysics_tendencies to reduce register pressure#713

Closed
petebachant wants to merge 5 commits into main from pb/perf

@petebachant
Member

This kernel is now the hottest in prog EDMF 1M AMIP by a long shot, and this change produces a ~10% speedup (kernel analysis notebook). Disclaimer: Explanatory comments written by Claude--I don't yet have a deep understanding of what's going on here!
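
For readers unfamiliar with the macro, here is a minimal sketch of what marking a function `@noinline` looks like in Julia. The names `expensive_tendency` and `sum_tendencies` are hypothetical stand-ins, not the real `average_bulk_microphysics_tendencies` signature. The sketch only shows the mechanism: the callee's body stays behind a call instead of being merged into the caller's loop. Whether that lowers register pressure or speeds anything up depends on the compiler and device, and has to be measured.

```julia
# Hypothetical stand-in for an expensive per-point computation; this is NOT
# the real average_bulk_microphysics_tendencies signature.
@noinline function expensive_tendency(q::Float64, T::Float64)
    return q * exp(-T / 300.0) + sqrt(abs(q)) * 0.5
end

# Caller: because the callee is marked @noinline, the compiler emits a call
# here instead of merging the callee's body into this loop.
function sum_tendencies(qs::Vector{Float64}, Ts::Vector{Float64})
    total = 0.0
    for i in eachindex(qs, Ts)
        total += expensive_tendency(qs[i], Ts[i])
    end
    return total
end

qs = [1.0e-3, 2.0e-3, 3.0e-3]
Ts = [280.0, 290.0, 300.0]
result = sum_tendencies(qs, Ts)
```

The result is the same with or without the macro; only the generated code differs.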

@petebachant petebachant requested a review from dennisYatunin May 8, 2026 00:06
@petebachant petebachant moved this to In review in Performance May 8, 2026
@codecov

codecov Bot commented May 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.19%. Comparing base (5dd0a90) to head (2680d44).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #713   +/-   ##
=======================================
  Coverage   92.19%   92.19%           
=======================================
  Files          55       55           
  Lines        2420     2420           
=======================================
  Hits         2231     2231           
  Misses        189      189           
Components   Coverage Δ
src          93.11% <100.00%> (ø)
ext          69.47% <ø> (ø)

Member

@dennisYatunin dennisYatunin left a comment

Claude's explanation is a bit suspicious given that the quadrature loop isn't being unrolled, but a 10% speedup sounds great! I'll think about how we can turn this into a simpler example for ClimaCore's compiler stress tests.

@petebachant petebachant self-assigned this May 11, 2026
@petebachant
Member Author

petebachant commented May 11, 2026

Hypothesis: If Claude is correct, the issue comes from the quadrature loop being unrolled, so if we de-unroll it, we may be able to get these benefits without the performance hit on CPU.

This might be possible by dropping the quadrature order from the type and moving it into a value. The field's type can be an integer, and its value is the number of quadrature loop iterations.
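
A minimal sketch of the type-versus-value idea, under stated assumptions: `TypeLevelQuad`, `ValueLevelQuad`, and `integrate` are hypothetical names, and a midpoint rule on [0, 1] stands in for whatever quadrature the package actually uses. With the order in a type parameter, the loop bound is a compile-time constant and each order gets its own specialization. With the order in an `Int` field, the bound is an ordinary runtime value. Whether this changes unrolling in the real kernel is untested here.

```julia
# Hypothetical types for illustration; not the package's actual quadrature type.

# Order encoded in the type: N is a compile-time constant, so each N gets its
# own specialized method instance and the loop bound is known to the compiler.
struct TypeLevelQuad{N} end

# Order stored as a value: one concrete type for all orders, and the loop
# bound is an ordinary runtime integer.
struct ValueLevelQuad
    order::Int
end

# Midpoint rule on [0, 1] with N points, N taken from the type parameter.
function integrate(f, ::TypeLevelQuad{N}) where {N}
    acc = 0.0
    for i in 1:N
        acc += f((i - 0.5) / N)
    end
    return acc / N
end

# Same rule, with the number of points read from a field at run time.
function integrate(f, quad::ValueLevelQuad)
    n = quad.order
    acc = 0.0
    for i in 1:n
        acc += f((i - 0.5) / n)
    end
    return acc / n
end

a = integrate(x -> x^2, TypeLevelQuad{4}())
b = integrate(x -> x^2, ValueLevelQuad(4))
```

Both calls compute the same sum. The difference is only in what the compiler knows about the loop bound when it generates code.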

@trontrytel
Member

Is this PR something that should be merged or closed?

@petebachant
Member Author

I opened CliMA/ClimaAtmos.jl#4503, which retains the GPU performance gains, moves the changes to Atmos, and avoids the 1.12 regression, but that one is a little uglier. Any preference from your end?

@trontrytel
Member

No preference. Go with whichever option you think is better.

@trontrytel
Member

Although I'm making some changes in #717, where I group some of the outputs into a tuple, and Claude thinks that performance will be unaffected only if I keep inlining...

@petebachant
Member Author

I'm actually curious if the compiler will do a better job on either device if there is no macro on the function. I will give it a try, and if not, close this PR and focus on Atmos. It's not great to have to dig this deep into the stack for the performance gains.

@petebachant petebachant marked this pull request as draft May 22, 2026 18:48
@petebachant
Member Author

Running with no macro produced no performance change. Closing this and pursuing in Atmos.

@github-project-automation github-project-automation Bot moved this from In progress to Done in Performance May 22, 2026
@trontrytel
Member

Thank you!
