@noinline average_bulk_microphysics_tendencies to reduce register pressure #713
petebachant wants to merge 5 commits into
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #713 +/- ##
=======================================
Coverage 92.19% 92.19%
=======================================
Files 55 55
Lines 2420 2420
=======================================
Hits 2231 2231
Misses 189 189
|
dennisYatunin left a comment
Claude's explanation is a bit suspicious given that the quadrature loop isn't being unrolled, but a 10% speedup sounds great! I'll think about how we can turn this into a simpler example for ClimaCore's compiler stress tests.
|
Hypothesis: If Claude is correct, the issue comes from the quadrature loop being unrolled, so if we de-unroll it we may be able to get these benefits without the performance hit on CPU. This might be possible by dropping the quadrature order from the type and moving it into the value: the type can be an Int, and the value is the number of quadrature loops.
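To make the hypothesis concrete, here is a minimal sketch of the two designs being contrasted. Everything in it is hypothetical: `StaticQuad`, `DynamicQuad`, and `integrate` are illustrative names, not the package's API, and a midpoint rule stands in for whatever quadrature scheme the real kernel uses.

```julia
# Hypothetical illustration only; none of these names come from the PR.

# Variant A: the quadrature order N lives in the type, so the trip count is a
# compile-time constant and the compiler is free to fully unroll the loop.
struct StaticQuad{N} end

function integrate(f, ::StaticQuad{N}) where {N}
    acc = 0.0
    for i in 1:N                      # N is known at compile time
        acc += f((i - 0.5) / N) / N   # midpoint rule on [0, 1] (stand-in scheme)
    end
    return acc
end

# Variant B: the order is an ordinary Int field, so the trip count is only
# known at run time and the loop is compiled as a loop.
struct DynamicQuad
    order::Int
end

function integrate(f, q::DynamicQuad)
    n = q.order
    acc = 0.0
    for i in 1:n                      # n is a run-time value
        acc += f((i - 0.5) / n) / n
    end
    return acc
end
```

Both variants compute the same number; they differ only in what the compiler knows when it generates code. Whether variant B actually lowers register pressure on GPU without slowing down the CPU path would have to be measured.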
|
Is this PR something that should be merged or closed?
|
I opened CliMA/ClimaAtmos.jl#4503 to retain the GPU performance gains, move the changes to Atmos, and avoid the 1.12 regression, but that one is a little uglier. Any preference from your end?
|
No preference. Whichever option you think is better.
|
Although I'm making some changes here: #717, where I grouped some outputs into a tuple. And Claude thinks that the performance will be unaffected only if I keep inlining...
|
I'm actually curious if the compiler will do a better job on either device if there is no macro on the function. I will give it a try, and if not, close this PR and focus on Atmos. It's not great to have to dig this deep into the stack for the performance gains.
|
Running with no macro produced no performance change. Closing this and pursuing in Atmos.
|
Thank you!
This kernel is now the hottest in prog EDMF 1M AMIP by a long shot, and this change produces a ~10% speedup (kernel analysis notebook). Disclaimer: the explanatory comments were written by Claude; I don't yet have a deep understanding of what's going on here!
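For readers unfamiliar with the annotations compared in this thread, the sketch below shows the three variants side by side. The helper functions are hypothetical stand-ins, not the body of `average_bulk_microphysics_tendencies`. Only `@noinline` and `@inline` are real Julia macros.

```julia
# Hypothetical stand-ins for a tendency-averaging helper; not the PR's code.

# Ask the compiler to keep this as a real function call. The idea discussed in
# the PR is that the callee's temporaries then stay out of the caller's body.
@noinline function avg_noinline(a, b)
    return (a + b) / 2
end

# Ask the compiler to paste the body into every caller.
@inline function avg_inline(a, b)
    return (a + b) / 2
end

# No macro: the compiler's own inlining heuristics decide.
function avg_default(a, b)
    return (a + b) / 2
end

# A toy "kernel" that calls the helper once per element.
kernel(f, xs, ys) = sum(f(x, y) for (x, y) in zip(xs, ys))
```

All three helpers return identical results, so a call like `kernel(avg_noinline, xs, ys)` can be swapped for either of the others without changing the answer. Only the generated code, and therefore possibly the register usage and timing, can differ, and that has to be measured per device and Julia version.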