Optimize op_conv_vef_face kernel #12

Draft
pzehner wants to merge 3 commits into cea-trust-platform:next from pzehner:next-scratch

pzehner commented Mar 28, 2025

Contributor

This PR aims to optimize the large convective kernel in src/VEF/Operateurs/Op_Conv/Op_Conv_VEF_Face.cpp. It replaces temporary arrays in local memory with Kokkos views in scratch memory.

pledac commented Mar 28, 2025

Member

Thanks Paul, I will have a look on Monday.

pledac commented Mar 28, 2025

Member

For the Op_Conv_VEF_Face kernel, we see a speedup between 18% and 34% on an Nvidia A6000, according to our GPU test cases.
On H100, the speedup drops to between 6% and 14%.

And strangely, it seems slower on A100...

I merged your code into a local branch here because the pattern is very interesting: thanks to your work, we now know that the local static arrays are not held in registers but in global memory, and they are replaced here by faster scratch memory. The code can switch between the two implementations (with and without scratch memory) through a TRUST_USE_SCRATCH_MEMORY environment variable, for testing.
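
As an illustration of such a runtime switch, here is a minimal sketch in plain C++. Only the variable name TRUST_USE_SCRATCH_MEMORY comes from the comment above. The helper names and the rule for which values count as "enabled" are assumptions, not the actual TRUST implementation.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not from the PR): interprets the raw value of the
// environment variable. Unset (nullptr), "" and "0" are treated as disabled,
// any other value as enabled. The accepted values are an assumption.
inline bool scratch_enabled_from(const char* value) {
  if (value == nullptr) return false;
  const std::string v(value);
  return !v.empty() && v != "0";
}

// Hypothetical helper: reads TRUST_USE_SCRATCH_MEMORY once and caches the
// answer, so the environment is not queried at every kernel launch.
inline bool use_scratch_memory() {
  static const bool enabled =
      scratch_enabled_from(std::getenv("TRUST_USE_SCRATCH_MEMORY"));
  return enabled;
}
```

A caller could then branch on use_scratch_memory() to launch either the scratch-memory variant or the original kernel, which makes it possible to compare both on the same build.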

pledac commented Mar 28, 2025

Member

I am adding Adrien and Rémi to discuss the benefit/complexity ratio of using scratch memory. To give an idea, a 30% speedup is the probable gain from using the right layout on this kernel. What bothers me, for example, is the warp size, set here to 32. Is this value GPU specific? Is it the same on AMD, and what about future GPU cards in 10 years? Kokkos provides performance portability and, in my limited understanding, the developer should not have to care about this value.

According to Hari's tests, using scratch memory through hierarchical memory does not pay off on other kernels (such as the diffusion one).
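
On the team size of 32 questioned above: Nvidia GPUs execute 32-thread warps, while AMD GPUs such as the MI250X use 64-thread wavefronts. The sketch below shows why the value matters for scratch memory, assuming each thread of a team gets its own slice of scratch for its temporaries. The helper name and the 12 doubles per thread are illustrative, not taken from the PR.

```cpp
#include <cstddef>

// Hypothetical helper (not from the PR): bytes of team scratch needed when
// every thread of a team owns `doubles_per_thread` temporary values.
inline std::size_t scratch_bytes_per_team(std::size_t team_size,
                                          std::size_t doubles_per_thread) {
  return team_size * doubles_per_thread * sizeof(double);
}
```

With 12 doubles per thread and 8-byte doubles, a team of 32 requests 3072 bytes and a team of 64 requests 6144 bytes, so under this assumption the scratch request has to be derived from the team size rather than fixed. Kokkos also accepts Kokkos::AUTO as a team size, which leaves the choice to the backend.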

pledac self-assigned this Mar 28, 2025

pledac commented Mar 29, 2025

Member

On an AMD MI250X, scratch memory causes a slowdown of between 7% and 25% (warp size 64?).

Remove optimization of scratch memory size for order 3.