support `csrc/flash_attn_with_bias_and_mask/src/fmha/smem_tile.h` cuda132 build by gouzil · Pull Request #153 · PaddlePaddle/flash-attention

gouzil · 2026-05-25T03:09:32Z

feat

Make csrc/flash_attn_with_bias_and_mask/src/fmha/smem_tile.h compile with CUDA 13.2. Adapting FA3 to CUDA 13.2 is still fairly troublesome, so for now its compilation is skipped in the main Paddle repository.

The complete set of changes (which only gets the build to compile) can be seen in #141.
Paddle PR adapting to CUDA 13.2: PaddlePaddle/Paddle#78720

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates SM90 pipeline barrier initialization to correctly handle non-warpgroup-multiple consumer counts, and refactors FMHA shared-memory tile fragment loads/stores for readability and reuse.

Changes:

Round up params.num_consumers to whole warp-groups when computing mbarrier arrival counts (flashmask v2 + flash-attn v3).
Refactor Smem_tile_transpose fragment store/load into helper methods and apply minor formatting cleanups.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
csrc/flashmask_v2/sm90_pipeline_no_cluster.hpp	Adjusts consumer arrival count computation to use ceil(#consumers / warpgroup_threads).
csrc/flash_attn_with_bias_and_mask/src/fmha/smem_tile.h	Extracts fragment load/store helpers; restructures offset computations and removes dead comments.
csrc/flash_attn_v3/sm90_pipeline_no_cluster.hpp	Same barrier consumer arrival count rounding fix as flashmask v2.
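
The rounding described in the overview can be sketched on the host as a ceiling division. This is a minimal illustration and not the code from the PR: `kThreadsPerWarpGroup`, `consumer_warpgroups`, and `consumer_warpgroups_truncated` are hypothetical names, and the warp-group size of 128 threads is taken from the "1 out of 128 threads" wording in the diff comment. The truncating variant is shown only for contrast; the source does not show how the count was computed before the change.

```cpp
#include <cstdint>

// Assumption: one warp group is 128 threads (4 warps of 32 threads).
constexpr uint32_t kThreadsPerWarpGroup = 128;

// Hypothetical helper: number of whole warp groups needed to cover
// num_consumers threads, i.e. ceil(num_consumers / 128).
constexpr uint32_t consumer_warpgroups(uint32_t num_consumers) {
    return (num_consumers + kThreadsPerWarpGroup - 1) / kThreadsPerWarpGroup;
}

// Truncating division, for contrast only. It is exact only when
// num_consumers is a multiple of the warp-group size.
constexpr uint32_t consumer_warpgroups_truncated(uint32_t num_consumers) {
    return num_consumers / kThreadsPerWarpGroup;
}

// A single consumer warp (32 threads) still counts as one warp group
// when rounding up, but as zero when truncating.
static_assert(consumer_warpgroups(32) == 1, "rounds up to one group");
static_assert(consumer_warpgroups_truncated(32) == 0, "truncates to zero");

// Multiples of 128 give the same result either way.
static_assert(consumer_warpgroups(256) == 2, "exact multiple");
static_assert(consumer_warpgroups_truncated(256) == 2, "exact multiple");
```

With truncation, a barrier arrival count derived from a single consumer warp would be zero, which is the case the new comment ("A single consumer warp still needs one") appears to guard against.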

+        uint32_t offset = smem_ + base_offset;
+        const uint32_t reg0 = frag.reg(0);
+        const uint32_t reg1 = frag.reg(1);
+        const uint32_t reg2 = frag.reg(2);
+        const uint32_t reg3 = frag.reg(3);
+        fmha::sts(offset + 0 * BYTES_PER_ROW, reg0);
+        fmha::sts(offset + 8 * BYTES_PER_ROW, reg2);
+        offset ^= 4 * BYTES_PER_STS;
+        fmha::sts(offset + 0 * BYTES_PER_ROW, reg1);
+        fmha::sts(offset + 8 * BYTES_PER_ROW, reg3);


 // PipelineTmaAsync before v3.6.0 where only 1 out of 128 threads signals the barrier.
 //
-// Assumption: params.num_consumers % NumThreadsPerWarpGroup == 0
+// Count consumers in whole warpgroups. A single consumer warp still needs one


 // PipelineTmaAsync before v3.6.0 where only 1 out of 128 threads signals the barrier.
 //
-// Assumption: params.num_consumers % NumThreadsPerWarpGroup == 0
+// Count consumers in whole warpgroups. A single consumer warp still needs one


umiswing · 2026-05-26T11:33:32Z

+        const uint32_t reg3 = frag.reg(3);
+        fmha::sts(offset + 0 * BYTES_PER_ROW, reg0);
+        fmha::sts(offset + 8 * BYTES_PER_ROW, reg2);
+        offset ^= 4 * BYTES_PER_STS;


Isn't this offset transformation non-equivalent to the one before the change?
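
As background for this question, the effect of `offset ^= 4 * BYTES_PER_STS` can be checked on the host. The sketch below is illustrative only and does not reproduce the pre-change code, which is not shown in this thread. `kBytesPerSts = 16` is an assumed value (the real constant is defined in smem_tile.h), and `toggled` and `xor_equals_add` are hypothetical helper names.

```cpp
#include <cstdint>

// Assumed value for illustration; the real BYTES_PER_STS lives in smem_tile.h.
constexpr uint32_t kBytesPerSts = 16;

// 4 * 16 = 64, which is a single bit (bit 6), so the XOR toggles one bit.
constexpr uint32_t kToggle = 4 * kBytesPerSts;

// Hypothetical helper mirroring `offset ^= 4 * BYTES_PER_STS`.
constexpr uint32_t toggled(uint32_t offset) { return offset ^ kToggle; }

// XOR with a single-bit mask equals adding the mask only when that bit is
// clear in the offset; when the bit is set, it subtracts the mask instead.
constexpr bool xor_equals_add(uint32_t offset) {
    return toggled(offset) == offset + kToggle;
}

static_assert(toggled(0) == 64, "bit clear: moves forward by 64 bytes");
static_assert(toggled(64) == 0, "bit set: moves back by 64 bytes");
static_assert(toggled(toggled(200)) == 200, "applying the XOR twice is a no-op");
static_assert(xor_equals_add(0) && !xor_equals_add(64), "not a plain add");
```

Because the XOR is applied in place, every store issued after it lands in the toggled position and every store before it in the original one. Two versions of the function are therefore equivalent only if each register is stored on the same side of the XOR and the same number of toggles precedes it.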

…the code

umiswing · 2026-05-26T15:17:56Z

LGTM

[test] support cuda132 build

e0e68b2

Copilot AI review requested due to automatic review settings May 25, 2026 03:09

Copilot AI reviewed May 25, 2026

rollback

bda9b37

gouzil changed the title ~~[WIP][test] support cuda132 build~~ support csrc/flash_attn_with_bias_and_mask/src/fmha/smem_tile.h cuda132 build May 25, 2026

gouzil mentioned this pull request May 25, 2026

【Hackathon 10th Spring No.51】Environment Adaptation support Paddle on CUDA 13.2 PaddlePaddle/Paddle#78720

Open

umiswing suggested changes May 26, 2026

fix store_fragment The incorrect calculation of the memory offset in …

5518810

…the code

GuoxiaWang approved these changes May 26, 2026

GuoxiaWang merged commit 1f3e4bb into PaddlePaddle:main May 26, 2026
1 check passed

GuoxiaWang merged 3 commits into PaddlePaddle:main from gouzil:test/flash_support_cuda132
