【Hackathon 10th Spring No.51】Environment Adaptation support Paddle on CUDA 13.2 by gouzil · Pull Request #78720 · PaddlePaddle/Paddle

gouzil · 2026-04-19T16:54:43Z

PR Category

Environment Adaptation

PR Types

Improvements

Description

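This PR advances Paddle's adaptation to CUDA 13.2. It is currently submitted as Draft / WIP so that build and verification work can continue.

This branch mainly contains the following changes:

Add GPU arch configuration for CUDA 13.x and wire the 13.x toolchain branch into cuda.cmake
~~2. Adjust the NVCC flags passed through to warpctc and warprnnt, filtering out legacy gencode targets that CUDA 13.x no longer supports~~
Temporarily disable the currently unstable flash-attn and some FP8 kernel paths in default CUDA 13.x builds, to avoid known third-party build and compiler issues
Adjust kernel registration and the explicit instantiation of range/arange to work around internal compiler errors in nvcc 13.x
Add a 13.2 install branch in tools/dockerfile/manylinux/common/install_cuda.sh
Add installation logic for the cuDNN, NCCL, TensorRT, and cuSPARSELt versions matching CUDA 13.2
Add a cuda13.2 stage in tools/dockerfile/manylinux/Dockerfile that calls install_cuda.sh 13.2
This PR also lets x86 and arm share the same Dockerfile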

To do

Merge the flash-attn build fix: ~~Fix CUDA 13.2 flash attention build compatibility flash-attention#141~~ support csrc/flash_attn_with_bias_and_mask/src/fmha/smem_tile.h cuda132 build flash-attention#153

Does this cause a precision change?

No

paddle-bot · 2026-04-19T16:54:48Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.


Copilot

Pull request overview

This PR advances PaddlePaddle’s CUDA 13.2 environment/toolchain adaptation by updating CUDA arch configuration, adjusting third-party NVCC flag handling for removed GPU targets, and applying workarounds for CUDA 13.x compiler issues (e.g., kernel registration / explicit instantiation).

Changes:

Add CUDA 13.x GPU arch lists and route CUDA 13.x toolchain behavior in cmake/cuda.cmake.
Adjust build logic to avoid CUDA 13.x build breakages by disabling unstable components/paths (flash-attn, some FP8 kernels) and updating kernel template instantiation patterns.
Add a CUDA 13.2 manylinux build Dockerfile and refine third-party (warpctc/warprnnt) NVCC flags to drop unsupported gencode targets.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.


File	Description
tools/dockerfile/manylinux/Dockerfile-132	Adds a CUDA 13.2 manylinux build image with Python 3.12, toolchain utilities, and NCCL packages.
paddle/phi/kernels/gpu/range_kernel.cu	Replaces `decltype`-based explicit instantiation with standard explicit instantiation declarations for CUDA 13.x compatibility.
paddle/phi/kernels/gpu/arange_kernel.cu	Same explicit-instantiation adjustment as `range_kernel.cu` for CUDA 13.x.
paddle/phi/kernels/CMakeLists.txt	Skips selected FP8 CUDA kernel sources when building with CUDA ≥ 13.0 to avoid NVCC internal errors.
paddle/phi/core/kernel_registry.h	Introduces a CUDA 13.x-specific workaround for template instantiation during kernel registration to avoid NVCC `cudafe++` crashes.
cmake/third_party.cmake	Disables flash-attn for CUDA 13.x default builds due to toolchain instability.
cmake/external/warprnnt.cmake	Filters unsupported legacy `-gencode` targets from NVCC flags for CUDA ≥ 13.0 when building warprnnt.
cmake/external/warpctc.cmake	Filters unsupported legacy `-gencode` targets from NVCC flags for CUDA ≥ 13.0 when building warpctc.
cmake/cuda.cmake	Adds CUDA 13 arch presets and selects CUDA 13-specific arch lists for NVCC flag generation.


Copilot · 2026-04-19T17:02:59Z

+ARG BASE_TARGET=cuda${CUDA_VERSION}
+
+FROM nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04 as base


CUDA_VERSION / BASE_TARGET build args are defined but the base image tag is hard-coded (FROM nvcr.io/nvidia/cuda:13.2.0-...). This makes overrides like --build-arg CUDA_VERSION=... ineffective and leaves BASE_TARGET unused. Consider either wiring the args into FROM (e.g., via FROM ...:${CUDA_VERSION}...) or removing the unused args to avoid confusion in the build pipeline.

Suggested change

ARG BASE_TARGET=cuda${CUDA_VERSION}

-FROM nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04 as base
+FROM nvcr.io/nvidia/cuda:${CUDA_VERSION}.0-cudnn-devel-ubuntu24.04 as base

Copilot · 2026-04-19T17:03:00Z

+  if(${CMAKE_CUDA_COMPILER_VERSION} GREATER_EQUAL 13.0)
+    message(
+      STATUS
+        "flash-attn is disabled for default CUDA 13.x builds because the bundled third_party/flashattn source build is not yet stable with this toolchain."
+    )


For CUDA >= 13.0 the code prints that flash-attn is disabled, but it doesn't explicitly force WITH_FLASHATTN / WITH_FLASHATTN_V3 to OFF. If a user (or cached CMake value) sets WITH_FLASHATTN=ON, later if(WITH_FLASHATTN) blocks (e.g., in paddle/phi/CMakeLists.txt) can try to add dependencies on a flashattn target that was never created in this branch, causing configuration/build failures. Consider explicitly setting WITH_FLASHATTN and WITH_FLASHATTN_V3 to OFF in this branch.

gouzil · 2026-04-20T06:25:37Z

-template decltype(RangeNullaryKernel<int64_t, GPUContext>) RangeNullaryKernel;
-template decltype(RangeNullaryKernel<int, GPUContext>) RangeNullaryKernel;
+template void RangeNullaryKernel<int64_t, GPUContext>(const GPUContext&,
+                                                      const int64_t,
+                                                      const int64_t,
+                                                      const int64_t,
+                                                      DenseTensor*);
+template void RangeNullaryKernel<int, GPUContext>(
+    const GPUContext&, const int, const int, const int, DenseTensor*);


Minimal reproduction. The same case 1 compiles on CUDA 13.0, but on CUDA 13.2 it fails with internal error: assertion failed at: "types.h", line 413 in rout_type_supp

template <typename Function>
struct KernelArgsParseFunctor;

template <typename Return, typename... Args>
struct KernelArgsParseFunctor<Return (*)(Args...)> {
  static void Parse(int, int) {}
};

template <typename Function, Function function>
struct KernelImpl {
  static void Compute() {}
  static void VariadicCompute() {}
};

struct KernelRegistrar {
  KernelRegistrar(const char*,
                  void (*)(int, int),
                  void (*)(int, int),
                  void (*)(),
                  void*) {}
};

template <typename T, typename Context>
void ShortKernel(Context, int) {}

#if CASE == 1
template decltype(ShortKernel<float, int>) ShortKernel<float, int>;
#elif CASE == 2
using FunctionType = decltype(ShortKernel<float, int>);
FunctionType* function_ptr = &ShortKernel<float, int>;
#elif CASE == 3
using FunctionPtrType = decltype(&ShortKernel<float, int>);
static auto* compute_ptr =
    &KernelImpl<FunctionPtrType, &ShortKernel<float, int>>::Compute;
static void* variadic_ptr = reinterpret_cast<void*>(
    &KernelImpl<FunctionPtrType, &ShortKernel<float, int>>::VariadicCompute);
#elif CASE == 4
template void ShortKernel<float, int>(int, int);
#elif CASE == 5
static void ProbeArgsDef(int, int) {}
using RegisterFunction = decltype(&ShortKernel<float, int>);
static const KernelRegistrar probe_registrar(
    "probe",
    &KernelArgsParseFunctor<RegisterFunction>::Parse,
    &ProbeArgsDef,
    &KernelImpl<RegisterFunction, &ShortKernel<float, int>>::Compute,
    reinterpret_cast<void*>(
        &KernelImpl<RegisterFunction, &ShortKernel<float, int>>::
            VariadicCompute));
#else
int unused = 0;
#endif


gouzil · 2026-04-22T12:04:55Z

I built a test version with the cmake command below (remember to change the suffix):

cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_GPU=ON \
    -DWITH_SHARED_PHI=ON -DWITH_TENSORRT=ON -DWITH_OPENVINO=OFF -DWITH_ROCM=OFF -DWITH_CINN=ON \
    -DWITH_DISTRIBUTE=ON -DWITH_MKL=ON -DWITH_AVX=ON -DCUDA_ARCH_NAME=Manual -DNEW_RELEASE_PYPI=OFF -DNEW_RELEASE_ALL=OFF \
    -DNEW_RELEASE_JIT=OFF -DWITH_PYTHON=ON -DCUDNN_ROOT=/usr/ -DWITH_TESTING=OFF -DWITH_COVERAGE=OFF -DWITH_INCREMENTAL_COVERAGE=OFF \
    -DCMAKE_MODULE_PATH=/opt/rocm/hip/cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DWITH_INFERENCE_API_TEST=OFF \
    -DINFERENCE_DEMO_INSTALL_DIR=/root/.cache/inference_demo -DPY_VERSION=3.12 -DCMAKE_INSTALL_PREFIX=/paddle/build \
    -DWITH_PSLIB= -DWITH_GLOO=ON -DWITH_XPU=OFF -DWITH_IPU=OFF -DXPU_SDK_ROOT= -DWITH_XPU_BKCL=OFF -DWITH_XPU_XHPC=OFF -DWITH_XPU_XFT=OFF \
    -DWITH_XPU_XRE5=OFF -DWITH_XPU_FFT=OFF -DWITH_ARM=OFF -DWITH_STRIP=ON -DON_INFER=ON -DCUDA_ARCH_BIN="75 80 86 90 100 103 120" -DWITH_RECORD_BUILDTIME=OFF \
    -DWITH_UNITY_BUILD=OFF -DWITH_ONNXRUNTIME=OFF -DWITH_CUDNN_FRONTEND=OFF -DWITH_CPP_TEST=OFF -DWITH_FA_BUILD_WITH_CACHE=OFF

https://github.com/gouzil/Paddle/releases/download/v3.5-cu132/paddlepaddle_gpu-3.5.0.dev20260421-cp312-cp312-linux_x86_64.whl.1

gouzil · 2026-04-23T02:53:24Z

+ENV WITH_GPU=${WITH_GPU:-ON}
+ENV WITH_AVX=${WITH_AVX:-ON}
+ENV DEBIAN_FRONTEND=noninteractive
+ENV LD_LIBRARY_PATH=/usr/local/cuda-13.2/compat:/usr/local/cuda-13.2/targets/x86_64-linux/lib:/usr/local/cuda-13.2/lib64:$LD_LIBRARY_PATH


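This is hard-coded because the base image already defines CUDA_VERSION as a three-part version number, whereas the dynamic library paths only carry a two-part version number. In addition, we write a separate Dockerfile for each CUDA version, so hard-coding is acceptable here.

The other Dockerfiles should have the same problem.

Related link: https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/13.2.1/ubuntu2404/base/Dockerfile?ref_type=heads#L29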
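The version mismatch described above can be illustrated with a small sketch. It assumes the base image exports a three-part `CUDA_VERSION` such as `13.2.1` and that the toolkit is installed under `/usr/local/cuda-<major>.<minor>`, as in the `LD_LIBRARY_PATH` line quoted above. The helper names `cuda_major_minor` and `cuda_library_path` are hypothetical and are not part of this PR.

```python
def cuda_major_minor(cuda_version: str) -> str:
    """Reduce a version such as '13.2.1' to the two-part form '13.2'."""
    parts = cuda_version.strip().split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError(f"unexpected CUDA version: {cuda_version!r}")
    return ".".join(parts[:2])


def cuda_library_path(cuda_version: str, target: str = "x86_64-linux") -> str:
    """Build the library search path prefix for the given CUDA version.

    The three directories mirror the hard-coded ENV line in Dockerfile-132.
    """
    root = f"/usr/local/cuda-{cuda_major_minor(cuda_version)}"
    return ":".join(
        [f"{root}/compat", f"{root}/targets/{target}/lib", f"{root}/lib64"]
    )


if __name__ == "__main__":
    print(cuda_library_path("13.2.1"))
```

A Dockerfile cannot call Python at `ENV` time, so this only shows the string manipulation that a parameterized build would need; hard-coding `13.2` in a single-version Dockerfile avoids it entirely.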

SigureMo · 2026-04-23T03:41:45Z

+ENV WITH_GPU=${WITH_GPU:-ON}
+ENV WITH_AVX=${WITH_AVX:-ON}
+ENV DEBIAN_FRONTEND=noninteractive
+ENV LD_LIBRARY_PATH=/usr/local/cuda-13.2/compat:/usr/local/cuda-13.2/targets/x86_64-linux/lib:/usr/local/cuda-13.2/lib64:$LD_LIBRARY_PATH


Can this Dockerfile cover both x86 and arm, so that arm does not need a separately maintained copy?

gouzil · 2026-04-24T14:35:56Z

The first cmake run occasionally fails with the following error:

CMake Error at cmake/cinn/core.cmake:27 (add_library):
  Cannot find source file:

    /paddle/build/paddle/cinn/hlir/dialect/operator/ir/cinn_op.cc

  Tried extensions .c .C .c++ .cc .cpp .cxx .cu .mpp .m .M .mm .ixx .cppm
  .ccm .cxxm .c++m .h .hh .h++ .hm .hpp .hxx .in .txx .f .F .for .f77 .f90
  .f95 .f03 .hip .ispc
Call Stack (most recent call first):
  paddle/cinn/hlir/dialect/operator/ir/CMakeLists.txt:61 (cinn_cc_library)


CMake Error at cmake/cinn/core.cmake:27 (add_library):
  No SOURCES given to target: cinn_op_dialect
Call Stack (most recent call first):
  paddle/cinn/hlir/dialect/operator/ir/CMakeLists.txt:61 (cinn_cc_library)


CMake Generate step failed.  Build files cannot be regenerated correctly.


SigureMo

Review from @codex (GPT-5.5 xhigh).

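The overall direction is OK, but I would first block on CUDA 13.2 arch coverage: this PR targets CUDA 13.2 adaptation, yet the current default release arch set does not cover some Blackwell targets that nvcc 13.2 already supports.

Blocking:

1. The default CUDA 13 arch set in cmake/cuda.cmake contains only 75 80 86 90 100, and CUDA_ARCH_NAME=Blackwell still maps only to 100. The official CUDA 13.2 nvcc support list already includes targets such as sm_103/sm_110/sm_120/sm_121, so a CUDA 13.2 wheel built with CUDA_ARCH_NAME=All will not include these targets by default. I suggest completing the release arch strategy, or at least explicitly covering them via PTX/JIT. Reference: https://docs.nvidia.com/cuda/archive/13.2.0/pdf/CUDA_Compiler_Driver_NVCC.pdf

Non-blocking but should sync:
2. The PR description still says Draft / WIP and states that flash-attn is temporarily disabled in default CUDA 13.x builds. However, the PR is no longer a draft, and the current cmake/third_party.cmake still includes external/flashattn and turns on WITH_FLASHATTN for CUDA >= 13.0 when arch >= 80. I suggest syncing the PR body so reviewers are not misled.
3. CUDA_VERSION / BASE_TARGET in Dockerfile-132 are still decorative: both FROM and LD_LIBRARY_PATH hard-code 13.2. If this is a single-version Dockerfile, the args can be removed; if it is meant to be reusable, the args need to be actually wired in.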
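The coverage gap in the blocking item can be checked mechanically. The sketch below is an illustration only: the two arch lists are copied from the review text above, while the function names `parse_archs` and `missing_targets` are hypothetical and do not exist in Paddle's build scripts.

```python
# Default CUDA 13 arch set quoted in the review (paddle_known_gpu_archs13).
DEFAULT_CUDA13_ARCHS = "75 80 86 90 100"
# Blackwell targets the review names as supported by nvcc 13.2.
NAMED_NVCC_132_TARGETS = "103 110 120 121"


def parse_archs(arch_list: str) -> set[int]:
    """Parse a space-separated CMake-style arch list into integers."""
    return {int(token) for token in arch_list.split()}


def missing_targets(configured: str, wanted: str) -> list[int]:
    """Return the wanted archs that have no entry in the configured list."""
    return sorted(parse_archs(wanted) - parse_archs(configured))


if __name__ == "__main__":
    print(missing_targets(DEFAULT_CUDA13_ARCHS, NAMED_NVCC_132_TARGETS))
```

With the lists as quoted, every named target is reported as missing, which is the point the review makes about `CUDA_ARCH_NAME=All`.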

SigureMo · 2026-05-14T06:43:52Z

  set(paddle_known_gpu_archs10 "50 52 60 61 70 75")
  set(paddle_known_gpu_archs11 "50 60 61 70 75 80")
  set(paddle_known_gpu_archs12 "50 60 61 70 75 80 90 100")
+  set(paddle_known_gpu_archs13 "75 80 86 90 100")


Review from @codex (GPT-5.5 xhigh).

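The default CUDA 13 arch set here contains only 75 80 86 90 100, and CUDA_ARCH_NAME=Blackwell further down still maps only to 100. The official CUDA 13.2 nvcc support list already includes targets such as sm_103/sm_110/sm_120/sm_121; if the CUDA 13.2 release wheel keeps using CUDA_ARCH_NAME=All, these targets will not be compiled into the wheel by default. I suggest completing the CUDA 13 release arch strategy, or explicitly covering these new architectures via PTX/JIT.

@risemeup1 @swgu98 please confirm this; I think at least 103 needs to be added.

Priority: P0

Follow-up: the current head b562e19d8d4432387a58fa8fa901debfb6fe6d5c is an empty commit, and the CUDA 13 arch configuration has not changed. cmake/cuda.cmake:22 still has only 75 80 86 90 100, and CUDA_ARCH_NAME=Blackwell at cmake/cuda.cmake:195 still maps only to 100, so the default CUDA 13.2 release/Blackwell package still does not cover the additional Blackwell targets noted above. Please complete the CUDA 13.2 release arch strategy, or explicitly add a PTX/JIT coverage strategy.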

SigureMo · 2026-05-14T06:43:52Z

+ARG CUDA_VERSION=13.2
+ARG BASE_TARGET=cuda${CUDA_VERSION}
+
+FROM nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04 as base


Review from @codex (GPT-5.5 xhigh).

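CUDA_VERSION / BASE_TARGET are defined above, but FROM here still hard-codes 13.2.0, and LD_LIBRARY_PATH below also hard-codes 13.2. If this file is simply a single-version Dockerfile, I suggest removing these unused args; if the configurable semantics should be kept, the args need to be actually wired into FROM and the paths.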


risemeup1111

Summary: thanks for continuing the CUDA 13.2 adaptation. I still see two blockers before this can land: the release arch defaults do not cover CUDA 13.2's supported Blackwell targets, and the flash-attn submodule now depends on a personal fork/unmerged commit.

Blocking findings:

cmake/cuda.cmake still defines the CUDA 13 release arch sets as only 75 80 86 90 100 (cmake/cuda.cmake:22, also the other CUDA 13 sets at lines 14, 30, 38, and 44), and CUDA_ARCH_NAME=Blackwell still maps only to 100 (cmake/cuda.cmake:194). CUDA 13.2 nvcc documents additional Blackwell real targets including sm_103, sm_110, sm_120, and sm_121 (https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html). With the current defaults, CUDA_ARCH_NAME=All / release builds will not emit cubins for these CUDA 13.2 targets, so e.g. SM120/SM121 devices are not actually covered by the CUDA 13.2 wheel strategy. Please either add the intended CUDA 13.2 target set or explicitly add a PTX/JIT strategy that covers them.
.gitmodules changes third_party/flashattn from the PaddlePaddle upstream to https://github.com/gouzil/flash-attention.git (.gitmodules:73-75), and the submodule is advanced to bda9b377..., which is the head of the still-open PaddlePaddle/flash-attention#153. This makes Paddle's build depend on a contributor fork rather than a reviewed PaddlePaddle-owned dependency. Please land the flash-attention change in the PaddlePaddle repo first, then point this submodule back at https://github.com/PaddlePaddle/flash-attention.git and update to that accepted commit.


risemeup1111

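Powered by Nyanpasu. These are AI-generated code review suggestions; maintainers should evaluate them carefully in context.

Follow-up conclusion: checks on the current head 06697b25724422296ac0a8b7e4ba98e3776937ac have passed. The new commits in this round mainly adjust the dependencies and LD_LIBRARY_PATH in Dockerfile-132, and I found no new independent blocker. However, the two blockers from the previous round remain unresolved, so this round is still REQUEST_CHANGES. To avoid duplicating existing active inline threads, no new inline comments were added on the same lines this round.

Status of previous issues:

CUDA 13.2 arch coverage: unresolved. cmake/cuda.cmake:22 is still 75 80 86 90 100, and Blackwell at cmake/cuda.cmake:195 still maps only to 100.
flash-attn submodule source: unresolved. .gitmodules:75 still points to https://github.com/gouzil/flash-attention.git, and third_party/flashattn:1 is still bda9b377...; the related PaddlePaddle/flash-attention#153 is still OPEN.

Priority: P0 cmake/cuda.cmake:22 and cmake/cuda.cmake:195 still do not cover the additional Blackwell targets that CUDA 13.2 supports, so release packages built with CUDA_ARCH_NAME=All / CUDA_ARCH_NAME=Blackwell still will not emit cubins for these targets by default. Please complete the CUDA 13.2 release arch strategy, or explicitly add a PTX/JIT coverage strategy. No new inline comment: an active thread PRRT_kwDOA-qtos6B-4qW already exists at this location.

Priority: P0 .gitmodules:75 and third_party/flashattn:1 still make the main Paddle repository depend on an unmerged flash-attn commit in a personal fork. Please first merge the flash-attention change into PaddlePaddle/flash-attention, then change the submodule URL back to https://github.com/PaddlePaddle/flash-attention.git and update it to the merged commit. No new inline comment: an active thread PRRT_kwDOA-qtos6EvdxN already exists at this location.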

risemeup1111

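Powered by Nyanpasu. These are AI-generated code review suggestions; maintainers should evaluate them carefully in context.

Conclusion: the current head b562e19d8d4432387a58fa8fa901debfb6fe6d5c is an empty commit relative to the previously reviewed 06697b25724422296ac0a8b7e4ba98e3776937ac, so the code is unchanged. The two P0 blockers from the previous round remain unresolved, and Check approval is still failing, so this remains REQUEST_CHANGES.

Status index:

PRRT_kwDOA-qtos6B-4qW / cmake/cuda.cmake:22, cmake/cuda.cmake:195: unresolved. Status was added in thread reply PRRC_kwDOA-qtos7E9hfM; please complete the CUDA 13.2 release/Blackwell arch strategy or PTX/JIT coverage.
PRRT_kwDOA-qtos6EvdxN / .gitmodules:75, third_party/flashattn:1: unresolved. Status was added in thread reply PRRC_kwDOA-qtos7E9hfs; please first merge the official PaddlePaddle/flash-attention commit, then point the submodule back at the official repository and the merged commit.
CI: incomplete. Check approval is currently failing and several checks are still queued/in_progress; the full CI needs to be re-checked after the blockers are fixed.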

risemeup1111

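Powered by Nyanpasu. These are AI-generated code review suggestions; maintainers should evaluate them carefully in context.

Conclusion: this round re-reviews head aa3920f0c43660700e1ba68e63a4ddb38fbc908b. The flash-attn submodule source issue from the previous round has been fixed. The CUDA 13.2 arch coverage issue remains unresolved, and one new P0 inline review comment was added on the current diff line; see the inline comment for details. Current decision: REQUEST_CHANGES.

Inline comments this round: 1 (P0). CI is still not fully green: Check approval and Coverage build are failing, and some checks are still running.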

risemeup1111 · 2026-05-26T16:01:33Z

  set(paddle_known_gpu_archs10 "50 52 60 61 70 75")
  set(paddle_known_gpu_archs11 "50 60 61 70 75 80")
  set(paddle_known_gpu_archs12 "50 60 61 70 75 80 90 100")
+  set(paddle_known_gpu_archs13 "75 80 86 90 100")


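Priority: P0

The current head still does not complete the CUDA 13.2 release/Blackwell arch strategy: the CUDA 13 release arch set here is still only 75 80 86 90 100, and CUDA_ARCH_NAME=Blackwell in the same file still maps only to 100. CUDA 13.2 nvcc already supports more Blackwell targets (such as sm_103/sm_110/sm_120/sm_121), so release packages built with CUDA_ARCH_NAME=All or Blackwell still will not emit cubins for these targets by default. Please complete the CUDA 13.2 arch list, or explicitly add an acceptable PTX/JIT coverage strategy.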
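To make the two options in this comment concrete, the sketch below builds nvcc `-gencode` flags from an arch list: one `code=sm_XX` entry per arch produces a cubin, and an optional `code=compute_XX` entry embeds PTX that the driver can JIT-compile. The function `gencode_flags` is hypothetical and only illustrates the flag shapes; it is not how cmake/cuda.cmake generates its flags, and whether PTX for a given arch is an acceptable fallback for the newer Blackwell devices is the open question in this thread.

```python
def gencode_flags(archs: list[int], ptx_for_highest: bool = True) -> list[str]:
    """Build nvcc -gencode flags: one cubin per arch, plus optional PTX."""
    ordered = sorted(set(archs))
    # code=sm_XX asks nvcc to emit a cubin for that real architecture.
    flags = [f"-gencode arch=compute_{a},code=sm_{a}" for a in ordered]
    if ptx_for_highest and ordered:
        top = ordered[-1]
        # code=compute_XX embeds PTX for the highest listed arch instead of a cubin.
        flags.append(f"-gencode arch=compute_{top},code=compute_{top}")
    return flags


if __name__ == "__main__":
    # Arch set quoted from cmake/cuda.cmake in the review.
    for flag in gencode_flags([75, 80, 86, 90, 100]):
        print(flag)
```

Adding an arch such as 103 to the input list adds a cubin entry for it, which corresponds to the "complete the arch list" option; keeping the list and relying on the trailing PTX entry corresponds to the "PTX/JIT" option.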

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-27 10:00:00

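📋 Review summary

PR overview: adds full CUDA 13.2 environment adaptation to Paddle, including cmake GPU arch configuration, compatibility fixes for the kernel registration macros, new CUDA 13.2 branches in the Dockerfile and install script, and updated Python packaging dependencies.
Scope of changes: cmake/、paddle/phi/core/kernel_registry.h、paddle/phi/kernels/gpu/、setup.py、python/setup.py.in、tools/dockerfile/
Impact tags: [Environment Adaptation] [Operator Mechanism]

Issues

Level	File	Summary
🔴 Bug	`tools/dockerfile/manylinux/common/install_cuda.sh:70`	`install_cusparselt_090_cuda13` hard-codes the x86_64 path, so aarch64 builds fail
🔴 Bug	`tools/dockerfile/manylinux/common/install_cuda.sh:111`	`install_nccl_2297_cuda132` uses `yum`, but Dockerfile-132 is Ubuntu-based, where `yum` does not exist
🔴 Bug	`setup.py:1398`	`cuda_major_version` is always `None`, so the CUDA 13.2 TensorRT adaptation logic never runs
🟡 Suggestion	`setup.py:1406`	The TensorRT list on the non-CUDA 13.2 path is missing `tensorrt==10.3.0` (already present in setup.py.in)

📝 PR convention check

All four required sections of the PR description (PR Category, PR Types, Description, and whether precision changes) are filled in with specific content and comply with the conventions. The title does not use the [Tag] format but carries semantic information, which is acceptable.

Overall assessment

The overall approach to CUDA 13.2 adaptation in this PR is sound; the cmake arch configuration, kernel macro compatibility fixes, and FlashAttn version gating are reasonable. However, there are 3 🔴 P0 issues: the aarch64 cuSPARSELt download lacks a platform branch, the NCCL install script uses a mismatched package manager, and cuda_major_version is never assigned, so the CUDA 13.2 TensorRT dependency logic is dead code. I suggest merging after these are fixed.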

PaddlePaddle-bot · 2026-05-26T16:15:24Z

+    # cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
+    mkdir tmp_cusparselt && pushd tmp_cusparselt
+    wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.9.0.3_cuda13-archive.tar.xz
+    tar xf libcusparse_lt-linux-x86_64-0.9.0.3_cuda13-archive.tar.xz


🔴 Bug The install_cusparselt_090_cuda13 function hard-codes the linux-x86_64 path, so the download fails on aarch64.

Current code:

wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.9.0.3_cuda13-archive.tar.xz

The PR description says Dockerfile-132 supports both x86 and arm, but this function has no platform branch. Suggested fix:

ARCH=$(uname -m)
if [ "$ARCH" = "aarch64" ]; then
    CUSPARSELT_ARCH="linux-aarch64"
    CUSPARSELT_PKG="libcusparse_lt-linux-aarch64-0.9.0.3_cuda13-archive"
else
    CUSPARSELT_ARCH="linux-x86_64"
    CUSPARSELT_PKG="libcusparse_lt-linux-x86_64-0.9.0.3_cuda13-archive"
fi
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/${CUSPARSELT_ARCH}/${CUSPARSELT_PKG}.tar.xz
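The same selection logic can be expressed as a small, testable function. This is a sketch only: the base URL and version string are taken from the `wget` line quoted above, the `linux-aarch64` directory name is copied from the suggestion and has not been verified against NVIDIA's download server, and `cusparselt_url` is a hypothetical name.

```python
import platform

BASE_URL = (
    "https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt"
)
VERSION = "0.9.0.3_cuda13"


def cusparselt_url(machine: str) -> str:
    """Return the cuSPARSELt archive URL for a machine string from `uname -m`."""
    # Directory names follow the review suggestion; the aarch64 one is unverified.
    arch_dir = "linux-aarch64" if machine == "aarch64" else "linux-x86_64"
    return f"{BASE_URL}/{arch_dir}/libcusparse_lt-{arch_dir}-{VERSION}-archive.tar.xz"


if __name__ == "__main__":
    # platform.machine() reports the same value as `uname -m` on Linux.
    print(cusparselt_url(platform.machine()))
```

Keeping the architecture as a parameter rather than reading it inside the function makes both branches checkable on a single host.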

PaddlePaddle-bot · 2026-05-26T16:15:24Z


+function install_nccl_2297_cuda132 {
+    yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
+    yum install -y \


🔴 Bug install_nccl_2297_cuda132 uses yum-config-manager and yum install, but the function is called from install_132, and Dockerfile-132 is based on Ubuntu 24.04 (an APT-based system). The yum command does not exist in the Ubuntu image, so the build will fail.

Dockerfile-132 already installs libnccl2=2.29.7-1+cuda13.2 in its apt-get install stage. I suggest either removing the call to install_nccl_2297_cuda132 from install_132, or reimplementing install_nccl_2297_cuda132 with APT:

function install_nccl_2297_cuda132 {
    apt-get update
    apt-get install -y libnccl2=2.29.7-1+cuda13.2 libnccl-dev=2.29.7-1+cuda13.2
}

PaddlePaddle-bot · 2026-05-26T16:15:24Z

+        )
+        if platform.system() == 'Linux' and cuda_major_version == '13.2':
+            if not version_str and platform.machine() == 'aarch64':
+                return paddle_cuda_requires, ["tensorrt-cu13==10.16.1.11"]


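🔴 Bug The cuda_major_version variable is always None (it is declared but never assigned), so the condition if platform.system() == 'Linux' and cuda_major_version == '13.2': never holds, and the CUDA 13.2 TensorRT adaptation logic for tensorrt-cu13==10.16.1.11 is never executed.

cuda_major_version needs to be assigned from the actual CUDA version after PADDLE_CUDA_INSTALL_REQUIREMENTS is populated, for example:

for cuda_ver in PADDLE_CUDA_INSTALL_REQUIREMENTS:
    # Determine the current CUDA version from an environment variable or the nvcc version
    pass
# Or read it from the build environment variables:
cuda_major_version = env_dict.get("CUDA_MAJOR_VERSION")  # e.g. '13.2'

The same problem also exists in python/setup.py.in.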
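The effect of the unassigned variable can be reproduced in isolation. The sketch below is a simplification under stated assumptions: `tensorrt_override` mirrors only the two conditions quoted in the diff above and returns `None` to mean "fall through to the existing logic", whereas the real setup.py function returns a tuple and has more branches. The environment variable name `CUDA_MAJOR_VERSION` comes from the suggestion above and is not confirmed to exist in Paddle's build; both function names are hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_cuda_version(env: Mapping[str, str]) -> Optional[str]:
    """Read a two-part CUDA version such as '13.2' from the build environment."""
    # Hypothetical variable name, taken from the review suggestion.
    return env.get("CUDA_MAJOR_VERSION") or None


def tensorrt_override(
    system: str, machine: str, cuda_version: Optional[str], version_str: str = ""
) -> Optional[list[str]]:
    """Mirror the quoted condition; None means the CUDA 13.2 branch was skipped."""
    if system == "Linux" and cuda_version == "13.2":
        if not version_str and machine == "aarch64":
            return ["tensorrt-cu13==10.16.1.11"]
    return None


if __name__ == "__main__":
    # With cuda_version left as None the branch can never be taken.
    print(tensorrt_override("Linux", "aarch64", None))
    print(tensorrt_override("Linux", "aarch64", resolve_cuda_version(os.environ)))
```

Passing the version in explicitly, instead of relying on a module-level variable, makes the dead branch visible in a unit test.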

PaddlePaddle-bot · 2026-05-26T16:15:24Z

+            platform.system() == 'Windows'
+            and version_default is not None
+            and version_default >= 10
        ):


🟡 Suggestion The PADDLE_TENSORRT_INSTALL_REQUIREMENTS list on the non-CUDA 13.2 path is missing tensorrt==10.3.0.

python/setup.py.in already includes this version (see lines 886-892 of setup.py.in), but the corresponding place in setup.py (inside the newly added elif branch) only has:

"tensorrt==8.5.3.1",
"tensorrt==8.6.0",
"tensorrt==8.6.1.post1",
# missing "tensorrt==10.3.0"

I suggest adding "tensorrt==10.3.0" to stay consistent with setup.py.in.

SigureMo · 2026-05-26T16:57:49Z

@ShigureNyako please cherry-pick this PR to release/3.4. Emmm, it may be a bit of a hassle, so squash it into a single commit yourself and then cherry-pick it.

ShigureNyako · 2026-05-26T17:15:32Z

As requested, I squashed the changes at the current head aa3920f0c43660700e1ba68e63a4ddb38fbc908b into a single backport commit and opened it against release/3.4:

Backport PR: [release/3.4][CUDA] Support Paddle on CUDA 13.2 #79154
Branch: ShigureNyako:cherry-pick/78720/release-3.4
Commit: 363ad9abe2d0aac09f73132862990171032594f1

Notes: the patch applies cleanly on release/3.4 with no textual conflicts. Locally I ran git diff HEAD --check, bash -n tools/dockerfile/manylinux/common/install_cuda.sh, python3 -m py_compile setup.py, and lightweight CUDA 13.2 consistency checks. A full CUDA/Paddle build is not feasible in the local environment, and Docker BuildKit --check could not be run because the local Docker daemon was not running.

Copilot AI review requested due to automatic review settings April 19, 2026 16:54

gouzil requested a review from zhangbo9674 as a code owner April 19, 2026 16:54

Copilot started reviewing on behalf of gouzil April 19, 2026 16:55

gouzil removed the request for review from zhangbo9674 April 19, 2026 16:55

gouzil and others added 4 commits April 20, 2026 01:02

feat: Add CUDA 13.2 Dockerfile

8f87fdb

Update tools/dockerfile/manylinux/Dockerfile-132

40c7e79

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

add nccl

234835a

[CUDA] adapt Paddle build and kernels for CUDA 13.2

4c1032b

gouzil force-pushed the cuda/support_cu132 branch from 1660260 to 4c1032b Compare April 19, 2026 17:02

Copilot AI reviewed Apr 19, 2026


paddle-bot Bot added the contributor External developers label Apr 19, 2026

rollback warpctc and warpnnt

f57a196

gouzil commented Apr 20, 2026


gouzil added 4 commits April 20, 2026 17:55

Merge branch 'develop' of github.com:gouzil/Paddle into cuda/support_…

d430605

…cu132

clean code

d70ff89

setup.py add cuda 13.2

dcda460

dockerfile add global.break-system-packages

225a3ee

update LD_LIBRARY_PATH

a308b5d

gouzil commented Apr 23, 2026


gouzil mentioned this pull request Apr 23, 2026

feat: Add CUDA 13.2 Dockerfile #78673

Closed

gouzil requested review from SigureMo, risemeup1 and swgu98 April 23, 2026 02:57

SigureMo reviewed Apr 23, 2026


gouzil added 3 commits April 27, 2026 08:57

Update the Dockerfile to support multi-architecture builds and fix th…

b2b5a8f

…e CUDA 13.x compatibility issue of flash-attn

Merge branch 'develop' of github.com:gouzil/Paddle into cuda/support_…

0678a56

…cu132

Add CUDA 13.2 manylinux support

5232181

SigureMo requested changes May 14, 2026


gouzil added 2 commits May 14, 2026 21:01

Merge branch 'develop' of github.com:gouzil/Paddle into cuda/support_…

a823383

…cu132

Update the URL of the flash-attention sub-module and adjust the logic…

c729efa

… for checking the CUDA version

gouzil changed the title ~~[CUDA13.2] Environment Adaptation support Paddle on CUDA 13.2~~ 【Hackathon 10th Spring No.51】[CUDA13.2] Environment Adaptation support Paddle on CUDA 13.2 May 19, 2026

gouzil changed the title ~~【Hackathon 10th Spring No.51】[CUDA13.2] Environment Adaptation support Paddle on CUDA 13.2~~ 【Hackathon 10th Spring No.51】Environment Adaptation support Paddle on CUDA 13.2 May 19, 2026

luotao1 mentioned this pull request May 20, 2026

【Hackathon 10th】Open Source Contribution Individual Challenge · Spring Festival Special Season #77429

Open

gouzil mentioned this pull request May 25, 2026

support csrc/flash_attn_with_bias_and_mask/src/fmha/smem_tile.h cuda132 build PaddlePaddle/flash-attention#153

Merged

gouzil added 4 commits May 26, 2026 00:55

[CUDA] Restrict CUDA compiler version for flash attention support

230dcc6

update Dockerfile-132 add Python 3.12.13

40ecfce

fix

607d54d

fix py version

14e8ecc

risemeup1111 suggested changes May 26, 2026


update Dockerfile-132

ab055a2


Merge branch 'develop' of github.com:gouzil/Paddle into cuda/support_…

06697b2

…cu132

risemeup1111 suggested changes May 26, 2026


empty commit

b562e19

risemeup1111 suggested changes May 26, 2026


update flash-attention to org, Dockerfile-132 add gdrcopy

aa3920f

risemeup1111 suggested changes May 26, 2026


PaddlePaddle-bot suggested changes May 26, 2026


ShigureNyako mentioned this pull request May 26, 2026

[release/3.4][CUDA] Support Paddle on CUDA 13.2 #79154

Open
