
【Hackathon 10th Spring No.51】Environment Adaptation support Paddle on CUDA 13.2#78720

Open. gouzil wants to merge 23 commits into PaddlePaddle:develop from gouzil:cuda/support_cu132

gouzil (Member) commented Apr 19, 2026

PR Category

Environment Adaptation

PR Types

Improvements

Description

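This PR advances Paddle's adaptation to CUDA 13.2. It is currently submitted as a Draft / WIP so that compilation and verification work can continue.

This branch mainly contains the following changes:

  1. Add the GPU arch configuration for CUDA 13.x and wire the 13.x toolchain branch into cuda.cmake.
  2. Adjust the NVCC flags passed through to warpctc / warprnnt, filtering out legacy gencode targets that CUDA 13.x no longer supports.
  3. Temporarily disable the currently unstable flash-attn and some FP8 kernel paths in default CUDA 13.x builds, to avoid known third-party build and compiler issues.
  4. Change how kernel registration and the range/arange explicit instantiations are written, to work around internal compiler errors in nvcc 13.x.
  5. Add a 13.2 install branch to tools/dockerfile/manylinux/common/install_cuda.sh.
  6. Add the installation logic for the cuDNN, NCCL, TensorRT, and cuSPARSELt versions that match CUDA 13.2.
  7. Add a cuda13.2 stage to tools/dockerfile/manylinux/Dockerfile that calls install_cuda.sh 13.2.
  8. This PR also lets x86 and arm share the same Dockerfile.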
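
The warpctc / warprnnt flag filtering described above can be pictured with a small Python sketch. This is not the CMake code from the PR: the function name `filter_gencode_flags` is hypothetical, and the threshold of 75 is an assumption taken from the CUDA 13 arch list discussed in this PR (`75 80 86 90 100`).

```python
import re

# Assumption: 75 is the lowest compute capability still targeted, taken from
# the CUDA 13 arch list discussed in this PR ("75 80 86 90 100").
MIN_SUPPORTED_ARCH = 75

# Matches one "-gencode arch=compute_XX,code=..." entry in a flag string.
_GENCODE = re.compile(r"-gencode\s+arch=compute_(\d+)[a-z]?,code=\S+")


def filter_gencode_flags(flags, min_arch=MIN_SUPPORTED_ARCH):
    """Drop -gencode entries whose compute capability is below min_arch."""

    def keep_or_drop(match):
        return match.group(0) if int(match.group(1)) >= min_arch else ""

    filtered = _GENCODE.sub(keep_or_drop, flags)
    # Collapse the whitespace left behind by removed entries.
    return " ".join(filtered.split())
```

All non-gencode flags pass through unchanged, so the filter can be applied to a whole inherited flag string before it is handed to a third-party build.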

To do

Does this change affect numerical precision?

Copilot AI review requested due to automatic review settings April 19, 2026 16:54
@gouzil gouzil requested a review from zhangbo9674 as a code owner April 19, 2026 16:54
paddle-bot (Bot) commented Apr 19, 2026

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@gouzil gouzil removed the request for review from zhangbo9674 April 19, 2026 16:55
@gouzil gouzil force-pushed the cuda/support_cu132 branch from 1660260 to 4c1032b Compare April 19, 2026 17:02
Copilot AI (Contributor) left a comment

Pull request overview

This PR advances PaddlePaddle’s CUDA 13.2 environment/toolchain adaptation by updating CUDA arch configuration, adjusting third-party NVCC flag handling for removed GPU targets, and applying workarounds for CUDA 13.x compiler issues (e.g., kernel registration / explicit instantiation).

Changes:

  • Add CUDA 13.x GPU arch lists and route CUDA 13.x toolchain behavior in cmake/cuda.cmake.
  • Adjust build logic to avoid CUDA 13.x build breakages by disabling unstable components/paths (flash-attn, some FP8 kernels) and updating kernel template instantiation patterns.
  • Add a CUDA 13.2 manylinux build Dockerfile and refine third-party (warpctc/warprnnt) NVCC flags to drop unsupported gencode targets.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tools/dockerfile/manylinux/Dockerfile-132 | Adds a CUDA 13.2 manylinux build image with Python 3.12, toolchain utilities, and NCCL packages. |
| paddle/phi/kernels/gpu/range_kernel.cu | Replaces decltype-based explicit instantiation with standard explicit instantiation declarations for CUDA 13.x compatibility. |
| paddle/phi/kernels/gpu/arange_kernel.cu | Same explicit-instantiation adjustment as range_kernel.cu for CUDA 13.x. |
| paddle/phi/kernels/CMakeLists.txt | Skips selected FP8 CUDA kernel sources when building with CUDA ≥ 13.0 to avoid NVCC internal errors. |
| paddle/phi/core/kernel_registry.h | Introduces a CUDA 13.x-specific workaround for template instantiation during kernel registration to avoid NVCC cudafe++ crashes. |
| cmake/third_party.cmake | Disables flash-attn for CUDA 13.x default builds due to toolchain instability. |
| cmake/external/warprnnt.cmake | Filters unsupported legacy -gencode targets from NVCC flags for CUDA ≥ 13.0 when building warprnnt. |
| cmake/external/warpctc.cmake | Filters unsupported legacy -gencode targets from NVCC flags for CUDA ≥ 13.0 when building warpctc. |
| cmake/cuda.cmake | Adds CUDA 13 arch presets and selects CUDA 13-specific arch lists for NVCC flag generation. |


Comment on lines +5 to +7
ARG BASE_TARGET=cuda${CUDA_VERSION}

FROM nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04 as base

Copilot AI commented Apr 19, 2026

CUDA_VERSION / BASE_TARGET build args are defined but the base image tag is hard-coded (FROM nvcr.io/nvidia/cuda:13.2.0-...). This makes overrides like --build-arg CUDA_VERSION=... ineffective and leaves BASE_TARGET unused. Consider either wiring the args into FROM (e.g., via FROM ...:${CUDA_VERSION}...) or removing the unused args to avoid confusion in the build pipeline.

Suggested change
ARG BASE_TARGET=cuda${CUDA_VERSION}
FROM nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04 as base
FROM nvcr.io/nvidia/cuda:${CUDA_VERSION}.0-cudnn-devel-ubuntu24.04 as base

Comment thread cmake/third_party.cmake Outdated
Comment on lines +607 to +611
if(${CMAKE_CUDA_COMPILER_VERSION} GREATER_EQUAL 13.0)
  message(
    STATUS
      "flash-attn is disabled for default CUDA 13.x builds because the bundled third_party/flashattn source build is not yet stable with this toolchain."
  )

Copilot AI commented Apr 19, 2026

For CUDA >= 13.0 the code prints that flash-attn is disabled, but it doesn't explicitly force WITH_FLASHATTN / WITH_FLASHATTN_V3 to OFF. If a user (or cached CMake value) sets WITH_FLASHATTN=ON, later if(WITH_FLASHATTN) blocks (e.g., in paddle/phi/CMakeLists.txt) can try to add dependencies on a flashattn target that was never created in this branch, causing configuration/build failures. Consider explicitly setting WITH_FLASHATTN and WITH_FLASHATTN_V3 to OFF in this branch.

@paddle-bot paddle-bot Bot added the contributor External developers label Apr 19, 2026
Comment on lines -132 to +138
- template decltype(RangeNullaryKernel<int64_t, GPUContext>) RangeNullaryKernel;
- template decltype(RangeNullaryKernel<int, GPUContext>) RangeNullaryKernel;
+ template void RangeNullaryKernel<int64_t, GPUContext>(const GPUContext&,
+                                                        const int64_t,
+                                                        const int64_t,
+                                                        const int64_t,
+                                                        DenseTensor*);
+ template void RangeNullaryKernel<int, GPUContext>(
+     const GPUContext&, const int, const int, const int, DenseTensor*);

gouzil (Member, Author) commented

Minimal reproduction. The same CASE 1 compiles with CUDA 13.0 but fails with CUDA 13.2, reporting internal error: assertion failed at: "types.h", line 413 in rout_type_supp

template <typename Function>
struct KernelArgsParseFunctor;

template <typename Return, typename... Args>
struct KernelArgsParseFunctor<Return (*)(Args...)> {
  static void Parse(int, int) {}
};

template <typename Function, Function function>
struct KernelImpl {
  static void Compute() {}
  static void VariadicCompute() {}
};

struct KernelRegistrar {
  KernelRegistrar(const char*,
                  void (*)(int, int),
                  void (*)(int, int),
                  void (*)(),
                  void*) {}
};

template <typename T, typename Context>
void ShortKernel(Context, int) {}

#if CASE == 1
template decltype(ShortKernel<float, int>) ShortKernel<float, int>;
#elif CASE == 2
using FunctionType = decltype(ShortKernel<float, int>);
FunctionType* function_ptr = &ShortKernel<float, int>;
#elif CASE == 3
using FunctionPtrType = decltype(&ShortKernel<float, int>);
static auto* compute_ptr =
    &KernelImpl<FunctionPtrType, &ShortKernel<float, int>>::Compute;
static void* variadic_ptr = reinterpret_cast<void*>(
    &KernelImpl<FunctionPtrType, &ShortKernel<float, int>>::VariadicCompute);
#elif CASE == 4
template void ShortKernel<float, int>(int, int);
#elif CASE == 5
static void ProbeArgsDef(int, int) {}
using RegisterFunction = decltype(&ShortKernel<float, int>);
static const KernelRegistrar probe_registrar(
    "probe",
    &KernelArgsParseFunctor<RegisterFunction>::Parse,
    &ProbeArgsDef,
    &KernelImpl<RegisterFunction, &ShortKernel<float, int>>::Compute,
    reinterpret_cast<void*>(
        &KernelImpl<RegisterFunction, &ShortKernel<float, int>>::
            VariadicCompute));
#else
int unused = 0;
#endif
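
To compare the five branches of the reproduction across toolkits, a small driver along these lines could compile each `CASE` separately. It is a sketch only: the file name `repro.cu` and the helper names are hypothetical, and it assumes `nvcc` is on `PATH`.

```python
import shutil
import subprocess

# Hypothetical file name for the reproduction source shown above.
REPRO_SOURCE = "repro.cu"


def nvcc_command(case, source=REPRO_SOURCE):
    """Build an nvcc command line that compiles one CASE branch to an object file."""
    return ["nvcc", f"-DCASE={case}", "-c", source, "-o", f"repro_case{case}.o"]


def run_all_cases(cases=range(1, 6)):
    """Compile every CASE and report which ones succeed (True) or fail (False)."""
    if shutil.which("nvcc") is None:
        raise RuntimeError("nvcc not found on PATH")
    results = {}
    for case in cases:
        completed = subprocess.run(
            nvcc_command(case), capture_output=True, text=True
        )
        results[case] = completed.returncode == 0
    return results
```

Running `run_all_cases()` once per installed toolkit gives a pass/fail table per `CASE`, which makes it easier to see which spelling triggers the internal error.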

gouzil (Member, Author) commented Apr 22, 2026

I built a test version with the cmake command below (remember to fix the file suffix of the downloaded wheel):

cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_GPU=ON \
    -DWITH_SHARED_PHI=ON -DWITH_TENSORRT=ON -DWITH_OPENVINO=OFF -DWITH_ROCM=OFF -DWITH_CINN=ON \
    -DWITH_DISTRIBUTE=ON -DWITH_MKL=ON -DWITH_AVX=ON -DCUDA_ARCH_NAME=Manual -DNEW_RELEASE_PYPI=OFF -DNEW_RELEASE_ALL=OFF \
    -DNEW_RELEASE_JIT=OFF -DWITH_PYTHON=ON -DCUDNN_ROOT=/usr/ -DWITH_TESTING=OFF -DWITH_COVERAGE=OFF -DWITH_INCREMENTAL_COVERAGE=OFF \
    -DCMAKE_MODULE_PATH=/opt/rocm/hip/cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DWITH_INFERENCE_API_TEST=OFF \
    -DINFERENCE_DEMO_INSTALL_DIR=/root/.cache/inference_demo -DPY_VERSION=3.12 -DCMAKE_INSTALL_PREFIX=/paddle/build \
    -DWITH_PSLIB= -DWITH_GLOO=ON -DWITH_XPU=OFF -DWITH_IPU=OFF -DXPU_SDK_ROOT= -DWITH_XPU_BKCL=OFF -DWITH_XPU_XHPC=OFF -DWITH_XPU_XFT=OFF \
    -DWITH_XPU_XRE5=OFF -DWITH_XPU_FFT=OFF -DWITH_ARM=OFF -DWITH_STRIP=ON -DON_INFER=ON -DCUDA_ARCH_BIN="75 80 86 90 100 103 120" -DWITH_RECORD_BUILDTIME=OFF \
    -DWITH_UNITY_BUILD=OFF -DWITH_ONNXRUNTIME=OFF -DWITH_CUDNN_FRONTEND=OFF -DWITH_CPP_TEST=OFF -DWITH_FA_BUILD_WITH_CACHE=OFF

https://github.com/gouzil/Paddle/releases/download/v3.5-cu132/paddlepaddle_gpu-3.5.0.dev20260421-cp312-cp312-linux_x86_64.whl.1

ENV WITH_GPU=${WITH_GPU:-ON}
ENV WITH_AVX=${WITH_AVX:-ON}
ENV DEBIAN_FRONTEND=noninteractive
ENV LD_LIBRARY_PATH=/usr/local/cuda-13.2/compat:/usr/local/cuda-13.2/targets/x86_64-linux/lib:/usr/local/cuda-13.2/lib64:$LD_LIBRARY_PATH

gouzil (Member, Author) commented

The path is hard-coded here because the base image already defines CUDA_VERSION as a three-part version number, whereas the library directories only carry a two-part version. We also maintain a separate Dockerfile for each CUDA version, so hard-coding is acceptable here.

The other Dockerfiles should have the same issue.

Related link: https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/13.2.1/ubuntu2404/base/Dockerfile?ref_type=heads#L29
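
The mismatch between a three-part version string and a two-part install directory can be made concrete with a short sketch. The helper `toolkit_dir` is hypothetical and only illustrates the mapping; it is not part of the PR.

```python
def toolkit_dir(cuda_version, prefix="/usr/local"):
    """Map a version such as '13.2.0' or '13.2' to '<prefix>/cuda-13.2'."""
    parts = cuda_version.split(".")
    if len(parts) < 2 or not all(part.isdigit() for part in parts):
        raise ValueError(f"unexpected CUDA version: {cuda_version!r}")
    major, minor = parts[0], parts[1]
    # Only major.minor appears in the directory name; the patch level is dropped.
    return f"{prefix}/cuda-{major}.{minor}"
```

With a helper like this, a value such as `13.2.0` could in principle be reused for the library path, although a per-version Dockerfile with hard-coded paths avoids the extra indirection.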

ENV WITH_GPU=${WITH_GPU:-ON}
ENV WITH_AVX=${WITH_AVX:-ON}
ENV DEBIAN_FRONTEND=noninteractive
ENV LD_LIBRARY_PATH=/usr/local/cuda-13.2/compat:/usr/local/cuda-13.2/targets/x86_64-linux/lib:/usr/local/cuda-13.2/lib64:$LD_LIBRARY_PATH

Member commented

Could this Dockerfile cover both x86 and arm, so that arm does not need a separately maintained copy?
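
One obstacle to sharing the file is the `targets/x86_64-linux` component in `LD_LIBRARY_PATH`. A minimal sketch of selecting that component per machine type follows; the mapping table is an assumption (in particular `sbsa-linux` for Arm servers should be checked against the installed toolkit layout), and `cuda_library_path` is a hypothetical helper.

```python
import platform

# Assumption: name of the toolkit "targets" subdirectory per machine type.
# Verify "sbsa-linux" against the actual image before relying on it.
CUDA_TARGET_DIRS = {
    "x86_64": "x86_64-linux",
    "aarch64": "sbsa-linux",
}


def cuda_library_path(machine=None, cuda_root="/usr/local/cuda-13.2"):
    """Compose the CUDA part of LD_LIBRARY_PATH for the given machine type."""
    machine = machine or platform.machine()
    if machine not in CUDA_TARGET_DIRS:
        raise ValueError(f"unsupported machine type: {machine!r}")
    target = CUDA_TARGET_DIRS[machine]
    return ":".join(
        [
            f"{cuda_root}/compat",
            f"{cuda_root}/targets/{target}/lib",
            f"{cuda_root}/lib64",
        ]
    )
```

For `x86_64` this reproduces the path in the Dockerfile excerpt above (without the trailing `$LD_LIBRARY_PATH`).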

gouzil (Member, Author) commented Apr 24, 2026

The first cmake run may occasionally fail with the error below:

CMake Error at cmake/cinn/core.cmake:27 (add_library):
  Cannot find source file:

    /paddle/build/paddle/cinn/hlir/dialect/operator/ir/cinn_op.cc

  Tried extensions .c .C .c++ .cc .cpp .cxx .cu .mpp .m .M .mm .ixx .cppm
  .ccm .cxxm .c++m .h .hh .h++ .hm .hpp .hxx .in .txx .f .F .for .f77 .f90
  .f95 .f03 .hip .ispc
Call Stack (most recent call first):
  paddle/cinn/hlir/dialect/operator/ir/CMakeLists.txt:61 (cinn_cc_library)


CMake Error at cmake/cinn/core.cmake:27 (add_library):
  No SOURCES given to target: cinn_op_dialect
Call Stack (most recent call first):
  paddle/cinn/hlir/dialect/operator/ir/CMakeLists.txt:61 (cinn_cc_library)


CMake Generate step failed.  Build files cannot be regenerated correctly.

SigureMo (Member) left a comment

Review from @codex (GPT-5.5 xhigh).

The overall direction is OK, but I would block on the CUDA 13.2 arch coverage first: this PR targets CUDA 13.2 adaptation, yet the default release arch set does not cover some Blackwell targets that nvcc 13.2 already supports.

Blocking:

  1. The default CUDA 13 arch set in cmake/cuda.cmake is only 75 80 86 90 100, and CUDA_ARCH_NAME=Blackwell still maps only to 100. The official CUDA 13.2 nvcc support list already includes targets such as sm_103/sm_110/sm_120/sm_121, so a CUDA 13.2 wheel built with CUDA_ARCH_NAME=All will not include these targets by default. Please complete the release arch strategy, or at least state explicitly that PTX/JIT covers them. Reference: https://docs.nvidia.com/cuda/archive/13.2.0/pdf/CUDA_Compiler_Driver_NVCC.pdf

Non-blocking but should sync:

  2. The PR description still says Draft / WIP and that flash-attn is temporarily disabled in default CUDA 13.x builds. However, the PR is no longer a draft, and the current cmake/third_party.cmake still includes external/flashattn and turns on WITH_FLASHATTN for CUDA >= 13.0 when arch >= 80. Please sync the PR body so that reviewers are not misled.
  3. In Dockerfile-132, CUDA_VERSION / BASE_TARGET are still decorative: both FROM and LD_LIBRARY_PATH hard-code 13.2. If this is a single-version Dockerfile, delete the args; if it is meant to be reusable, actually wire the args in.

Comment thread cmake/cuda.cmake
set(paddle_known_gpu_archs10 "50 52 60 61 70 75")
set(paddle_known_gpu_archs11 "50 60 61 70 75 80")
set(paddle_known_gpu_archs12 "50 60 61 70 75 80 90 100")
set(paddle_known_gpu_archs13 "75 80 86 90 100")

Member commented

Review from @codex (GPT-5.5 xhigh).

The default CUDA 13 arch set here is only 75 80 86 90 100, and CUDA_ARCH_NAME=Blackwell further down still maps only to 100. The official CUDA 13.2 nvcc support list already includes targets such as sm_103/sm_110/sm_120/sm_121; if the CUDA 13.2 release wheel keeps using CUDA_ARCH_NAME=All, these targets will not be compiled into the wheel by default. Please complete the CUDA 13 release arch strategy, or explicitly cover these new architectures through PTX/JIT.


Member commented

@risemeup1 @swgu98 please confirm this; I think at least 103 needs to be added.


Priority: P0

Follow-up: the current head b562e19d8d4432387a58fa8fa901debfb6fe6d5c is an empty commit, and the CUDA 13 arch configuration has not changed. cmake/cuda.cmake:22 still has only 75 80 86 90 100, and at cmake/cuda.cmake:195 CUDA_ARCH_NAME=Blackwell still maps only to 100, so the default CUDA 13.2 release/Blackwell packages still do not cover the additional Blackwell targets pointed out earlier. Please complete the CUDA 13.2 release arch strategy, or explicitly add a PTX/JIT coverage strategy.
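
The two options the reviewers keep asking for (listing the extra targets, or adding a PTX fallback) can be sketched as follows. This is an illustration under stated assumptions, not Paddle's CMake logic: the helper names are hypothetical, the default list is copied from `paddle_known_gpu_archs13` above, and the requested targets are the ones named in the review.

```python
# Copied from the cmake/cuda.cmake excerpt in this thread.
DEFAULT_CUDA13_ARCHS = "75 80 86 90 100"
# Targets named by the reviewers as supported by nvcc 13.2.
REQUESTED_TARGETS = "103 110 120 121"


def missing_archs(default=DEFAULT_CUDA13_ARCHS, requested=REQUESTED_TARGETS):
    """Return requested compute capabilities that the default list lacks."""
    have = {int(arch) for arch in default.split()}
    return sorted(int(arch) for arch in requested.split() if int(arch) not in have)


def gencode_flags(archs, ptx_fallback=True):
    """Emit one cubin per listed arch, plus optional PTX for the newest one."""
    values = sorted({int(arch) for arch in archs.split()})
    flags = [f"-gencode arch=compute_{a},code=sm_{a}" for a in values]
    if ptx_fallback and values:
        newest = values[-1]
        # code=compute_XX embeds PTX, which the driver can JIT-compile
        # on newer devices that have no matching cubin.
        flags.append(f"-gencode arch=compute_{newest},code=compute_{newest}")
    return flags
```

Whether a PTX fallback alone is acceptable for release wheels (JIT cost at first use, no arch-specific features) is a policy decision for the maintainers; the sketch only shows what each option would emit.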

ARG CUDA_VERSION=13.2
ARG BASE_TARGET=cuda${CUDA_VERSION}

FROM nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04 as base

Member commented

Review from @codex (GPT-5.5 xhigh).

CUDA_VERSION / BASE_TARGET are defined above, but FROM here still hard-codes 13.2.0 and LD_LIBRARY_PATH below also hard-codes 13.2. If this file is a single-version Dockerfile, please delete the unused args; if the configurable semantics should be kept, wire them into FROM and the paths for real.

@gouzil gouzil changed the title [CUDA13.2] Environment Adaptation support Paddle on CUDA 13.2 【Hackathon 10th Spring No.51】[CUDA13.2] Environment Adaptation support Paddle on CUDA 13.2 May 19, 2026
@gouzil gouzil changed the title 【Hackathon 10th Spring No.51】[CUDA13.2] Environment Adaptation support Paddle on CUDA 13.2 【Hackathon 10th Spring No.51】Environment Adaptation support Paddle on CUDA 13.2 May 19, 2026
risemeup1111 left a comment

Summary: thanks for continuing the CUDA 13.2 adaptation. I still see two blockers before this can land: the release arch defaults do not cover CUDA 13.2's supported Blackwell targets, and the flash-attn submodule now depends on a personal fork/unmerged commit.

Blocking findings:

  1. cmake/cuda.cmake still defines the CUDA 13 release arch sets as only 75 80 86 90 100 (cmake/cuda.cmake:22, also the other CUDA 13 sets at lines 14, 30, 38, and 44), and CUDA_ARCH_NAME=Blackwell still maps only to 100 (cmake/cuda.cmake:194). CUDA 13.2 nvcc documents additional Blackwell real targets including sm_103, sm_110, sm_120, and sm_121 (https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html). With the current defaults, CUDA_ARCH_NAME=All / release builds will not emit cubins for these CUDA 13.2 targets, so e.g. SM120/SM121 devices are not actually covered by the CUDA 13.2 wheel strategy. Please either add the intended CUDA 13.2 target set or explicitly add a PTX/JIT strategy that covers them.

  2. .gitmodules changes third_party/flashattn from the PaddlePaddle upstream to https://github.com/gouzil/flash-attention.git (.gitmodules:73-75), and the submodule is advanced to bda9b377..., which is the head of the still-open PaddlePaddle/flash-attention#153. This makes Paddle's build depend on a contributor fork rather than a reviewed PaddlePaddle-owned dependency. Please land the flash-attention change in the PaddlePaddle repo first, then point this submodule back at https://github.com/PaddlePaddle/flash-attention.git and update to that accepted commit.
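
A check of the kind implied by finding 2 could be automated with a few lines of Python. This is a hedged sketch: `submodule_urls` and `is_owned_by` are hypothetical helpers, and the parser only handles the simple `[submodule "name"]` / `url = ...` layout.

```python
import re


def submodule_urls(gitmodules_text):
    """Parse .gitmodules text into a {submodule name: url} mapping."""
    urls = {}
    current = None
    for raw_line in gitmodules_text.splitlines():
        line = raw_line.strip()
        header = re.fullmatch(r'\[submodule "(.+)"\]', line)
        if header:
            current = header.group(1)
        elif current and re.match(r"url\s*=", line):
            urls[current] = line.partition("=")[2].strip()
    return urls


def is_owned_by(gitmodules_text, name, owner="PaddlePaddle"):
    """True if the named submodule points at github.com/<owner>/..."""
    url = submodule_urls(gitmodules_text).get(name, "")
    return url.startswith(f"https://github.com/{owner}/")
```

Such a check is deliberately scoped to one named submodule, since a repository can legitimately vendor dependencies from other organizations.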

PaddlePaddle-bot

This comment was marked as outdated.

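risemeup1111 left a comment

Powered by Nyanpasu. These are AI-generated code review suggestions; maintainers should evaluate them carefully in context.

Follow-up conclusion: checks have passed on the current head 06697b25724422296ac0a8b7e4ba98e3776937ac. The new commits in this round mainly adjust the dependencies and LD_LIBRARY_PATH in Dockerfile-132, and I found no new independent blocker. However, the two blockers from the previous round are still unresolved, so this round remains REQUEST_CHANGES. To avoid duplicating existing active inline threads, no new inline comments were added on the same lines.

Status of the previous issues:

  • CUDA 13.2 arch coverage: unresolved. cmake/cuda.cmake:22 is still 75 80 86 90 100, and at cmake/cuda.cmake:195 Blackwell still maps only to 100.
  • flash-attn submodule source: unresolved. .gitmodules:75 still points to https://github.com/gouzil/flash-attention.git, third_party/flashattn:1 is still bda9b377..., and the associated PaddlePaddle/flash-attention#153 is still OPEN.

Priority: P0. cmake/cuda.cmake:22 and cmake/cuda.cmake:195 still do not cover the additional Blackwell targets that CUDA 13.2 supports, so release packages built with CUDA_ARCH_NAME=All / CUDA_ARCH_NAME=Blackwell still do not produce cubins for these targets by default. Please complete the CUDA 13.2 release arch strategy, or explicitly add a PTX/JIT coverage strategy. No new inline comment: an active thread already exists at this location (PRRT_kwDOA-qtos6B-4qW).

Priority: P0. .gitmodules:75 and third_party/flashattn:1 still make the main Paddle repository depend on an unmerged flash-attn commit in a personal fork. Please merge the flash-attention change into PaddlePaddle/flash-attention first, then point the submodule URL back to https://github.com/PaddlePaddle/flash-attention.git and update it to the merged commit. No new inline comment: an active thread already exists at this location (PRRT_kwDOA-qtos6EvdxN).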


risemeup1111 left a comment

Powered by Nyanpasu. These are AI-generated code review suggestions; maintainers should evaluate them carefully in context.

Conclusion: compared with 06697b25724422296ac0a8b7e4ba98e3776937ac from the previous review, the current head b562e19d8d4432387a58fa8fa901debfb6fe6d5c is an empty commit and the code has not changed. The two P0 blockers from the previous round are still unresolved and Check approval is still failing, so this remains REQUEST_CHANGES.

Status index:

  • P0 PRRT_kwDOA-qtos6B-4qW / cmake/cuda.cmake:22, cmake/cuda.cmake:195: unresolved. Status was added in thread reply PRRC_kwDOA-qtos7E9hfM; please complete the CUDA 13.2 release/Blackwell arch strategy or add PTX/JIT coverage.
  • P0 PRRT_kwDOA-qtos6EvdxN / .gitmodules:75, third_party/flashattn:1: unresolved. Status was added in thread reply PRRC_kwDOA-qtos7E9hfs; please merge the official PaddlePaddle/flash-attention commit first, then point the submodule back to the official repository and the merged commit.
  • CI: incomplete. Check approval is currently failing and several checks are still queued/in_progress; the full CI needs to be observed again once the blockers are fixed.


risemeup1111 left a comment

Powered by Nyanpasu. These are AI-generated code review suggestions; maintainers should evaluate them carefully in context.

Conclusion: this round re-reviews head aa3920f0c43660700e1ba68e63a4ddb38fbc908b. The flash-attn submodule source issue from the previous round has been fixed. The CUDA 13.2 arch coverage issue is still unresolved; one new P0 inline review comment was added on the current diff line, see the inline comment for details. Current decision: REQUEST_CHANGES.

Inline comments this round: 1 (P0). CI is not fully green yet: Check approval and Coverage build are failing, and some checks are still running.

Comment thread cmake/cuda.cmake
set(paddle_known_gpu_archs10 "50 52 60 61 70 75")
set(paddle_known_gpu_archs11 "50 60 61 70 75 80")
set(paddle_known_gpu_archs12 "50 60 61 70 75 80 90 100")
set(paddle_known_gpu_archs13 "75 80 86 90 100")

Priority: P0

The current head still does not complete the CUDA 13.2 release/Blackwell arch strategy: the CUDA 13 release arch list here is still only 75 80 86 90 100, and CUDA_ARCH_NAME=Blackwell in the same file still maps only to 100. CUDA 13.2 nvcc supports more Blackwell targets (such as sm_103/sm_110/sm_120/sm_121), so release packages built with CUDA_ARCH_NAME=All or Blackwell still do not produce cubins for these targets by default. Please complete the CUDA 13.2 arch list, or explicitly add an acceptable PTX/JIT coverage strategy.

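PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-27 10:00:00

📋 Review summary

PR overview: Adds full CUDA 13.2 environment adaptation to Paddle, including the cmake GPU arch configuration, compatibility fixes for the kernel registration macros, new CUDA 13.2 branches in the Dockerfile and install script, and updated Python packaging dependencies.
Scope: cmake/, paddle/phi/core/kernel_registry.h, paddle/phi/kernels/gpu/, setup.py, python/setup.py.in, tools/dockerfile/
Impact tags: [Environment Adaptation] [Operator Mechanism]

Issues

| Level | File | Summary |
| --- | --- | --- |
| 🔴 Bug | tools/dockerfile/manylinux/common/install_cuda.sh:70 | install_cusparselt_090_cuda13 hard-codes the x86_64 path, so aarch64 builds fail |
| 🔴 Bug | tools/dockerfile/manylinux/common/install_cuda.sh:111 | install_nccl_2297_cuda132 uses yum, but Dockerfile-132 is based on Ubuntu, where yum does not exist |
| 🔴 Bug | setup.py:1398 | cuda_major_version is always None, so the CUDA 13.2 TensorRT adaptation logic never runs |
| 🟡 Suggestion | setup.py:1406 | The TensorRT list on the non-CUDA 13.2 path is missing tensorrt==10.3.0 (setup.py.in already has it) |

📝 PR convention check

All four required sections of the PR description (PR Category, PR Types, Description, and whether precision changes) are filled in with specific content and comply with the conventions. The title does not use the [Tag] format but carries semantic information, which is acceptable.

Overall assessment

The overall approach of this PR to CUDA 13.2 adaptation is sound, and the cmake arch configuration, the kernel macro compatibility fix, and the FlashAttn version gating are reasonable. However, there are 3 🔴 P0 issues: the aarch64 cuSPARSELt download lacks a platform branch, the NCCL install script uses the wrong package manager, and cuda_major_version is never assigned, so the CUDA 13.2 TensorRT dependency logic never takes effect. Merging is recommended after these are fixed.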

# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.9.0.3_cuda13-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.9.0.3_cuda13-archive.tar.xz

🔴 Bug: The install_cusparselt_090_cuda13 function hard-codes the linux-x86_64 path, so the download fails on aarch64.

Current code:

wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.9.0.3_cuda13-archive.tar.xz

The PR description says Dockerfile-132 supports both x86 and arm, but this function has no platform branch. Suggested fix:

ARCH=$(uname -m)
if [ "$ARCH" = "aarch64" ]; then
    CUSPARSELT_ARCH="linux-aarch64"
    CUSPARSELT_PKG="libcusparse_lt-linux-aarch64-0.9.0.3_cuda13-archive"
else
    CUSPARSELT_ARCH="linux-x86_64"
    CUSPARSELT_PKG="libcusparse_lt-linux-x86_64-0.9.0.3_cuda13-archive"
fi
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/${CUSPARSELT_ARCH}/${CUSPARSELT_PKG}.tar.xz
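
The same selection can be written table-driven, which makes an unknown machine type fail loudly instead of silently falling back to x86_64. The sketch below is illustrative only: `cusparselt_url` is a hypothetical helper, and the `linux-aarch64` directory name is copied from the suggestion above without having been checked against the redist index.

```python
import platform

CUSPARSELT_REDIST = (
    "https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt"
)
CUSPARSELT_VERSION = "0.9.0.3_cuda13"

# Assumption: redist directory name per machine type. "linux-aarch64" is taken
# from the review suggestion and should be verified before use.
REDIST_PLATFORMS = {"x86_64": "linux-x86_64", "aarch64": "linux-aarch64"}


def cusparselt_url(machine=None, platforms=REDIST_PLATFORMS):
    """Build the cuSPARSELt archive URL for the given machine type."""
    machine = machine or platform.machine()
    if machine not in platforms:
        raise ValueError(f"no cuSPARSELt archive known for {machine!r}")
    plat = platforms[machine]
    archive = f"libcusparse_lt-{plat}-{CUSPARSELT_VERSION}-archive.tar.xz"
    return f"{CUSPARSELT_REDIST}/{plat}/{archive}"
```

Passing a different `platforms` mapping is enough to correct a directory name once it has been confirmed, without touching the URL logic.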


function install_nccl_2297_cuda132 {
yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
yum install -y \

🔴 Bug: install_nccl_2297_cuda132 uses yum-config-manager and yum install, but it is called from install_132, and Dockerfile-132 is based on Ubuntu 24.04 (an APT-based distribution). The yum command does not exist in the Ubuntu image, so the build will fail.

Dockerfile-132 already installs libnccl2=2.29.7-1+cuda13.2 in its apt-get install step. Either remove the call to install_nccl_2297_cuda132 from install_132, or reimplement install_nccl_2297_cuda132 with APT:

function install_nccl_2297_cuda132 {
    apt-get update
    apt-get install -y libnccl2=2.29.7-1+cuda13.2 libnccl-dev=2.29.7-1+cuda13.2
}
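
A guard of the following shape would surface this kind of mismatch early, before any install step runs. It is a sketch only: both helper names are hypothetical, and the lookup function is injectable so the logic can be exercised without touching the real system.

```python
import shutil


def detect_package_manager(which=shutil.which):
    """Return the first package manager found on PATH, or None."""
    for name in ("apt-get", "dnf", "yum"):
        if which(name):
            return name
    return None


def require_package_manager(expected, which=shutil.which):
    """Fail early if the image does not provide the expected package manager."""
    found = detect_package_manager(which)
    if found != expected:
        raise RuntimeError(
            f"expected package manager {expected!r}, found {found!r}"
        )
    return found
```

On an Ubuntu-based image `require_package_manager("yum")` would raise immediately, which is easier to diagnose than a `command not found` in the middle of a long build.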

Comment thread setup.py
)
if platform.system() == 'Linux' and cuda_major_version == '13.2':
    if not version_str and platform.machine() == 'aarch64':
        return paddle_cuda_requires, ["tensorrt-cu13==10.16.1.11"]

🔴 Bug: The cuda_major_version variable is always None (it is declared but never assigned), so the condition if platform.system() == 'Linux' and cuda_major_version == '13.2': is never true and the CUDA 13.2 TensorRT adaptation (tensorrt-cu13==10.16.1.11) never runs.

After PADDLE_CUDA_INSTALL_REQUIREMENTS is populated, cuda_major_version needs to be assigned from the actual CUDA version, for example:

for cuda_ver in PADDLE_CUDA_INSTALL_REQUIREMENTS:
    # Determine the current CUDA version from an environment variable or the nvcc version
    pass
# Or read it from the build environment variables:
cuda_major_version = env_dict.get("CUDA_MAJOR_VERSION")  # e.g. '13.2'

The same issue also exists in python/setup.py.in.
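
One way to obtain a value such as `'13.2'` is to parse the `release X.Y` token that `nvcc --version` prints. The sketch below assumes that output format and uses hypothetical helper names; it is not the logic in setup.py.

```python
import re
import subprocess


def parse_nvcc_release(output):
    """Extract 'major.minor' from nvcc --version output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def detect_cuda_version(nvcc="nvcc"):
    """Run nvcc --version and return 'major.minor', or None if nvcc is unusable."""
    try:
        completed = subprocess.run(
            [nvcc, "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(completed.stdout)
```

Keeping the parsing separate from the subprocess call means the string handling can be tested without a CUDA toolkit installed.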

Comment thread setup.py
platform.system() == 'Windows'
and version_default is not None
and version_default >= 10
):

🟡 Suggestion: The PADDLE_TENSORRT_INSTALL_REQUIREMENTS list on the non-CUDA 13.2 path is missing tensorrt==10.3.0.

python/setup.py.in already includes this version (see lines 886-892 of setup.py.in), but the corresponding place in setup.py (inside the newly added elif branch) only has:

"tensorrt==8.5.3.1",
"tensorrt==8.6.0",
"tensorrt==8.6.1.post1",
# missing "tensorrt==10.3.0"

Please add "tensorrt==10.3.0" to keep it consistent with setup.py.in.
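
Drift between the two requirement lists can be detected mechanically. In the sketch below, `missing_pins` is a hypothetical helper, and the two example lists are reconstructed from the review comment above rather than read from the actual files.

```python
def missing_pins(reference, candidate):
    """Return pins present in reference but absent from candidate, in order."""
    present = set(candidate)
    return [pin for pin in reference if pin not in present]


# Lists as described in the review comment (assumed, not read from the repo).
SETUP_PY_IN_PINS = [
    "tensorrt==8.5.3.1",
    "tensorrt==8.6.0",
    "tensorrt==8.6.1.post1",
    "tensorrt==10.3.0",
]
SETUP_PY_PINS = [
    "tensorrt==8.5.3.1",
    "tensorrt==8.6.0",
    "tensorrt==8.6.1.post1",
]
```

Calling `missing_pins(SETUP_PY_IN_PINS, SETUP_PY_PINS)` reports the pin that setup.py lacks; running the comparison in both directions catches drift either way.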

SigureMo (Member) commented

@ShigureNyako please cherry-pick this PR to release/3.4. Hmm, that may be a bit troublesome; squash it into a single commit yourself and then cherry-pick it.

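ShigureNyako (Contributor) commented

As requested, I squashed the changes at the current head aa3920f0c43660700e1ba68e63a4ddb38fbc908b into a single backport commit and opened it against release/3.4.

Notes: the patch applies cleanly on release/3.4 with no textual conflicts. Locally I ran git diff HEAD --check, bash -n tools/dockerfile/manylinux/common/install_cuda.sh, python3 -m py_compile setup.py, and some lightweight CUDA 13.2 consistency checks. A full CUDA/Paddle build is not feasible in my local environment, and Docker BuildKit --check could not be run because the local Docker daemon is not running.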
