【Hackathon 10th Spring No.51】Environment Adaptation support Paddle on CUDA 13.2 #78720
gouzil wants to merge 23 commits into
Your PR was submitted successfully. Thank you for contributing to this open-source project!
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Force-pushed from 1660260 to 4c1032b.
Pull request overview
This PR advances PaddlePaddle’s CUDA 13.2 environment/toolchain adaptation by updating CUDA arch configuration, adjusting third-party NVCC flag handling for removed GPU targets, and applying workarounds for CUDA 13.x compiler issues (e.g., kernel registration / explicit instantiation).
Changes:
- Add CUDA 13.x GPU arch lists and route CUDA 13.x toolchain behavior in cmake/cuda.cmake.
- Adjust build logic to avoid CUDA 13.x build breakages by disabling unstable components/paths (flash-attn, some FP8 kernels) and updating kernel template instantiation patterns.
- Add a CUDA 13.2 manylinux build Dockerfile and refine third-party (warpctc/warprnnt) NVCC flags to drop unsupported gencode targets.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tools/dockerfile/manylinux/Dockerfile-132 | Adds a CUDA 13.2 manylinux build image with Python 3.12, toolchain utilities, and NCCL packages. |
| paddle/phi/kernels/gpu/range_kernel.cu | Replaces decltype-based explicit instantiation with standard explicit instantiation declarations for CUDA 13.x compatibility. |
| paddle/phi/kernels/gpu/arange_kernel.cu | Same explicit-instantiation adjustment as range_kernel.cu for CUDA 13.x. |
| paddle/phi/kernels/CMakeLists.txt | Skips selected FP8 CUDA kernel sources when building with CUDA ≥ 13.0 to avoid NVCC internal errors. |
| paddle/phi/core/kernel_registry.h | Introduces a CUDA 13.x-specific workaround for template instantiation during kernel registration to avoid NVCC cudafe++ crashes. |
| cmake/third_party.cmake | Disables flash-attn for CUDA 13.x default builds due to toolchain instability. |
| cmake/external/warprnnt.cmake | Filters unsupported legacy -gencode targets from NVCC flags for CUDA ≥ 13.0 when building warprnnt. |
| cmake/external/warpctc.cmake | Filters unsupported legacy -gencode targets from NVCC flags for CUDA ≥ 13.0 when building warpctc. |
| cmake/cuda.cmake | Adds CUDA 13 arch presets and selects CUDA 13-specific arch lists for NVCC flag generation. |
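The warpctc/warprnnt rows above describe filtering unsupported legacy -gencode targets out of the NVCC flags for CUDA >= 13.0. The PR does this in CMake; the Python sketch below only illustrates the idea. The helper name filter_gencode_flags and the cutoff of 75 are assumptions for illustration (75 is the lowest arch in the PR's CUDA 13 list), not values read from the PR's CMake code.

```python
import re

# Hypothetical helper: matches the compute capability in "arch=compute_XX,...".
_GENCODE_RE = re.compile(r"arch=compute_(\d+)")


def filter_gencode_flags(flags, min_cc=75):
    """Drop '-gencode arch=compute_XX,...' pairs whose XX is below min_cc."""
    kept = []
    i = 0
    while i < len(flags):
        flag = flags[i]
        if flag == "-gencode" and i + 1 < len(flags):
            match = _GENCODE_RE.search(flags[i + 1])
            if match and int(match.group(1)) < min_cc:
                i += 2  # skip the switch and its unsupported target
                continue
            kept.extend(flags[i:i + 2])
            i += 2
            continue
        kept.append(flag)
        i += 1
    return kept


flags = [
    "-O3",
    "-gencode", "arch=compute_60,code=sm_60",
    "-gencode", "arch=compute_80,code=sm_80",
]
print(filter_gencode_flags(flags))
```

Flags that are not part of a -gencode pair pass through untouched, so optimization and include flags survive the filter.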
ARG BASE_TARGET=cuda${CUDA_VERSION}

FROM nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04 as base
CUDA_VERSION / BASE_TARGET build args are defined but the base image tag is hard-coded (FROM nvcr.io/nvidia/cuda:13.2.0-...). This makes overrides like --build-arg CUDA_VERSION=... ineffective and leaves BASE_TARGET unused. Consider either wiring the args into FROM (e.g., via FROM ...:${CUDA_VERSION}...) or removing the unused args to avoid confusion in the build pipeline.
Suggested change:
  ARG BASE_TARGET=cuda${CUDA_VERSION}
- FROM nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04 as base
+ FROM nvcr.io/nvidia/cuda:${CUDA_VERSION}.0-cudnn-devel-ubuntu24.04 as base
if(${CMAKE_CUDA_COMPILER_VERSION} GREATER_EQUAL 13.0)
  message(
    STATUS
      "flash-attn is disabled for default CUDA 13.x builds because the bundled third_party/flashattn source build is not yet stable with this toolchain."
  )
For CUDA >= 13.0 the code prints that flash-attn is disabled, but it doesn't explicitly force WITH_FLASHATTN / WITH_FLASHATTN_V3 to OFF. If a user (or cached CMake value) sets WITH_FLASHATTN=ON, later if(WITH_FLASHATTN) blocks (e.g., in paddle/phi/CMakeLists.txt) can try to add dependencies on a flashattn target that was never created in this branch, causing configuration/build failures. Consider explicitly setting WITH_FLASHATTN and WITH_FLASHATTN_V3 to OFF in this branch.
- template decltype(RangeNullaryKernel<int64_t, GPUContext>) RangeNullaryKernel;
- template decltype(RangeNullaryKernel<int, GPUContext>) RangeNullaryKernel;
+ template void RangeNullaryKernel<int64_t, GPUContext>(const GPUContext&,
+                                                       const int64_t,
+                                                       const int64_t,
+                                                       const int64_t,
+                                                       DenseTensor*);
+ template void RangeNullaryKernel<int, GPUContext>(
+     const GPUContext&, const int, const int, const int, DenseTensor*);
Minimal reproduction. The same case 1 compiles with CUDA 13.0 but fails with CUDA 13.2: internal error: assertion failed at: "types.h", line 413 in rout_type_supp
template <typename Function>
struct KernelArgsParseFunctor;

template <typename Return, typename... Args>
struct KernelArgsParseFunctor<Return (*)(Args...)> {
  static void Parse(int, int) {}
};

template <typename Function, Function function>
struct KernelImpl {
  static void Compute() {}
  static void VariadicCompute() {}
};

struct KernelRegistrar {
  KernelRegistrar(const char*,
                  void (*)(int, int),
                  void (*)(int, int),
                  void (*)(),
                  void*) {}
};

template <typename T, typename Context>
void ShortKernel(Context, int) {}

#if CASE == 1
template decltype(ShortKernel<float, int>) ShortKernel<float, int>;
#elif CASE == 2
using FunctionType = decltype(ShortKernel<float, int>);
FunctionType* function_ptr = &ShortKernel<float, int>;
#elif CASE == 3
using FunctionPtrType = decltype(&ShortKernel<float, int>);
static auto* compute_ptr =
    &KernelImpl<FunctionPtrType, &ShortKernel<float, int>>::Compute;
static void* variadic_ptr = reinterpret_cast<void*>(
    &KernelImpl<FunctionPtrType, &ShortKernel<float, int>>::VariadicCompute);
#elif CASE == 4
template void ShortKernel<float, int>(int, int);
#elif CASE == 5
static void ProbeArgsDef(int, int) {}
using RegisterFunction = decltype(&ShortKernel<float, int>);
static const KernelRegistrar probe_registrar(
    "probe",
    &KernelArgsParseFunctor<RegisterFunction>::Parse,
    &ProbeArgsDef,
    &KernelImpl<RegisterFunction, &ShortKernel<float, int>>::Compute,
    reinterpret_cast<void*>(
        &KernelImpl<RegisterFunction, &ShortKernel<float, int>>::
            VariadicCompute));
#else
int unused = 0;
#endif
I built a test version. The cmake command is below (remember to change the suffix):
cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_GPU=ON \
-DWITH_SHARED_PHI=ON -DWITH_TENSORRT=ON -DWITH_OPENVINO=OFF -DWITH_ROCM=OFF -DWITH_CINN=ON \
-DWITH_DISTRIBUTE=ON -DWITH_MKL=ON -DWITH_AVX=ON -DCUDA_ARCH_NAME=Manual -DNEW_RELEASE_PYPI=OFF -DNEW_RELEASE_ALL=OFF \
-DNEW_RELEASE_JIT=OFF -DWITH_PYTHON=ON -DCUDNN_ROOT=/usr/ -DWITH_TESTING=OFF -DWITH_COVERAGE=OFF -DWITH_INCREMENTAL_COVERAGE=OFF \
-DCMAKE_MODULE_PATH=/opt/rocm/hip/cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DWITH_INFERENCE_API_TEST=OFF \
-DINFERENCE_DEMO_INSTALL_DIR=/root/.cache/inference_demo -DPY_VERSION=3.12 -DCMAKE_INSTALL_PREFIX=/paddle/build \
-DWITH_PSLIB= -DWITH_GLOO=ON -DWITH_XPU=OFF -DWITH_IPU=OFF -DXPU_SDK_ROOT= -DWITH_XPU_BKCL=OFF -DWITH_XPU_XHPC=OFF -DWITH_XPU_XFT=OFF \
-DWITH_XPU_XRE5=OFF -DWITH_XPU_FFT=OFF -DWITH_ARM=OFF -DWITH_STRIP=ON -DON_INFER=ON -DCUDA_ARCH_BIN="75 80 86 90 100 103 120" -DWITH_RECORD_BUILDTIME=OFF \
-DWITH_UNITY_BUILD=OFF -DWITH_ONNXRUNTIME=OFF -DWITH_CUDNN_FRONTEND=OFF -DWITH_CPP_TEST=OFF -DWITH_FA_BUILD_WITH_CACHE=OFF
ENV WITH_GPU=${WITH_GPU:-ON}
ENV WITH_AVX=${WITH_AVX:-ON}
ENV DEBIAN_FRONTEND=noninteractive
ENV LD_LIBRARY_PATH=/usr/local/cuda-13.2/compat:/usr/local/cuda-13.2/targets/x86_64-linux/lib:/usr/local/cuda-13.2/lib64:$LD_LIBRARY_PATH
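This is hard-coded because the base image already defines CUDA_VERSION as a three-part version number (13.2.0 here), whereas the library directories only use a two-part version (cuda-13.2). We also maintain a separate Dockerfile for each CUDA version, so hard-coding works here.
The other Dockerfiles should have the same issue.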
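The mismatch described in this reply (a three-part image version versus a two-part library directory) could also be bridged by deriving one from the other. A minimal Python sketch, assuming the install prefix follows the /usr/local/cuda-<major>.<minor> layout seen in the LD_LIBRARY_PATH line under review; cuda_lib_dirs is a hypothetical helper name, not something in the PR.

```python
def cuda_lib_dirs(cuda_version, arch="x86_64"):
    """Build library search directories from a CUDA version such as '13.2.0'."""
    # Keep only major.minor, because the install prefix has no patch component.
    major, minor = cuda_version.split(".")[:2]
    prefix = f"/usr/local/cuda-{major}.{minor}"
    return [
        f"{prefix}/compat",
        # Assumption: the targets directory is named '<arch>-linux'; the thread
        # only shows the x86_64 name, so other platforms may differ.
        f"{prefix}/targets/{arch}-linux/lib",
        f"{prefix}/lib64",
    ]


print(":".join(cuda_lib_dirs("13.2.0")))
```

Passing the architecture as a parameter keeps the x86 and Arm question raised elsewhere in this thread separate from the version question.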
ENV WITH_GPU=${WITH_GPU:-ON}
ENV WITH_AVX=${WITH_AVX:-ON}
ENV DEBIAN_FRONTEND=noninteractive
ENV LD_LIBRARY_PATH=/usr/local/cuda-13.2/compat:/usr/local/cuda-13.2/targets/x86_64-linux/lib:/usr/local/cuda-13.2/lib64:$LD_LIBRARY_PATH
Could this Dockerfile cover both x86 and Arm, so that a separate Arm copy doesn't have to be maintained?
The first cmake run intermittently fails with the error below:
CMake Error at cmake/cinn/core.cmake:27 (add_library):
Cannot find source file:
/paddle/build/paddle/cinn/hlir/dialect/operator/ir/cinn_op.cc
Tried extensions .c .C .c++ .cc .cpp .cxx .cu .mpp .m .M .mm .ixx .cppm
.ccm .cxxm .c++m .h .hh .h++ .hm .hpp .hxx .in .txx .f .F .for .f77 .f90
.f95 .f03 .hip .ispc
Call Stack (most recent call first):
paddle/cinn/hlir/dialect/operator/ir/CMakeLists.txt:61 (cinn_cc_library)
CMake Error at cmake/cinn/core.cmake:27 (add_library):
No SOURCES given to target: cinn_op_dialect
Call Stack (most recent call first):
paddle/cinn/hlir/dialect/operator/ir/CMakeLists.txt:61 (cinn_cc_library)
CMake Generate step failed. Build files cannot be regenerated correctly.
…e CUDA 13.x compatibility issue of flash-attn
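SigureMo left a comment
Review from @codex (GPT-5.5 xhigh).
The overall direction is fine, but I would block on CUDA 13.2 arch coverage first: the goal of this PR is CUDA 13.2 adaptation, yet the current default release arch set does not cover some Blackwell targets that nvcc 13.2 already supports.
Blocking:
1. The CUDA 13 default arch set in cmake/cuda.cmake contains only 75 80 86 90 100, and CUDA_ARCH_NAME=Blackwell still maps only to 100. The official CUDA 13.2 nvcc support list already includes targets such as sm_103/sm_110/sm_120/sm_121, so a CUDA 13.2 wheel built with CUDA_ARCH_NAME=All will not include these targets by default. I suggest completing the release arch strategy, or at least explicitly covering them via PTX/JIT. Reference: https://docs.nvidia.com/cuda/archive/13.2.0/pdf/CUDA_Compiler_Driver_NVCC.pdf
Non-blocking but should sync:
2. The PR description still says Draft / WIP and states that flash-attn is temporarily disabled for default CUDA 13.x builds. However, the PR is no longer a draft, and the current cmake/third_party.cmake still includes external/flashattn and turns on WITH_FLASHATTN for CUDA >= 13.0 when arch >= 80. Please sync the PR body so reviewers are not misled.
3. In Dockerfile-132, CUDA_VERSION / BASE_TARGET are still purely decorative: both FROM and LD_LIBRARY_PATH hard-code 13.2. If this is a single-version Dockerfile, delete the args; if it is meant to be reusable, actually wire them in.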
set(paddle_known_gpu_archs10 "50 52 60 61 70 75")
set(paddle_known_gpu_archs11 "50 60 61 70 75 80")
set(paddle_known_gpu_archs12 "50 60 61 70 75 80 90 100")
set(paddle_known_gpu_archs13 "75 80 86 90 100")
Review from @codex (GPT-5.5 xhigh).
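The CUDA 13 default arch set here contains only 75 80 86 90 100, and CUDA_ARCH_NAME=Blackwell further down still maps only to 100. The official CUDA 13.2 nvcc support list already includes targets such as sm_103/sm_110/sm_120/sm_121; if the CUDA 13.2 release wheel still uses CUDA_ARCH_NAME=All, these targets will not be compiled into the wheel by default. I suggest completing the CUDA 13 release arch strategy, or explicitly covering these new architectures via PTX/JIT.
Follow-up: the current head b562e19d8d4432387a58fa8fa901debfb6fe6d5c is an empty commit, and the CUDA 13 arch configuration is unchanged. cmake/cuda.cmake:22 still lists only 75 80 86 90 100, and CUDA_ARCH_NAME=Blackwell at cmake/cuda.cmake:195 still maps only to 100, so the default CUDA 13.2 release/Blackwell packages still do not cover the additional Blackwell targets noted above. Please complete the CUDA 13.2 release arch strategy, or explicitly add a PTX/JIT coverage strategy.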
ARG CUDA_VERSION=13.2
ARG BASE_TARGET=cuda${CUDA_VERSION}

FROM nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04 as base
Review from @codex (GPT-5.5 xhigh).
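CUDA_VERSION / BASE_TARGET are defined above, but FROM here still hard-codes 13.2.0, and LD_LIBRARY_PATH below also hard-codes 13.2. If this file is a single-version Dockerfile, I suggest deleting these unused args; if the configurable semantics are meant to stay, they need to be wired into FROM and the paths.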
… for checking the CUDA version
risemeup1111 left a comment
Summary: thanks for continuing the CUDA 13.2 adaptation. I still see two blockers before this can land: the release arch defaults do not cover CUDA 13.2's supported Blackwell targets, and the flash-attn submodule now depends on a personal fork/unmerged commit.
Blocking findings:
- cmake/cuda.cmake still defines the CUDA 13 release arch sets as only 75 80 86 90 100 (cmake/cuda.cmake:22, also the other CUDA 13 sets at lines 14, 30, 38, and 44), and CUDA_ARCH_NAME=Blackwell still maps only to 100 (cmake/cuda.cmake:194). CUDA 13.2 nvcc documents additional Blackwell real targets including sm_103, sm_110, sm_120, and sm_121 (https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html). With the current defaults, CUDA_ARCH_NAME=All / release builds will not emit cubins for these CUDA 13.2 targets, so e.g. SM120/SM121 devices are not actually covered by the CUDA 13.2 wheel strategy. Please either add the intended CUDA 13.2 target set or explicitly add a PTX/JIT strategy that covers them.
- .gitmodules changes third_party/flashattn from the PaddlePaddle upstream to https://github.com/gouzil/flash-attention.git (.gitmodules:73-75), and the submodule is advanced to bda9b377..., which is the head of the still-open PaddlePaddle/flash-attention#153. This makes Paddle's build depend on a contributor fork rather than a reviewed PaddlePaddle-owned dependency. Please land the flash-attention change in the PaddlePaddle repo first, then point this submodule back at https://github.com/PaddlePaddle/flash-attention.git and update to that accepted commit.
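risemeup1111 left a comment
Powered by Nyanpasu. These are AI-generated code review suggestions; maintainers should evaluate them carefully in context.
Follow-up conclusion: checks have passed on the current head 06697b25724422296ac0a8b7e4ba98e3776937ac. The commits added in this round mainly adjust the dependencies and LD_LIBRARY_PATH in Dockerfile-132, and I found no new independent blocker. However, the two blockers from the previous round remain unresolved, so this round is still REQUEST_CHANGES. To avoid duplicating existing active inline threads, no new inline comments were added on the same lines this round.
Status of the previous round's issues:
- CUDA 13.2 arch coverage: unresolved. cmake/cuda.cmake:22 is still 75 80 86 90 100, and Blackwell at cmake/cuda.cmake:195 still maps only to 100.
- flash-attn submodule source: unresolved. .gitmodules:75 still points to https://github.com/gouzil/flash-attention.git, and third_party/flashattn:1 is still bda9b377...; the associated PaddlePaddle/flash-attention#153 is still OPEN.
Priority: P0
cmake/cuda.cmake:22 and cmake/cuda.cmake:195 still do not cover the additional Blackwell targets that CUDA 13.2 supports, so release packages built with CUDA_ARCH_NAME=All / CUDA_ARCH_NAME=Blackwell still will not emit cubins for these targets by default. Please complete the CUDA 13.2 release arch strategy, or explicitly add a PTX/JIT coverage strategy. No new inline comment: an active thread PRRT_kwDOA-qtos6B-4qW already exists at this location.
Priority: P0
.gitmodules:75 and third_party/flashattn:1 still make the main Paddle repository depend on an unmerged flash-attn commit in a personal fork. Please first merge the flash-attention change into PaddlePaddle/flash-attention, then change the submodule URL back to https://github.com/PaddlePaddle/flash-attention.git and update it to the merged commit. No new inline comment: an active thread PRRT_kwDOA-qtos6EvdxN already exists at this location.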
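risemeup1111 left a comment
Powered by Nyanpasu. These are AI-generated code review suggestions; maintainers should evaluate them carefully in context.
Conclusion: the current head b562e19d8d4432387a58fa8fa901debfb6fe6d5c is an empty commit relative to 06697b25724422296ac0a8b7e4ba98e3776937ac, the head at the last review, so the code is unchanged. The two P0 blockers from the previous round remain unresolved and Check approval is still failing, so this remains REQUEST_CHANGES.
Status index:
- PRRT_kwDOA-qtos6B-4qW / cmake/cuda.cmake:22, cmake/cuda.cmake:195: unresolved. A status update was posted in the thread as reply PRRC_kwDOA-qtos7E9hfM; please complete the CUDA 13.2 release/Blackwell arch strategy or add PTX/JIT coverage.
- PRRT_kwDOA-qtos6EvdxN / .gitmodules:75, third_party/flashattn:1: unresolved. A status update was posted in the thread as reply PRRC_kwDOA-qtos7E9hfs; please first merge the official commit into PaddlePaddle/flash-attention, then point the submodule back to the official repository and the merged commit.
- CI: incomplete. Check approval is currently failing and several checks are still queued/in_progress; the full CI needs to be re-checked after the blockers are fixed.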
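risemeup1111 left a comment
Powered by Nyanpasu. These are AI-generated code review suggestions; maintainers should evaluate them carefully in context.
Conclusion: this round re-reviews head aa3920f0c43660700e1ba68e63a4ddb38fbc908b. The flash-attn submodule source issue from the previous round has been fixed. The CUDA 13.2 arch coverage issue remains unresolved, and one P0 inline review comment with the details has been added on the current diff line. Current decision: REQUEST_CHANGES.
Inline comments this round: 1 (P0). CI is still not fully green: Check approval and Coverage build are failing, and some checks are still running.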
set(paddle_known_gpu_archs10 "50 52 60 61 70 75")
set(paddle_known_gpu_archs11 "50 60 61 70 75 80")
set(paddle_known_gpu_archs12 "50 60 61 70 75 80 90 100")
set(paddle_known_gpu_archs13 "75 80 86 90 100")
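The current head still does not complete the CUDA 13.2 release/Blackwell arch strategy: the CUDA 13 release arch set here is still only 75 80 86 90 100, and CUDA_ARCH_NAME=Blackwell in the same file still maps only to 100. CUDA 13.2 nvcc already supports more Blackwell targets (such as sm_103/sm_110/sm_120/sm_121), so release packages built with CUDA_ARCH_NAME=All or Blackwell still will not emit cubins for these targets by default. Please complete the CUDA 13.2 arch list, or explicitly add an acceptable PTX/JIT coverage strategy.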
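The two options requested in this thread (listing more real targets, or covering newer devices through PTX/JIT) both map onto nvcc's -gencode syntax. A minimal Python sketch of that mapping follows. gencode_flags is a hypothetical helper, and the arch values passed to it are only illustrative; which architectures belong in the release list is a policy choice the thread leaves open.

```python
def gencode_flags(archs, ptx_fallback=True):
    """Emit one cubin target per arch, plus PTX for the highest arch if requested."""
    arch_list = sorted({int(a) for a in archs.split()})
    # code=sm_XX produces a cubin (native binary) for exactly that architecture.
    flags = [f"-gencode arch=compute_{a},code=sm_{a}" for a in arch_list]
    if ptx_fallback and arch_list:
        newest = arch_list[-1]
        # code=compute_XX embeds PTX, which the driver can JIT-compile at load
        # time on GPUs that have no matching cubin in the binary.
        flags.append(f"-gencode arch=compute_{newest},code=compute_{newest}")
    return flags


for flag in gencode_flags("75 80 86 90 100"):
    print(flag)
```

Adding cubins increases binary size and build time, while relying on PTX shifts the cost to a JIT step on first use, which is the trade-off behind the "arch list or PTX/JIT" wording in the review.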
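PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-27 10:00:00
📋 Review summary
PR overview: Adds full CUDA 13.2 environment adaptation to Paddle, including cmake GPU arch configuration, compatibility fixes for the kernel registration macros, new CUDA 13.2 branches in the Dockerfile and install scripts, and updated Python packaging dependencies.
Scope of changes: cmake/, paddle/phi/core/kernel_registry.h, paddle/phi/kernels/gpu/, setup.py, python/setup.py.in, tools/dockerfile/
Impact tags: [Environment Adaptation] [Operator Mechanism]
Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | tools/dockerfile/manylinux/common/install_cuda.sh:70 | install_cusparselt_090_cuda13 hard-codes the x86_64 path, so aarch64 builds fail |
| 🔴 Bug | tools/dockerfile/manylinux/common/install_cuda.sh:111 | install_nccl_2297_cuda132 uses yum, but Dockerfile-132 is based on Ubuntu, where yum does not exist |
| 🔴 Bug | setup.py:1398 | cuda_major_version is always None, so the CUDA 13.2 TensorRT adaptation logic never runs |
| 🟡 Suggestion | setup.py:1406 | The TensorRT list on the non-CUDA-13.2 path is missing tensorrt==10.3.0 (already present in setup.py.in) |
📝 PR convention check
All four required sections of the PR description (PR Category, PR Types, Description, and whether it changes numerical precision) are filled in with specific content and comply with the conventions. The title does not use the [Tag] format, but it carries semantic information and is acceptable.
Overall assessment
The overall approach to CUDA 13.2 adaptation in this PR is sound; the cmake arch configuration, the kernel macro compatibility fix, and the FlashAttn version gating are reasonable changes. However, there are three 🔴 P0 issues: the aarch64 cuSPARSELt download lacks a platform branch, the NCCL install script uses the wrong package manager, and cuda_major_version is never assigned, so the CUDA 13.2 TensorRT dependency logic never takes effect. I recommend merging after these are fixed.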
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.9.0.3_cuda13-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.9.0.3_cuda13-archive.tar.xz
🔴 Bug: The install_cusparselt_090_cuda13 function hard-codes the linux-x86_64 path, so the download fails on aarch64.
Current code:
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.9.0.3_cuda13-archive.tar.xz
The PR description says Dockerfile-132 supports both x86 and Arm, but this function has no platform branch. Suggested fix:
ARCH=$(uname -m)
if [ "$ARCH" = "aarch64" ]; then
  CUSPARSELT_ARCH="linux-aarch64"
  CUSPARSELT_PKG="libcusparse_lt-linux-aarch64-0.9.0.3_cuda13-archive"
else
  CUSPARSELT_ARCH="linux-x86_64"
  CUSPARSELT_PKG="libcusparse_lt-linux-x86_64-0.9.0.3_cuda13-archive"
fi
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/${CUSPARSELT_ARCH}/${CUSPARSELT_PKG}.tar.xz
function install_nccl_2297_cuda132 {
  yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
  yum install -y \
🔴 Bug: install_nccl_2297_cuda132 uses yum-config-manager and yum install, but this function is called from install_132, and Dockerfile-132 is based on Ubuntu 24.04 (an APT-based distribution). The yum command does not exist in the Ubuntu image, so the build will fail.
Dockerfile-132 already installs libnccl2=2.29.7-1+cuda13.2 in its apt-get install step. I suggest removing the call to install_nccl_2297_cuda132 from install_132, or reimplementing install_nccl_2297_cuda132 with APT:
function install_nccl_2297_cuda132 {
  apt-get update
  apt-get install -y libnccl2=2.29.7-1+cuda13.2 libnccl-dev=2.29.7-1+cuda13.2
}
)
if platform.system() == 'Linux' and cuda_major_version == '13.2':
    if not version_str and platform.machine() == 'aarch64':
        return paddle_cuda_requires, ["tensorrt-cu13==10.16.1.11"]
🔴 Bug: The cuda_major_version variable is always None (it is declared but never assigned), so the condition if platform.system() == 'Linux' and cuda_major_version == '13.2': is never true, and the CUDA 13.2 adaptation logic for TensorRT tensorrt-cu13==10.16.1.11 never runs.
After PADDLE_CUDA_INSTALL_REQUIREMENTS is populated, cuda_major_version needs to be assigned from the actual CUDA version, for example:
for cuda_ver in PADDLE_CUDA_INSTALL_REQUIREMENTS:
    # Determine the current CUDA version from an environment variable or the nvcc version
    pass
# Or read it from the build environment variables:
cuda_major_version = env_dict.get("CUDA_MAJOR_VERSION")  # e.g. '13.2'
The same problem also exists in python/setup.py.in.
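Whichever source ends up supplying the version, normalizing it before the comparison avoids a second silent mismatch, for example a value of 13.2.0 compared against '13.2'. A minimal sketch under stated assumptions: cuda_major_minor is a hypothetical helper, and the CUDA_VERSION key is an assumed name, since the thread does not confirm which variable setup.py actually receives.

```python
def cuda_major_minor(env, key="CUDA_VERSION"):
    """Return 'major.minor' from env[key], or None when it is missing or malformed."""
    raw = env.get(key)  # assumed key name, not confirmed by the PR
    if not raw:
        return None
    parts = raw.strip().split(".")
    # Require at least numeric major and minor components.
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        return None
    return ".".join(parts[:2])


print(cuda_major_minor({"CUDA_VERSION": "13.2.0"}))
print(cuda_major_minor({}))
```

Returning None for a missing or malformed value keeps the existing fallback path reachable instead of raising during packaging.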
    platform.system() == 'Windows'
    and version_default is not None
    and version_default >= 10
):
🟡 Suggestion: The PADDLE_TENSORRT_INSTALL_REQUIREMENTS list on the non-CUDA-13.2 path is missing tensorrt==10.3.0.
python/setup.py.in already includes this version (see lines 886-892 of setup.py.in), but the corresponding place in setup.py (inside the newly added elif branch) only has:
"tensorrt==8.5.3.1",
"tensorrt==8.6.0",
"tensorrt==8.6.1.post1",
# missing "tensorrt==10.3.0"
I suggest adding "tensorrt==10.3.0" to stay consistent with setup.py.in.
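@ShigureNyako please cherry-pick this PR to release/3.4. Hmm, it might be a bit of a hassle, so squash it into a single commit yourself and then cherry-pick it.
As requested, the current head
Note: the patch applies cleanly on release/3.4 with no textual conflicts; locally I have already run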
PR Category
Environment Adaptation
PR Types
Improvements
Description
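This PR advances Paddle's adaptation to CUDA 13.2. It is currently submitted as a Draft / WIP so that build and validation work can continue.
This branch mainly includes the following changes:
- Wire the 13.x toolchain branch into cuda.cmake
- Adjust the NVCC flags passed through to warpctc and warprnnt, filtering out legacy gencode targets that CUDA 13.x no longer supports
- Disable flash-attn and some FP8 kernel paths to work around known third-party build and compiler issues
- Adjust the explicit instantiation style of range/arange to bypass internal compiler errors in nvcc 13.x
- Add a 13.2 install branch in tools/dockerfile/manylinux/common/install_cuda.sh
- cuDNN, NCCL, TensorRT, and cuSPARSELt installation logic
- Add a cuda13.2 stage in tools/dockerfile/manylinux/Dockerfile that calls install_cuda.sh 13.2
- Dockerfile
To do
- Fix CUDA 13.2 flash attention build compatibility flash-attention#141
- support csrc/flash_attn_with_bias_and_mask/src/fmha/smem_tile.h cuda132 build flash-attention#153
Does this change numerical precision?
No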