feat(ROCm): Add BF16 support for conv kernels on HIP/ROCm #47

Merged
M4jupitercannon merged 2 commits into ROCm:develop from fchange:hip-bf16-conv-support on Apr 14, 2026
fchange commented Apr 13, 2026

Description

This PR adds bfloat16 (BF16) data type support for convolution kernels on AMD ROCm/HIP GPUs.

Problem

The PaddleOCR-VL model uses BF16 precision, but the native HIP/ROCm backend fails because conv kernels are not registered for BF16. This blocks running PaddleOCR-VL with the native backend on AMD GPUs.

Changes

1. paddle/phi/backends/gpu/rocm/miopen_desc.h

  • Added a BFLOAT16 case to ToCudnnDataType() that maps it to miopenBFloat16

2. paddle/phi/kernels/gpudnn/conv_kernel.cu

  • Registered phi::bfloat16 for conv2d kernel
  • Registered phi::bfloat16 for conv3d kernel
  • Registered phi::bfloat16 for depthwise_conv2d kernel

3. paddle/phi/kernels/gpudnn/conv_grad_kernel.cu

  • Registered phi::bfloat16 for conv2d_grad kernel
  • Registered phi::bfloat16 for conv3d_grad kernel
  • Registered phi::bfloat16 for conv2d_double_grad kernel
  • Registered phi::bfloat16 for conv3d_double_grad kernel
  • Registered phi::bfloat16 for depthwise_conv2d_double_grad kernel

4. test/legacy_test/test_hip_bf16_conv_kernel.py (new)

  • Added unit tests for BF16 conv2d forward and grouped conv on HIP
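
The failure mode and the fix can be pictured with a toy model. The sketch below is not Paddle's registration machinery (the actual change is in the C++ kernel files and the MIOpen descriptor header listed above); `DTYPE_TO_MIOPEN`, `KERNELS`, `register_kernel`, `dispatch`, and `try_dispatch` are hypothetical names, and the set of dtypes registered "before" is an assumption. It only illustrates why a BF16 call fails until both the dtype mapping and the per-kernel registration exist.

```python
# Toy model of dtype-keyed kernel dispatch. All names here are hypothetical
# and do not correspond to Paddle's real registration code.

# Framework dtype -> MIOpen data type name (the miopen_desc.h side).
DTYPE_TO_MIOPEN = {
    "float32": "miopenFloat",
    "float16": "miopenHalf",
}

# (kernel name, dtype) -> implementation (the conv_kernel.cu side).
KERNELS = {}


def register_kernel(name, dtypes, fn):
    """Register one implementation for every dtype in `dtypes`."""
    for dtype in dtypes:
        KERNELS[(name, dtype)] = fn


def dispatch(name, dtype):
    """Return (implementation, MIOpen type name) or raise if either is missing."""
    if (name, dtype) not in KERNELS:
        raise NotImplementedError(f"kernel {name!r} is not registered for {dtype}")
    if dtype not in DTYPE_TO_MIOPEN:
        raise NotImplementedError(f"no MIOpen data type mapped for {dtype}")
    return KERNELS[(name, dtype)], DTYPE_TO_MIOPEN[dtype]


def try_dispatch(name, dtype):
    """Return "ok" on success, otherwise the error message."""
    try:
        dispatch(name, dtype)
        return "ok"
    except NotImplementedError as exc:
        return str(exc)


def conv2d_stub(*args):
    """Placeholder for a real convolution implementation."""
    return "conv2d result"


# Assumed state before the change: no BF16 registration, no BF16 mapping.
register_kernel("conv2d", ["float32", "float16"], conv2d_stub)
before = try_dispatch("conv2d", "bfloat16")

# State after the change: both pieces are present.
DTYPE_TO_MIOPEN["bfloat16"] = "miopenBFloat16"
register_kernel("conv2d", ["bfloat16"], conv2d_stub)
after = try_dispatch("conv2d", "bfloat16")

print(before)  # kernel 'conv2d' is not registered for bfloat16
print(after)   # ok
```

In this model, adding only the registration would still fail at the mapping lookup, which is why the PR touches the descriptor header as well as the kernel files.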

Motivation

This is a port of the same fix from PaddlePaddle/Paddle#78587 to the ROCm fork, enabling PaddleOCR-VL and other BF16 models to run on AMD ROCm GPUs using the native backend.

Testing

  • Added test_hip_bf16_conv_kernel.py with BF16 conv2d forward and grouped conv tests
  • Tests are gated behind core.is_compiled_with_rocm() check
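
A minimal sketch of what such a gated test might look like follows. It is not the contents of test_hip_bf16_conv_kernel.py: the class name, tensor shapes, and tolerances are assumptions, and it uses the public `paddle.device.is_compiled_with_rocm()` rather than the `core` module mentioned above. The BF16 output is compared against an FP32 reference computed from the same BF16-rounded inputs.

```python
import unittest

try:
    import numpy as np
    import paddle
    import paddle.nn.functional as F

    HAS_ROCM = paddle.device.is_compiled_with_rocm()
except ImportError:
    # Paddle (or NumPy) is not installed; the tests below are skipped.
    HAS_ROCM = False


@unittest.skipUnless(HAS_ROCM, "requires a Paddle build compiled with ROCm")
class TestBF16ConvSketch(unittest.TestCase):
    """Hypothetical BF16 conv2d checks; shapes and tolerances are assumptions."""

    def _check(self, groups):
        paddle.seed(0)
        # Round inputs to BF16 first so the FP32 reference sees the same values.
        x = paddle.randn([1, 4, 8, 8]).astype("bfloat16")
        w = paddle.randn([6, 4 // groups, 3, 3]).astype("bfloat16")

        out = F.conv2d(x, w, groups=groups)
        ref = F.conv2d(x.astype("float32"), w.astype("float32"), groups=groups)

        self.assertEqual(out.dtype, paddle.bfloat16)
        self.assertEqual(out.shape, [1, 6, 6, 6])
        # BF16 keeps 8 significant bits, so exact equality is not expected.
        np.testing.assert_allclose(
            out.astype("float32").numpy(), ref.numpy(), rtol=2e-2, atol=5e-2
        )

    def test_conv2d_forward(self):
        self._check(groups=1)

    def test_grouped_conv2d_forward(self):
        self._check(groups=2)


def run_sketch():
    """Run the sketch tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestBF16ConvSketch)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

On a build without ROCm, or without Paddle installed, both tests are reported as skipped rather than failed, which is the behavior the gating is meant to provide.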

cc: @PaddlePaddle/paddle-rocma

fchange and others added 2 commits April 13, 2026 10:11
Register bfloat16 data type for conv2d, conv3d, depthwise_conv2d
and their grad/double_grad kernels on HIP/ROCm platform.

Changes:
- Add BFLOAT16 case to ToCudnnDataType in miopen_desc.h
- Register phi::bfloat16 for conv2d, conv3d, depthwise_conv2d kernels
- Register phi::bfloat16 for conv2d_grad, conv3d_grad, conv2d_double_grad,
  conv3d_double_grad, depthwise_conv2d_double_grad kernels
- Add test_hip_bf16_conv_kernel.py for BF16 conv validation

This enables PaddleOCR-VL and other BF16 models to run on AMD ROCm GPUs
using the native backend.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

Register bfloat16 for layer_norm and layer_norm_grad kernels on HIP.
This is required for PaddleOCR-VL native backend which uses BF16 precision.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
fchange commented Apr 13, 2026

[two images attached]

M4jupitercannon merged commit 29d1c6f into ROCm:develop on Apr 14, 2026
1 check passed
austin1997 added a commit to austin1997/Paddle that referenced this pull request Apr 22, 2026
Restore the BF16 registrations for conv2d / conv3d / depthwise conv kernels
and the DataType::BFLOAT16 -> miopenBFloat16 mapping originally added by
ROCm#47 and reverted on paddle_hackthon ahead of RDNA4 enablement.

The change is gated at compile time by the existing #ifdef PADDLE_WITH_HIP
block. Deployment to archs that lack native BF16 support should be handled
via PADDLE_ROCM_OFFLOAD_ARCHS (paddle_hackthon's default list already
covers the BF16-capable set: CDNA3/gfx942, CDNA4/gfx950, RDNA3/gfx1100-
1102, RDNA4/gfx1200-1201); if a downstream target needs to strip BF16 from
the build, it can narrow the offload-arch list accordingly. No runtime
arch queries are introduced.
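
As a rough illustration of the "narrow the offload-arch list" advice, the sketch below checks a candidate list against the BF16-capable set named in this commit message. The helper names are hypothetical, and the semicolon-separated format for the PADDLE_ROCM_OFFLOAD_ARCHS value is an assumption, not something stated above.

```python
# Architectures listed as BF16-capable in the commit message above.
BF16_CAPABLE_ARCHS = frozenset({
    "gfx942",                          # CDNA3
    "gfx950",                          # CDNA4
    "gfx1100", "gfx1101", "gfx1102",   # RDNA3
    "gfx1200", "gfx1201",              # RDNA4
})


def split_archs(value):
    """Split an offload-arch string; the ';' separator is an assumption."""
    return [arch.strip() for arch in value.split(";") if arch.strip()]


def archs_outside_listed_set(value):
    """Return the architectures in `value` that are not in the listed set."""
    return [arch for arch in split_archs(value) if arch not in BF16_CAPABLE_ARCHS]


print(archs_outside_listed_set("gfx942;gfx1100;gfx906"))  # ['gfx906']
print(archs_outside_listed_set("gfx950;gfx1201"))         # []
```

A non-empty result only means the architecture is absent from the list quoted in the commit message; whether a given target actually lacks BF16 support has to be confirmed separately.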