From b901168169f41e448091cc70e1150a384f494ad5 Mon Sep 17 00:00:00 2001 From: yugang-amd Date: Fri, 7 Nov 2025 13:47:41 -0500 Subject: [PATCH 1/4] 30.20.0 post release updates --- docs/compatibility/compatibility-matrix.rst | 3 +- docs/documentation/release-notes.md | 133 ++++++++++++++++++++ 2 files changed, 134 insertions(+), 2 deletions(-) diff --git a/docs/compatibility/compatibility-matrix.rst b/docs/compatibility/compatibility-matrix.rst index 8f652cd..29cbe2a 100644 --- a/docs/compatibility/compatibility-matrix.rst +++ b/docs/compatibility/compatibility-matrix.rst @@ -6,8 +6,7 @@ Compatibility matrix ************************************************************************************** -The AMD GPU Driver (amdgpu) 30.10.2 is compatible with ROCm 7.0.x, 6.4.x, 6.3.x, and -6.2.x. For more information, see `User and kernel-space support matrix +The AMD GPU Driver (amdgpu) 30.10.2 is compatible with ROCm 7.1.x, 7.0.x, 6.4.x, and 6.3.x. For more information, see `User and kernel-space support matrix `__. ====================================== diff --git a/docs/documentation/release-notes.md b/docs/documentation/release-notes.md index a40dac8..10130bd 100644 --- a/docs/documentation/release-notes.md +++ b/docs/documentation/release-notes.md @@ -25,3 +25,136 @@ AMD GPU Driver 30.20.0 introduces support for Node Power Management (NPM) on AMD ## Resolved issues Resolved an issue where the GPU failed to recover after RAS (Reliability, Availability, and Serviceability) poison consumption. The fix applies to all AMD Instinct MI300 and MI350 Series GPUs. + +## AMD GPU Driver (amdgpu) 30.10.2 release notes + +The release notes provide release highlights and known issues since the previous AMD GPU Driver release (30.10.1). + +### Release highlights + +The following are notable new features and improvements in AMD GPU Driver 30.10.2. 
+ +#### Operating system and hardware support changes + +The AMD GPU Driver 30.10.2 introduces support for the following operating systems: + +* Debian 13 (kernel: 6.12) +* Oracle Linux 10 (kernel: 6.12.0 [UEK]) +* RHEL 10.0 (kernel: 6.12.0-55) + +For the compatibility between AMD GPU Driver, ROCm, GPUs, and OS, see the [Compatibility matrix](../compatibility/compatibility-matrix.rst). + +#### GPU resiliency + +Multimedia Engine Reset is now supported in AMD GPU Driver (amdgpu) 30.10.2 for AMD Instinct MI300X GPUs. This finer-grain GPU resiliency feature allows recovery from faults related to VCN or JPEG without requiring a full GPU reset, thereby improving system stability and fault tolerance. Note that VCN queue reset functionality requires PLDM bundle 01.25.05.00 (or later) firmware. + +#### PCIe error recovery + +Downstream Port Containment (DPC) and Advanced Error Reporting (AER) are PCIe mechanisms that work together to detect and recover from hardware errors. Enabling DPC for AER allows the system to isolate faulty PCIe devices and recover gracefully without crashing. This feature is supported on the AMD Instinct MI300X, MI300A, and MI325X GPUs. + +### Known issues + +ROCm 7.0.2 and AMD GPU Driver (amdgpu) 30.10.2 have a known multimedia issue with AMD Instinct MI300A GPUs when paired with BKC 26. As a result, subsequent multimedia jobs might become unstable after the multimedia engine is reset to recover from a VCN-related fault. As a workaround, continue using BKC 25 or older firmware until the issue is fixed in an upcoming release. + +## AMD GPU Driver (amdgpu) 30.10.1 release notes + +AMD GPU Driver (amdgpu) 30.10.1 is a quality release that resolves the issue listed in the Release highlights. + +### Release highlights + +The following issue has been resolved in the AMD GPU Driver (amdgpu) 30.10.1 to be used with ROCm 7.0.1. 
+ +#### Failure to declare out-of-band CPERs for bad memory page + +The issue of failing to declare Out-Of-Band Common Platform Error Records (CPERs) when exceeding the bad memory page threshold has been resolved. The fix applies to all AMD Instinct MI300 Series and MI350 Series GPUs. + +```{note} +AMD GPU Driver (amdgpu) 30.10.1 doesn't include any other significant changes or feature additions. For comprehensive changes in the previous release, refer to the [AMD GPU Driver (amdgpu) 30.10 release notes](#amd-gpu-driver-amdgpu-30-10-release-notes) below. +``` + +## AMD GPU Driver (amdgpu) 30.10 release notes + +The release notes provide a summary of notable changes since the previous AMD GPU Driver release. + +### Release highlights + +The following are notable new features and improvements in AMD GPU Driver 30.10. + +#### Operating system and hardware support changes + +The AMD GPU Driver 30.10 adds support for [AMD Instinct +MI355X](https://www.amd.com/en/products/accelerators/instinct/mi350/mi355x.html) and +[MI350X](https://www.amd.com/en/products/accelerators/instinct/mi350/mi350x.html) accelerators. + +AMD GPU Driver 30.10 also introduces support for the following operating systems: + +* Rocky 9 + +* Ubuntu 24.04.3 + +AMD GPU Driver 30.10 marks end-of-support (EOS) for Ubuntu 24.04.2. For the compatibility between +AMD GPU Driver, ROCm, GPUs, and OS, see the [Compatibility matrix](../../compatibility/compatibility-matrix.rst). + +#### Partitioning + +The AMD GPU Driver 30.10 adds the following memory and compute partitioning support: + +* **NPS1 + SPX partitioning for AMD Instinct MI355X and MI350X**: This memory partitioning mode exposes the + entire memory to all compute dies (XCDs), allowing full access across the GPU. In SPX (Single + Partition Compute Mode), workgroups are distributed round-robin across all XCDs. There’s no explicit + control over which XCD executes a given kernel, making it simple and general-purpose. 
This feature + requires PLDM bundle (firmware) 01.25.13.04. + +* **NPS2 + DPX partitioning for AMD Instinct MI355X and MI350X**: NPS2 splits the GPU’s memory into two NUMA + domains. Dual Partition Compute Mode (DPX) divides the GPU’s compute resources into two partitions, + each with 4 XCDs (out of 8 total), 8 DMA engines, and 2 VCN decoder groups. This feature requires + PLDM bundle (firmware) 01.25.13.04. + +#### GPU resiliency + +The following GPU resiliency feature is supported in the AMD GPU Driver 30.10 for AMD Instinct MI300X, MI350X, and MI355X: + +* SDMA engine reset enables recovery from SDMA-related faults without requiring a full GPU reset, + improving system stability and fault tolerance. + +#### Program counter (PC) sampling + +The AMD GPU Driver 30.10.0 adds support for Stochastic (hardware-based) and Host-trap PC sampling, a +GPU profiling technique used for analyzing kernel execution performance. + +* **Stochastic PC sampling**: This method randomly triggers wave traps across compute units to + capture program counter (PC) snapshots. This method introduces randomness in wave selection, + enabling broader statistical coverage of kernel execution behavior. This feature is supported on the + MI300-Series GPUs (including MI300A, MI300X, MI325X, MI350X, and MI355X). + +* **Host-trap PC Sampling**: This method allows controlled, device-wide profiling. It works by + periodically selecting active wave slots across compute units and triggering a trap handler to + capture the program counter (PC), producing a histogram of sampled instructions. This feature is + supported on the MI200-Series GPUs (including MI210, MI250, and MI250X) and MI300-Series GPUs + (including MI300A, MI300X, MI325X, MI350X, and MI355X). 
+ +This feature can be accessed either through the ROCprofiler method or directly via the ROCm Runtime +vendor extension APIs, which are defined in the `hsa_ven_amd_pc_sampling.h` header as follows: + +```c +hsa_status_t hsa_ven_amd_pcs_create(hsa_agent_t agent, hsa_ven_amd_pcs_method_kind_t method, + hsa_ven_amd_pcs_units_t units, size_t interval, size_t latency, + size_t buffer_size, + hsa_ven_amd_pcs_data_ready_callback_t data_ready_callback, + void* client_callback_data, hsa_ven_amd_pcs_t* pc_sampling); +hsa_status_t hsa_ven_amd_pcs_destroy(hsa_ven_amd_pcs_t pc_sampling); +hsa_status_t hsa_ven_amd_pcs_start(hsa_ven_amd_pcs_t pc_sampling); +hsa_status_t hsa_ven_amd_pcs_stop(hsa_ven_amd_pcs_t pc_sampling); +hsa_status_t hsa_ven_amd_pcs_flush(hsa_ven_amd_pcs_t pc_sampling); +``` + +### Known issues + +Exceeding bad memory page threshold fails to declare Out-Of-Band Common +Platform Error Records (CPERs). This issue affects all AMD Instinct MI350 +Series and MI300 Series GPUs, and will be fixed in a future AMD GPU +Driver release. + +### Resolved issues + +Issue with restoring a CRIU checkpoint for workloads on AMD Instinct MI Series GPUs is resolved. \ No newline at end of file From d3f7c0fbf513aa4caa43a77f122a16c57d82767e Mon Sep 17 00:00:00 2001 From: yugang-amd Date: Fri, 7 Nov 2025 13:51:39 -0500 Subject: [PATCH 2/4] attempt to fix linting error --- docs/documentation/release-notes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/documentation/release-notes.md b/docs/documentation/release-notes.md index 10130bd..41e33dc 100644 --- a/docs/documentation/release-notes.md +++ b/docs/documentation/release-notes.md @@ -157,4 +157,4 @@ Driver release. ### Resolved issues -Issue with restoring a CRIU checkpoint for workloads on AMD Instinct MI Series GPUs is resolved. \ No newline at end of file +Issue with restoring a CRIU checkpoint for workloads on AMD Instinct MI Series GPUs is resolved. 
From f41ba7c86e78ce0e98237b39f281146b84422b8e Mon Sep 17 00:00:00 2001 From: yugang-amd Date: Fri, 7 Nov 2025 17:24:18 -0500 Subject: [PATCH 3/4] Update docs/compatibility/compatibility-matrix.rst Co-authored-by: Pratik Basyal --- docs/compatibility/compatibility-matrix.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/compatibility/compatibility-matrix.rst b/docs/compatibility/compatibility-matrix.rst index 29cbe2a..aa1060d 100644 --- a/docs/compatibility/compatibility-matrix.rst +++ b/docs/compatibility/compatibility-matrix.rst @@ -6,7 +6,7 @@ Compatibility matrix ************************************************************************************** -The AMD GPU Driver (amdgpu) 30.10.2 is compatible with ROCm 7.1.x, 7.0.x, 6.4.x, and 6.3.x. For more information, see `User and kernel-space support matrix +The AMD GPU Driver (amdgpu) 30.20.0 is compatible with ROCm 7.1.x, 7.0.x, 6.4.x, and 6.3.x. For more information, see `User and kernel-space support matrix `__. ====================================== From d518cb6d28bbf034e920998d58c6d87cc14aa0a5 Mon Sep 17 00:00:00 2001 From: yugang-amd Date: Fri, 7 Nov 2025 17:26:08 -0500 Subject: [PATCH 4/4] fix link --- docs/documentation/release-notes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/documentation/release-notes.md b/docs/documentation/release-notes.md index 41e33dc..9b079a3 100644 --- a/docs/documentation/release-notes.md +++ b/docs/documentation/release-notes.md @@ -93,7 +93,7 @@ AMD GPU Driver 30.10 also introduces support for the following operating systems * Ubuntu 24.04.3 AMD GPU Driver 30.10 marks end-of-support (EOS) for Ubuntu 24.04.2. For the compatibility between -AMD GPU Driver, ROCm, GPUs, and OS, see the [Compatibility matrix](../../compatibility/compatibility-matrix.rst). +AMD GPU Driver, ROCm, GPUs, and OS, see the [Compatibility matrix](../compatibility/compatibility-matrix.rst). #### Partitioning