GPUDiag is a Python-based diagnostic utility designed for High-Performance Computing (HPC) environments running NVIDIA Data Center GPUs (e.g., A100, H100, H800). It performs a comprehensive health check on the GPU ecosystem, identifying common hardware, software, and configuration issues that cause training instability.
GPUDiag automates the manual troubleshooting process by checking:
- GPU Drop Detection: Compares physical PCIe devices (
lspci) against the driver-recognized count (nvidia-smi). - Version Compatibility:
- Verifies NVIDIA Driver vs. Fabric Manager version consistency (crucial for NVLink).
- Checks if NVCC version exceeds the driver's supported CUDA version.
- Hardware Health:
- Detects PCIe link width degradation (e.g., running at x8 instead of x16).
- Monitors Temperature (>85Β°C) and Uncorrected ECC errors.
- NVLink Integrity: Checks for inactive links and accumulates error counters (CRC, Recovery, Fatal).
- Process Hygiene: Identifies "Zombie Processes" (PIDs holding VRAM that no longer exist in the OS).
- Critical Logs: Scans
dmesgfor recent critical Xid errors (GPU hardware failures). - Network Status: Checks RDMA/InfiniBand port status (
ibv_devinfo).
- OS: Linux (Tested on Ubuntu/CentOS).
- Python: Python 3.6+.
- Permissions: Root privileges are required (to access
dmesg,lspci, and system services). - Dependencies:
nvidia-smilspci(pciutils)ibv_devinfo(infiniband-diags) - Optional, for RDMA check
Simply clone the repository or download the script directly. No complex pip installation is required.
git clone https://github.com/YourUsername/GPUDiag.git
cd GPUDiag
chmod +x GPUDiag.pyRun the tool using sudo. The tool outputs a JSON report followed by a human-readable summary.
sudo ./GPUDiag.py
| Category | Check Item | Pass Criteria | Fail/Warning Condition |
|---|---|---|---|
| System | Root Privileges | os.getuid() == 0 |
Script aborts if not run with sudo. |
| Hardware | GPU Drop Detection | lspci count == nvidia-smi count |
FAIL: Driver sees fewer GPUs than physically installed on PCIe bus. |
| Hardware | PCIe Health | PCIe Current Width == Max Width (e.g., x16) | FAIL: Link degradation detected (e.g., operating at x8 or x4). |
| Hardware | Thermals & ECC | Temp β€ 85Β°C, Uncorrected ECC == 0 | WARNING: Temp > 85Β°C FAIL: Uncorrected ECC errors > 0. |
| Software | Fabric Manager | Service Active & Version matches Driver | FAIL: Service inactive OR Major version mismatch (e.g., Driver 535 vs FM 525). |
| Software | CUDA Compatibility | nvcc version β€ Driver supported CUDA version |
FAIL: Compiler version is too new for the installed driver. |
| Interconnect | NVLink Status | Link Status != "Inactive", Error Counts == 0 | WARNING: Inactive links detected. FAIL: CRC/Recovery/Fatal errors > 0. |
| Process | Zombie Processes | All GPU-consuming PIDs exist in /proc |
FAIL: Process holding VRAM found in nvidia-smi but does not exist in OS. |
| Logs | Xid Errors | No recent "NVRM: Xid" in dmesg |
FAIL: Critical hardware error patterns (Xid) found in kernel logs. |
| Network | RDMA Status | InfiniBand ports state == "PORT_ACTIVE" | FAIL: RDMA ports found in "DOWN" state. |
For issues or questions, contact:
Frank Zhu flankeroot@gmail.com