Skip to content

FrankZhu888/GPUDiag

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

6 Commits
Β 
Β 
Β 
Β 

Repository files navigation

GPUDiag - HPC GPU Server Diagnostic Tool

GPUDiag is a Python-based diagnostic utility designed for High-Performance Computing (HPC) environments running NVIDIA Data Center GPUs (e.g., A100, H100, H800). It performs a comprehensive health check on the GPU ecosystem, identifying common hardware, software, and configuration issues that cause training instability.

πŸš€ Key Features

GPUDiag automates the manual troubleshooting process by checking:

  • GPU Drop Detection: Compares physical PCIe devices (lspci) against the driver-recognized count (nvidia-smi).
  • Version Compatibility:
    • Verifies NVIDIA Driver vs. Fabric Manager version consistency (crucial for NVLink).
    • Checks if NVCC version exceeds the driver's supported CUDA version.
  • Hardware Health:
    • Detects PCIe link width degradation (e.g., running at x8 instead of x16).
    • Monitors Temperature (>85Β°C) and Uncorrected ECC errors.
  • NVLink Integrity: Checks for inactive links and accumulates error counters (CRC, Recovery, Fatal).
  • Process Hygiene: Identifies "Zombie Processes" (PIDs holding VRAM that no longer exist in the OS).
  • Critical Logs: Scans dmesg for recent critical Xid errors (GPU hardware failures).
  • Network Status: Checks RDMA/InfiniBand port status (ibv_devinfo).

πŸ“‹ Prerequisites

  • OS: Linux (Tested on Ubuntu/CentOS).
  • Python: Python 3.6+.
  • Permissions: Root privileges are required (to access dmesg, lspci, and system services).
  • Dependencies:
    • nvidia-smi
    • lspci (pciutils)
    • ibv_devinfo (infiniband-diags) - Optional, for RDMA check

πŸ› οΈ Installation

Simply clone the repository or download the script directly. No complex pip installation is required.

git clone https://github.com/YourUsername/GPUDiag.git
cd GPUDiag
chmod +x GPUDiag.py

Usage

Run the tool using sudo. The tool outputs a JSON report followed by a human-readable summary.

sudo ./GPUDiag.py

Output Example

ζˆͺ屏2025-12-25 11 50 43 ζˆͺ屏2025-12-25 11 51 10

πŸ” Diagnostic Logic

Category Check Item Pass Criteria Fail/Warning Condition
System Root Privileges os.getuid() == 0 Script aborts if not run with sudo.
Hardware GPU Drop Detection lspci count == nvidia-smi count FAIL: Driver sees fewer GPUs than physically installed on PCIe bus.
Hardware PCIe Health PCIe Current Width == Max Width (e.g., x16) FAIL: Link degradation detected (e.g., operating at x8 or x4).
Hardware Thermals & ECC Temp ≀ 85Β°C, Uncorrected ECC == 0 WARNING: Temp > 85Β°C
FAIL: Uncorrected ECC errors > 0.
Software Fabric Manager Service Active & Version matches Driver FAIL: Service inactive OR Major version mismatch (e.g., Driver 535 vs FM 525).
Software CUDA Compatibility nvcc version ≀ Driver supported CUDA version FAIL: Compiler version is too new for the installed driver.
Interconnect NVLink Status Link Status != "Inactive", Error Counts == 0 WARNING: Inactive links detected.
FAIL: CRC/Recovery/Fatal errors > 0.
Process Zombie Processes All GPU-consuming PIDs exist in /proc FAIL: Process holding VRAM found in nvidia-smi but does not exist in OS.
Logs Xid Errors No recent "NVRM: Xid" in dmesg FAIL: Critical hardware error patterns (Xid) found in kernel logs.
Network RDMA Status InfiniBand ports state == "PORT_ACTIVE" FAIL: RDMA ports found in "DOWN" state.

Support Contact

For issues or questions, contact:

Frank Zhu flankeroot@gmail.com

About

GPUDiag is a lightweight, all-in-one diagnostic tool designed for HPC GPU clusters to automatically detect hardware drops, version incompatibilities, NVLink errors, zombie processes, and critical Xid faults.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages