Add AMD/ROCm GPU cluster support #80
Conversation
Implement ROCm stack support alongside existing NVIDIA tooling so GCM
can monitor and health-check AMD GPU nodes using amd-smi or rocm-smi.
- Add ROCm device telemetry backend (device_telemetry_rocm.py)
- DeviceTelemetryClient/GPUDevice via amd-smi (preferred) or rocm-smi
- Subprocess + JSON; map metrics to existing schema; safe defaults for
ECC/retired pages/row remap/vbios
- Extend get_gpu_devices() to use ROCR_VISIBLE_DEVICES for prolog/epilog
(after SLURM_JOB_GPUS and CUDA_VISIBLE_DEVICES)
- Add AMD SMI health check (check_amd_smi): gpu_num, running_procs,
running_procs_and_kill, clock_freq, gpu_temperature, gpu_mem_usage;
use EnvCtx(ROCR_VISIBLE_DEVICES) where needed
- Add rocm_monitor CLI (gcm rocm_monitor) mirroring nvml_monitor
- Add HealthCheckName.AMD_SMI_* and disable_amd_smi_* feature flags
- Config: [health_checks.check-amd-smi], feature_example.toml
- systemd: rocm_monitor.service and fair_cluster_rocm_resources.slice
- Optional pyproject extra: rocm (no extra deps; amd-smi/rocm-smi on PATH)
- Tests: ROCm telemetry, get_gpu_devices (ROCR_VISIBLE_DEVICES),
check_amd_smi, rocm_monitor
- Docs: README Possible Expansions, website rocm_monitor collector
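The telemetry backend described above maps amd-smi/rocm-smi output onto the existing NVIDIA-oriented schema, substituting safe defaults for fields ROCm tooling may not report. A minimal sketch of that mapping step; the field names and the `map_rocm_record` helper are illustrative, not the actual GPUDevice schema:

```python
# Illustrative safe defaults for fields the NVIDIA path exposes but
# amd-smi/rocm-smi may not report (ECC, retired pages, row remap, VBIOS).
SAFE_DEFAULTS = {
    "ecc_errors": 0,
    "retired_pages": 0,
    "row_remap_pending": False,
    "vbios_version": "unknown",
}

def map_rocm_record(record: dict) -> dict:
    """Map one parsed amd-smi JSON record onto a GPUDevice-style dict.

    Missing ROCm-side fields fall back to SAFE_DEFAULTS so downstream
    health checks see the same keys as on NVIDIA nodes.
    """
    device = dict(SAFE_DEFAULTS)
    device["index"] = record.get("gpu", 0)
    device["uuid"] = record.get("uuid", "")
    return device
```

Any key present in the amd-smi record overrides nothing here; only the index and UUID are copied, which keeps the example focused on the safe-default behavior.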
Hi @coketaste! Thank you for your pull request and welcome to our community.

Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
CI Commands
The following CI workflows run automatically on every push and pull request:
The following commands can be used by maintainers to trigger additional tests that require access to secrets:
…=..., doc=e.doc, pos=e.pos) everywhere. The single skipped test is the known, intentional skip for fsacct in test_cli.
…rocm_monitor
- device_telemetry_rocm: Extract the JSON payload from amd-smi stdout when it prints warnings or other non-JSON text before it; handle the top-level array response from "amd-smi list --json"; raise a clear error when no JSON is found.
- rocm_monitor: Guard the memory and power percent calculations against zero denominators (memory_total or power_limit) to prevent ZeroDivisionError.
- .gitignore: Ignore *.err (e.g. SLURM stderr logs).
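The two fixes above can be sketched as follows; the function names `extract_json` and `percent` are illustrative, not the actual helpers in the PR. The extraction scans for the first position where a JSON object or array parses cleanly, which tolerates warning lines that themselves contain brackets:

```python
import json

def extract_json(stdout: str):
    """Pull the JSON payload out of amd-smi output that may be preceded
    by warnings or other non-JSON text (illustrative sketch)."""
    for i, ch in enumerate(stdout):
        if ch in "{[":
            try:
                # Handles both a top-level object and the top-level
                # array returned by "amd-smi list --json".
                return json.loads(stdout[i:])
            except json.JSONDecodeError:
                continue  # bracket belonged to a warning line; keep scanning
    raise ValueError("no JSON payload found in amd-smi output")

def percent(used: float, total: float) -> float:
    """Guard against zero totals (e.g. power_limit reported as 0)."""
    return 100.0 * used / total if total else 0.0
```

Returning 0.0 for a zero total is one reasonable policy; reporting the metric as unavailable would be another, depending on how the monitor's schema treats missing values.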
check-nccl. It runs RCCL (ROCm Communication Collectives Library) tests (e.g. from ROCm/rccl-tests) to validate collective performance and correctness on AMD GPU nodes.
- Add check_rccl command: single/pairwise/pairwise-quick flavors, --rccl-tdir, --rccl-topts, --critical-threshold, --warn-threshold, ROCm-friendly default env (HSA_FORCE_FINE_GRAIN_PCIE, GPU_DEVICE_ORDINAL, etc.)
- Schema: RCCL_TESTS in HealthCheckName; feature flag disable_rccl_tests
- Register check_rccl in health_checks CLI and checks __init__
- Tests: test_check_rccl.py (get_avg_bus_bw, process_rccl_test_output, check_rccl success/failure/exception, get_hosts); killswitch test
- Docs: check-rccl section in health_checks README; config.toml example
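The check compares a bandwidth figure parsed from the rccl-tests output against the warn/critical thresholds. A minimal sketch of the `get_avg_bus_bw` step exercised by the tests above, assuming rccl-tests emits the same "Avg bus bandwidth" summary line format as nccl-tests:

```python
import re

def get_avg_bus_bw(output: str) -> float:
    """Parse the average bus bandwidth (GB/s) from rccl-tests output.

    Assumes the nccl-tests-style summary line, e.g.:
        # Avg bus bandwidth    : 42.5
    """
    match = re.search(r"Avg bus bandwidth\s*:\s*([0-9.]+)", output)
    if match is None:
        raise ValueError("no 'Avg bus bandwidth' line in rccl-tests output")
    return float(match.group(1))
```

The returned value would then be compared against --warn-threshold and --critical-threshold to decide the health-check status.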