
Add AMD/ROCm GPU cluster support#80

Draft
coketaste wants to merge 4 commits into facebookresearch:main from coketaste:coketaste/gcm-amd-gpu-rocm

Conversation

@coketaste

Implement ROCm stack support alongside existing NVIDIA tooling so GCM can monitor and health-check AMD GPU nodes using amd-smi or rocm-smi.

  • Add ROCm device telemetry backend (device_telemetry_rocm.py)
    • DeviceTelemetryClient/GPUDevice via amd-smi (preferred) or rocm-smi
    • Subprocess + JSON; map metrics to existing schema; safe defaults for ECC/retired pages/row remap/vbios
  • Extend get_gpu_devices() to use ROCR_VISIBLE_DEVICES for prolog/epilog (after SLURM_JOB_GPUS and CUDA_VISIBLE_DEVICES)
  • Add AMD SMI health check (check_amd_smi): gpu_num, running_procs, running_procs_and_kill, clock_freq, gpu_temperature, gpu_mem_usage; use EnvCtx(ROCR_VISIBLE_DEVICES) where needed
  • Add rocm_monitor CLI (gcm rocm_monitor) mirroring nvml_monitor
  • Add HealthCheckName.AMD_SMI_* and disable_amd_smi_* feature flags
  • Config: [health_checks.check-amd-smi], feature_example.toml
  • systemd: rocm_monitor.service and fair_cluster_rocm_resources.slice
  • Optional pyproject extra: rocm (no extra deps; amd-smi/rocm-smi on PATH)
  • Tests: ROCm telemetry, get_gpu_devices (ROCR_VISIBLE_DEVICES), check_amd_smi, rocm_monitor
  • Docs: README Possible Expansions, website rocm_monitor collector
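The telemetry backend and device-discovery behavior described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `GPUDevice`, `parse_device_list`, and `visible_gpu_ids` are hypothetical names, and the real GCM schema and field mapping will differ.

```python
import json
import os
import shutil
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical record type; the real GPUDevice schema in GCM is richer
# (ECC counters, retired pages, row remap, vbios, with safe defaults).
@dataclass
class GPUDevice:
    index: int
    uuid: str

def pick_smi_command() -> List[str]:
    # Prefer amd-smi, fall back to rocm-smi, per the description above.
    if shutil.which("amd-smi"):
        return ["amd-smi", "list", "--json"]
    if shutil.which("rocm-smi"):
        return ["rocm-smi", "--json"]
    raise RuntimeError("neither amd-smi nor rocm-smi found on PATH")

def parse_device_list(stdout: str) -> List[GPUDevice]:
    # "amd-smi list --json" may emit a top-level array of per-GPU entries.
    payload = json.loads(stdout)
    entries = payload if isinstance(payload, list) else payload.get("gpus", [])
    return [
        GPUDevice(index=i, uuid=str(e.get("uuid", "")))
        for i, e in enumerate(entries)
    ]

def visible_gpu_ids(env: Optional[Dict[str, str]] = None) -> List[str]:
    # Fallback order from the description: SLURM_JOB_GPUS first, then
    # CUDA_VISIBLE_DEVICES, then ROCR_VISIBLE_DEVICES for AMD nodes.
    env = dict(os.environ) if env is None else env
    for var in ("SLURM_JOB_GPUS", "CUDA_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES"):
        val = env.get(var)
        if val:
            return val.split(",")
    return []
```

Keeping the subprocess invocation (`pick_smi_command`) separate from the pure parsing step makes the JSON mapping unit-testable without AMD hardware, which matches the PR's test list.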

@meta-cla

meta-cla bot commented Mar 2, 2026

Hi @coketaste!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@github-actions

github-actions bot commented Mar 2, 2026

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
| --- | --- |
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
| --- | --- | --- |
| /metaci tests | Runs Meta internal integration tests (pytest) | Yes; a maintainer must trigger the command and approve the deployment request |
| /metaci integration tests | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

jenkins and others added 3 commits March 2, 2026 02:34
…=..., doc=e.doc, pos=e.pos) everywhere. The single skipped test is the known, intentional skip for fsacct in test_cli.
…rocm_monitor

- device_telemetry_rocm: Extract JSON from amd-smi stdout when it prints warnings or non-JSON text before the payload; handle top-level array response from "amd-smi list --json"; raise clear error when no JSON.

- rocm_monitor: Guard memory and power percent calculations against zero (memory_total or power_limit) to prevent ZeroDivisionError.

- .gitignore: Ignore *.err (e.g. SLURM stderr logs).
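The two fixes in this commit message can be sketched as below. This is an illustrative reconstruction, not the commit's actual code: the function names `extract_json` and `percent` are assumed, and the real implementation lives in `device_telemetry_rocm.py` and `rocm_monitor`.

```python
import json

def extract_json(stdout: str):
    """Find the first JSON object or array in mixed tool output.

    amd-smi can print warnings or other non-JSON text before the payload,
    so scan for the first '{' or '[' and try decoding from there.
    Handles both top-level objects and the top-level array that
    "amd-smi list --json" can return.
    """
    decoder = json.JSONDecoder()
    for i, ch in enumerate(stdout):
        if ch in "{[":
            try:
                obj, _ = decoder.raw_decode(stdout[i:])
                return obj
            except json.JSONDecodeError:
                continue
    raise ValueError("no JSON payload found in tool output")

def percent(used: float, total: float) -> float:
    # Guard against zero totals (memory_total or power_limit can be 0
    # on some devices) to prevent ZeroDivisionError.
    return 100.0 * used / total if total else 0.0
```

`json.JSONDecoder.raw_decode` is handy here because it decodes a value from the start of a string and ignores trailing text, so anything amd-smi prints after the payload is tolerated too.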
check-nccl. It runs RCCL (ROCm Communication Collectives Library)
tests (e.g. from ROCm/rccl-tests) to validate collective performance
and correctness on AMD GPU nodes.

- Add check_rccl command: single/pairwise/pairwise-quick flavors,
  --rccl-tdir, --rccl-topts, --critical-threshold, --warn-threshold,
  ROCm-friendly default env (HSA_FORCE_FINE_GRAIN_PCIE, GPU_DEVICE_ORDINAL, etc.)
- Schema: RCCL_TESTS in HealthCheckName; feature flag disable_rccl_tests
- Register check_rccl in health_checks CLI and checks __init__
- Tests: test_check_rccl.py (get_avg_bus_bw, process_rccl_test_output,
  check_rccl success/failure/exception, get_hosts); killswitch test
- Docs: check-rccl section in health_checks README; config.toml example
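The bandwidth parsing and threshold logic described above can be sketched as follows. This is a hedged illustration: the summary-line format is assumed from nccl-tests/rccl-tests output, and `classify` is a hypothetical stand-in for how `--warn-threshold` and `--critical-threshold` might be applied.

```python
import re

def get_avg_bus_bw(output: str) -> float:
    # rccl-tests (like nccl-tests) prints a summary line such as
    # "# Avg bus bandwidth    : 123.45" (format assumed here).
    m = re.search(r"Avg bus bandwidth\s*:\s*([0-9.]+)", output)
    if not m:
        raise ValueError("no average bus bandwidth line in output")
    return float(m.group(1))

def classify(bw: float, warn: float, critical: float) -> str:
    # Below the critical threshold the check fails; below the warn
    # threshold it warns; otherwise the node passes.
    if bw < critical:
        return "critical"
    if bw < warn:
        return "warn"
    return "ok"
```

Separating the output parsing from the threshold decision keeps both pieces testable without AMD hardware, mirroring the `get_avg_bus_bw` / `process_rccl_test_output` split the commit's test list suggests.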
