Add AMD/ROCm GPU cluster support #80
Conversation
Implement ROCm stack support alongside existing NVIDIA tooling so GCM
can monitor and health-check AMD GPU nodes using amd-smi or rocm-smi.
- Add ROCm device telemetry backend (device_telemetry_rocm.py)
- DeviceTelemetryClient/GPUDevice via amd-smi (preferred) or rocm-smi
- Subprocess + JSON; map metrics to existing schema; safe defaults for
ECC/retired pages/row remap/vbios
- Extend get_gpu_devices() to use ROCR_VISIBLE_DEVICES for prolog/epilog
(after SLURM_JOB_GPUS and CUDA_VISIBLE_DEVICES)
- Add AMD SMI health check (check_amd_smi): gpu_num, running_procs,
running_procs_and_kill, clock_freq, gpu_temperature, gpu_mem_usage;
use EnvCtx(ROCR_VISIBLE_DEVICES) where needed
- Add rocm_monitor CLI (gcm rocm_monitor) mirroring nvml_monitor
- Add HealthCheckName.AMD_SMI_* and disable_amd_smi_* feature flags
- Config: [health_checks.check-amd-smi], feature_example.toml
- systemd: rocm_monitor.service and fair_cluster_rocm_resources.slice
- Optional pyproject extra: rocm (no extra deps; amd-smi/rocm-smi on PATH)
- Tests: ROCm telemetry, get_gpu_devices (ROCR_VISIBLE_DEVICES),
check_amd_smi, rocm_monitor
- Docs: README Possible Expansions, website rocm_monitor collector
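The telemetry backend described above maps amd-smi/rocm-smi output onto the existing NVIDIA-oriented schema, substituting safe defaults for fields ROCm tooling may not report. A minimal sketch of that mapping step; the field names and the `map_rocm_record` helper are illustrative, not the actual GPUDevice schema:

```python
# Illustrative safe defaults for fields the NVIDIA path exposes but
# amd-smi/rocm-smi may not report (ECC, retired pages, row remap, VBIOS).
SAFE_DEFAULTS = {
    "ecc_errors": 0,
    "retired_pages": 0,
    "row_remap_pending": False,
    "vbios_version": "unknown",
}

def map_rocm_record(record: dict) -> dict:
    """Map one parsed amd-smi JSON record onto a GPUDevice-style dict.

    Missing ROCm-side fields fall back to SAFE_DEFAULTS so downstream
    health checks see the same keys as on NVIDIA nodes.
    """
    device = dict(SAFE_DEFAULTS)
    device["index"] = record.get("gpu", 0)
    device["uuid"] = record.get("uuid", "")
    return device
```

Any key present in the amd-smi record overrides nothing here; only the index and UUID are copied, which keeps the example focused on the safe-default behavior.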
Hi @coketaste! Thank you for your pull request and welcome to our community.

Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
CI Commands
The following CI workflows run automatically on every push and pull request:
The following commands can be used by maintainers to trigger additional tests that require access to secrets:
…=..., doc=e.doc, pos=e.pos) everywhere. The single skipped test is the known, intentional skip for fsacct in test_cli.
…rocm_monitor
- device_telemetry_rocm: Extract the JSON payload from amd-smi stdout when it prints warnings or other non-JSON text before it; handle the top-level array response from "amd-smi list --json"; raise a clear error when no JSON is found.
- rocm_monitor: Guard the memory and power percent calculations against zero denominators (memory_total or power_limit) to prevent ZeroDivisionError.
- .gitignore: Ignore *.err (e.g. SLURM stderr logs).
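The two fixes above can be sketched as follows; the function names `extract_json` and `percent` are illustrative, not the actual helpers in the PR. The extraction scans for the first position where a JSON object or array parses cleanly, which tolerates warning lines that themselves contain brackets:

```python
import json

def extract_json(stdout: str):
    """Pull the JSON payload out of amd-smi output that may be preceded
    by warnings or other non-JSON text (illustrative sketch)."""
    for i, ch in enumerate(stdout):
        if ch in "{[":
            try:
                # Handles both a top-level object and the top-level
                # array returned by "amd-smi list --json".
                return json.loads(stdout[i:])
            except json.JSONDecodeError:
                continue  # bracket belonged to a warning line; keep scanning
    raise ValueError("no JSON payload found in amd-smi output")

def percent(used: float, total: float) -> float:
    """Guard against zero totals (e.g. power_limit reported as 0)."""
    return 100.0 * used / total if total else 0.0
```

Returning 0.0 for a zero total is one reasonable policy; reporting the metric as unavailable would be another, depending on how the monitor's schema treats missing values.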
check-nccl. It runs RCCL (ROCm Communication Collectives Library) tests (e.g. from ROCm/rccl-tests) to validate collective performance and correctness on AMD GPU nodes.
- Add check_rccl command: single/pairwise/pairwise-quick flavors, --rccl-tdir, --rccl-topts, --critical-threshold, --warn-threshold, ROCm-friendly default env (HSA_FORCE_FINE_GRAIN_PCIE, GPU_DEVICE_ORDINAL, etc.)
- Schema: RCCL_TESTS in HealthCheckName; feature flag disable_rccl_tests
- Register check_rccl in health_checks CLI and checks __init__
- Tests: test_check_rccl.py (get_avg_bus_bw, process_rccl_test_output, check_rccl success/failure/exception, get_hosts); killswitch test
- Docs: check-rccl section in health_checks README; config.toml example
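The check compares a bandwidth figure parsed from the rccl-tests output against the warn/critical thresholds. A minimal sketch of the `get_avg_bus_bw` step exercised by the tests above, assuming rccl-tests emits the same "Avg bus bandwidth" summary line format as nccl-tests:

```python
import re

def get_avg_bus_bw(output: str) -> float:
    """Parse the average bus bandwidth (GB/s) from rccl-tests output.

    Assumes the nccl-tests-style summary line, e.g.:
        # Avg bus bandwidth    : 42.5
    """
    match = re.search(r"Avg bus bandwidth\s*:\s*([0-9.]+)", output)
    if match is None:
        raise ValueError("no 'Avg bus bandwidth' line in rccl-tests output")
    return float(match.group(1))
```

The returned value would then be compared against --warn-threshold and --critical-threshold to decide the health-check status.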