Convert memory values from MB to bytes before writing to Scuba by Yash0270 · Pull Request #71 · facebookresearch/gcm

Yash0270 · 2026-02-28T00:37:59Z

Summary:
A user reported jobs not starting despite nodes appearing idle
(issue "Jobs are not starting for more than 2 hours, while nodes are idle in FAIR AWS fair-sc (aka AWS H100/H200 DSS-3/4) Cluster", filed by mahnerak).
During debugging, we discovered that Scuba displays memory values incorrectly
for FAIR HPC cluster data. Slurm reports memory in MB, and GCM writes these
raw MB integers to Scuba without conversion. Scuba's SI-prefix auto-formatter
then displays 1,524,000 (MB) as "1.524 M", which reads as 1.524 megabytes
when it's actually 1.524 TB (1,524,000 MB). This makes investigating node
memory availability confusing and error-prone.

What's New

| Component | Description |
|-----------|-------------|
| `mb_to_bytes` | New helper in parsing.py to convert MB integers to bytes |
| `convert_memory_to_bytes` | New helper that converts suffixed strings (e.g., "60G") directly to bytes |
| `maybe_int_mb_to_bytes` | Parser for sinfo MEMORY/FREE_MEM fields (int MB -> bytes) |
| `sinfo_node.py` | MEMORY and FREE_MEM now stored in bytes |
| `squeue.py` | MIN_MEMORY and TRES_MEM_ALLOCATED now stored in bytes |
| `scontrol.py` | TresMEM now stored in bytes (billing weights unchanged) |
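
For illustration, the two integer helpers named in the table could look roughly like the sketch below. The function names come from the PR description; the bodies, the `BYTES_PER_MB` constant, and the error handling are assumptions, not the code in parsing.py.

```python
from typing import Optional

# Assumption: decimal megabytes, following the "MB * 1,000,000" design decision.
BYTES_PER_MB = 1_000_000


def mb_to_bytes(mb: int) -> int:
    """Convert an integer number of megabytes to bytes."""
    return mb * BYTES_PER_MB


def maybe_int_mb_to_bytes(v: str) -> Optional[int]:
    """Parse an integer MB field; return None when it is not an integer (e.g. "N/A")."""
    try:
        return mb_to_bytes(int(v.strip()))
    except ValueError:
        return None
```

With these definitions, `maybe_int_mb_to_bytes("1524000")` yields 1,524,000,000,000 and a non-numeric field such as `"N/A"` yields `None`.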

Key Design Decisions

- **Field-level conversion over parse_value_from_tres change**: parse_value_from_tres
  is shared between actual memory amounts (TRES totals, allocated) and billing weight
  coefficients (TresBillingWeightMEM). Converting at the field level avoids corrupting
  billing weights, which are dimensionless coefficients, not memory amounts.
- **MB * 1,000,000 (decimal)**: Matches Scuba's SI-prefix formatter, which uses
  powers of 10 (K=10^3, M=10^6, G=10^9, T=10^12).
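
The field-level approach can be sketched as follows. This is a simplified illustration: `shared_parse`, `FIELD_CONVERTERS`, and `convert_row` are hypothetical names, and `shared_parse` only stands in for a shared parser such as parse_value_from_tres. The point is that the shared parser stays unit-agnostic while only fields holding real memory amounts are scaled.

```python
from typing import Callable, Dict

BYTES_PER_MB = 1_000_000  # decimal, per the second design decision


def shared_parse(raw: str) -> float:
    """Hypothetical stand-in for a shared parser that returns a bare number."""
    return float(raw)


def mb_to_bytes(mb: float) -> int:
    return int(mb * BYTES_PER_MB)


# Conversion is chosen per field, not inside the shared parser.
FIELD_CONVERTERS: Dict[str, Callable[[str], float]] = {
    "TresMEM": lambda raw: mb_to_bytes(shared_parse(raw)),  # memory amount
    "TresBillingWeightMEM": shared_parse,  # dimensionless coefficient, unchanged
}


def convert_row(row: Dict[str, str]) -> Dict[str, float]:
    """Apply each field's converter to its raw string value."""
    return {name: FIELD_CONVERTERS[name](raw) for name, raw in row.items()}
```

Had the multiplication lived inside the shared parser, a billing weight of 0.25 would have been scaled to 250,000, which is the corruption the PR avoids.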

After Fix

| Value (bytes) | Scuba displays | Meaning |
|---|---|---|
| 500,000,000,000 | "500 G" | 500 GB |
| 1,524,000,000,000 | "1.524 T" | 1.524 TB |
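
The before/after behaviour can be reproduced with a small stand-in for an SI-prefix formatter. `si_format` below is a hypothetical approximation written for this example; it is not Scuba's actual formatter, only a function that uses the same powers of 10.

```python
def si_format(value: int) -> str:
    """Format a number with a decimal SI prefix (K=10^3, M=10^6, G=10^9, T=10^12)."""
    for factor, prefix in ((10**12, "T"), (10**9, "G"), (10**6, "M"), (10**3, "K")):
        if value >= factor:
            return f"{value / factor:g} {prefix}"
    return str(value)


raw_mb = 1_524_000                    # what Slurm reports, in MB
as_bytes = raw_mb * 1_000_000         # what is written after the fix

before = si_format(raw_mb)            # "1.524 M": looks like megabytes
after = si_format(as_bytes)           # "1.524 T": reads correctly as terabytes
```

Because the stored unit is now bytes, the prefix the formatter picks matches the real magnitude of the value.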

Reviewed By: luccabb, yonglimeta

Differential Revision: D94605913

github-actions · 2026-02-28T00:38:10Z

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
|---|---|
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
|---|---|---|
| `/metaci tests` | Runs Meta internal integration tests (pytest) | Yes: a maintainer must trigger the command and approve the deployment request |
| `/metaci integration tests` | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

meta-codesync · 2026-02-28T00:38:25Z

@Yash0270 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D94605913.


giongto35 · 2026-03-02T03:02:56Z

gcm/monitoring/slurm/client.py

                data[name] = int(match.group(1))
            else:
                data[name] = None



Why don't we add "slurmctld_thread_cpu" to sdiag here, to keep collect_slurm concise?

giongto35 · 2026-03-02T03:10:20Z

gcm/monitoring/slurm/parsing.py

+    return int(float(value[:-1]) * multipliers[suffix])
+
+
+def maybe_parse_memory_to_bytes(v: str) -> int | None:


You can use the decorator from line 74, `@log_error(name, return_on_error=0)`, so there is no need to import logger.

giongto35 · 2026-03-02T03:16:55Z

gcm/monitoring/slurm/client.py

+            output = subprocess.check_output(
+                ["scontrol", "show", "config"], text=True
+            )


I don't think it's a good pattern to make a subprocess call from inside another call and then also run a regex search on the output. It adds a lot of complexity to sdiag_structured.
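
One common way to address this kind of concern is to keep the subprocess call and the regex parsing in separate functions, so the parsing stays pure and testable. The sketch below is only an illustration under assumptions: `run_scontrol_show_config`, `parse_config_int`, and `get_config_int` are hypothetical names, and the `Key = value` line layout is assumed from typical `scontrol show config` output rather than taken from this PR.

```python
import re
import subprocess
from typing import Callable, Optional


def run_scontrol_show_config() -> str:
    """I/O only: run `scontrol show config` and return its text output."""
    return subprocess.check_output(["scontrol", "show", "config"], text=True)


def parse_config_int(output: str, key: str) -> Optional[int]:
    """Pure parsing: extract an integer from a `Key = value` line, or None."""
    pattern = rf"^{re.escape(key)}\s*=\s*(\d+)\s*$"
    match = re.search(pattern, output, re.MULTILINE)
    return int(match.group(1)) if match else None


def get_config_int(
    key: str, fetch: Callable[[], str] = run_scontrol_show_config
) -> Optional[int]:
    """Thin glue: the fetcher is injectable, so callers and tests can swap it."""
    return parse_config_int(fetch(), key)
```

In a test, `fetch` can be replaced with a function returning canned text, so nothing needs a live Slurm controller.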

Yash0270 · 2026-03-02T18:31:26Z

Thanks for the comments, @giongto35. I pushed an additional, incomplete change as part of this PR. Correcting it now.


Summary:
Adds `parse_memory_to_bytes` as a single robust entry point for all Slurm memory parsing, handling plain integers (assumed MB per Slurm default), suffixed strings (M/G/T/P), and N/A values. This addresses reviewer feedback that the parsing would break if Slurm switched away from MB.

What's New

| Component | Description |
|-----------|-------------|
| `parse_memory_to_bytes` | New robust parser: plain ints (MB), suffixed (M/G/T/P), zero |
| `maybe_parse_memory_to_bytes` | Nullable wrapper returning None for N/A/empty/invalid |
| `sinfo_node.py` | FREE_MEM/MEMORY use `maybe_parse_memory_to_bytes` instead of `maybe_int_mb_to_bytes` |
| `squeue.py` | MIN_MEMORY uses `maybe_parse_memory_to_bytes` instead of `convert_memory_to_mb` |
| `convert_memory_to_mb` | Fixed plain-digit branch: treats unsuffixed values as MB (was incorrectly treating them as bytes) |

Key Design Decisions

- **Single entry point**: All memory parsing goes through `parse_memory_to_bytes`, so format changes only need fixing in one place.
- **Plain int = MB**: Slurm's default unit is MB; unsuffixed integers are treated as MB, not bytes.
- **Kept `convert_memory_to_mb`**: Still used by the TRES path (`parse_value_from_tres`); its plain-digit branch was fixed for consistency.

Related

- Reviewer comments: D94605913 (luccabb)
- Parent diff: D94607438

Reviewed By: luccabb, yonglimeta

Differential Revision: D94605913
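
A minimal sketch of what such a single entry point might look like is given below. The two function names and the `int(float(value[:-1]) * multipliers[suffix])` expression appear in this PR; everything else is an assumption for illustration, including the decimal multiplier values, the upper-casing of the suffix, and the use of `ValueError` for invalid input.

```python
from typing import Optional

# Assumption: decimal multipliers, consistent with the MB * 1,000,000 decision.
MULTIPLIERS = {"M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15}


def parse_memory_to_bytes(value: str) -> int:
    """Parse a Slurm memory string ("60G", "1524000", "0") into bytes."""
    value = value.strip()
    if not value:
        raise ValueError("empty memory value")
    suffix = value[-1].upper()
    if suffix in MULTIPLIERS:
        return int(float(value[:-1]) * MULTIPLIERS[suffix])
    # No recognised suffix: treat as MB, Slurm's default unit.
    return int(value) * MULTIPLIERS["M"]


def maybe_parse_memory_to_bytes(v: str) -> Optional[int]:
    """Nullable wrapper: return None for N/A, empty, or otherwise invalid input."""
    try:
        return parse_memory_to_bytes(v)
    except ValueError:
        return None
```

Routing every field through one function means that a change in Slurm's output format would only need to be handled here, which is the stated motivation for the single entry point.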

luccabb

Review automatically exported from Phabricator review in Meta.

Yash0270 requested review from calebho, giongto35, jj10306 and luccabb as code owners on February 28, 2026 00:38

meta-cla bot added the "cla signed" label on Feb 28, 2026

meta-codesync bot added the "fb-exported" and "meta-exported" labels on Feb 28, 2026

Yash0270 force-pushed the export-D94605913 branch from 68a2df3 to 4598b72 on February 28, 2026 00:41

giongto35 reviewed on Mar 2, 2026

Yash0270 marked this pull request as draft on March 2, 2026 18:30

Yash0270 force-pushed the export-D94605913 branch from 4598b72 to 02693b9 on March 2, 2026 18:48

Yash0270 force-pushed the export-D94605913 branch from 02693b9 to a8653ce on March 4, 2026 04:56

luccabb approved these changes on Mar 4, 2026

Yash0270 wants to merge 1 commit into facebookresearch:main from Yash0270:export-D94605913
