Convert memory values from MB to bytes before writing to Scuba#71
Convert memory values from MB to bytes before writing to Scuba#71Yash0270 wants to merge 1 commit intofacebookresearch:mainfrom
Conversation
CI CommandsThe following CI workflows run automatically on every push and pull request:
The following commands can be used by maintainers to trigger additional tests that require access to secrets:
|
…ookresearch#71) Summary: Pull Request resolved: facebookresearch#71 A user reported jobs not starting despite nodes appearing idle (Jobs are not starting for more than 2 hours, while nodes are idle in FAIR AWS fair-sc (aka AWS H100/H200 DSS-3/4) Cluster by mahnerak). During debugging, we discovered that Scuba displays memory values incorrectly for FAIR HPC cluster data. Slurm reports memory in MB, and GCM writes these raw MB integers to Scuba without conversion. Scuba's SI-prefix auto-formatter then displays 1,524,000 (MB) as "1.524 M", which reads as 1.524 megabytes when it's actually 1.524 TB (1,524,000 MB). This makes investigating node memory availability confusing and error-prone. ## What's New | Component | Description | |-----------|-------------| | `mb_to_bytes` | New helper in parsing.py to convert MB integers to bytes | | `convert_memory_to_bytes` | New helper that converts suffixed strings (e.g., "60G") directly to bytes | | `maybe_int_mb_to_bytes` | Parser for sinfo MEMORY/FREE_MEM fields (int MB -> bytes) | | `sinfo_node.py` | MEMORY and FREE_MEM now stored in bytes | | `squeue.py` | MIN_MEMORY and TRES_MEM_ALLOCATED now stored in bytes | | `scontrol.py` | TresMEM now stored in bytes (billing weights unchanged) | ## Key Design Decisions - **Field-level conversion over parse_value_from_tres change**: parse_value_from_tres is shared between actual memory amounts (TRES totals, allocated) and billing weight coefficients (TresBillingWeightMEM). Converting at the field level avoids corrupting billing weights, which are dimensionless coefficients, not memory amounts. - **MB * 1,000,000 (decimal)**: Matches Scuba's SI-prefix formatter which uses powers of 10 (K=10^3, M=10^6, G=10^9, T=10^12). ## After Fix | Value (bytes) | Scuba displays | Meaning | |---|---|---| | 500,000,000,000 | "500 G" | 500 GB | | 1,524,000,000,000 | "1.524 T" | 1.524 TB | Reviewed By: luccabb, yonglimeta Differential Revision: D94605913
68a2df3 to
4598b72
Compare
| data[name] = int(match.group(1)) | ||
| else: | ||
| data[name] = None | ||
|
|
There was a problem hiding this comment.
why dont we add "slurmctld_thread_cpu" to sdiag here and keep the collect_slurm concise.
| return int(float(value[:-1]) * multipliers[suffix]) | ||
|
|
||
|
|
||
| def maybe_parse_memory_to_bytes(v: str) -> int | None: |
There was a problem hiding this comment.
can use this decorator from 74 @log_error(name, return_on_error=0), so no need to import logger
gcm/monitoring/slurm/client.py
Outdated
| output = subprocess.check_output( | ||
| ["scontrol", "show", "config"], text=True | ||
| ) |
There was a problem hiding this comment.
Don't feel it's good pattern to subprocess call from inside a call, then also do a regex search. It adds a lot of complexity to sdiag_structured.
|
Thanks for the comments @giongto35 , I pushed an additional incomplete PR in this change. Correcting it now. |
4598b72 to
02693b9
Compare
…ookresearch#71) Summary: Pull Request resolved: facebookresearch#71 A user reported jobs not starting despite nodes appearing idle (Jobs are not starting for more than 2 hours, while nodes are idle in FAIR AWS fair-sc (aka AWS H100/H200 DSS-3/4) Cluster by mahnerak). During debugging, we discovered that Scuba displays memory values incorrectly for FAIR HPC cluster data. Slurm reports memory in MB, and GCM writes these raw MB integers to Scuba without conversion. Scuba's SI-prefix auto-formatter then displays 1,524,000 (MB) as "1.524 M", which reads as 1.524 megabytes when it's actually 1.524 TB (1,524,000 MB). This makes investigating node memory availability confusing and error-prone. ## What's New | Component | Description | |-----------|-------------| | `mb_to_bytes` | New helper in parsing.py to convert MB integers to bytes | | `convert_memory_to_bytes` | New helper that converts suffixed strings (e.g., "60G") directly to bytes | | `maybe_int_mb_to_bytes` | Parser for sinfo MEMORY/FREE_MEM fields (int MB -> bytes) | | `sinfo_node.py` | MEMORY and FREE_MEM now stored in bytes | | `squeue.py` | MIN_MEMORY and TRES_MEM_ALLOCATED now stored in bytes | | `scontrol.py` | TresMEM now stored in bytes (billing weights unchanged) | ## Key Design Decisions - **Field-level conversion over parse_value_from_tres change**: parse_value_from_tres is shared between actual memory amounts (TRES totals, allocated) and billing weight coefficients (TresBillingWeightMEM). Converting at the field level avoids corrupting billing weights, which are dimensionless coefficients, not memory amounts. - **MB * 1,000,000 (decimal)**: Matches Scuba's SI-prefix formatter which uses powers of 10 (K=10^3, M=10^6, G=10^9, T=10^12). ## After Fix | Value (bytes) | Scuba displays | Meaning | |---|---|---| | 500,000,000,000 | "500 G" | 500 GB | | 1,524,000,000,000 | "1.524 T" | 1.524 TB | Reviewed By: luccabb, yonglimeta Differential Revision: D94605913
Summary: Adds `parse_memory_to_bytes` as a single robust entry point for all Slurm memory parsing, handling plain integers (assumed MB per Slurm default), suffixed strings (M/G/T/P), and N/A values. This addresses reviewer feedback that if Slurm switches away from MB we break. ## What's New | Component | Description | |-----------|-------------| | `parse_memory_to_bytes` | New robust parser: plain ints (MB), suffixed (M/G/T/P), zero | | `maybe_parse_memory_to_bytes` | Nullable wrapper returning None for N/A/empty/invalid | | `sinfo_node.py` | FREE_MEM/MEMORY use `maybe_parse_memory_to_bytes` instead of `maybe_int_mb_to_bytes` | | `squeue.py` | MIN_MEMORY uses `maybe_parse_memory_to_bytes` instead of `convert_memory_to_mb` | | `convert_memory_to_mb` | Fixed plain-digit branch: treats unsuffixed values as MB (was incorrectly treating as bytes) | ## Key Design Decisions - **Single entry point**: All memory parsing goes through `parse_memory_to_bytes` so format changes only need fixing in one place - **Plain int = MB**: Slurm's default unit is MB; unsuffixed integers are treated as MB, not bytes - **Kept `convert_memory_to_mb`**: Still used by TRES path (`parse_value_from_tres`); fixed its plain-digit branch for consistency ## Related - Reviewer comments: D94605913 (luccab) - Parent diff: D94607438 Reviewed By: luccabb, yonglimeta Differential Revision: D94605913
02693b9 to
a8653ce
Compare
luccabb
left a comment
There was a problem hiding this comment.
Review automatically exported from Phabricator review in Meta.
Summary:
A user reported jobs not starting despite nodes appearing idle
(Jobs are not starting for more than 2 hours, while nodes are idle in FAIR AWS fair-sc (aka AWS H100/H200 DSS-3/4) Cluster by mahnerak).
During debugging, we discovered that Scuba displays memory values incorrectly
for FAIR HPC cluster data. Slurm reports memory in MB, and GCM writes these
raw MB integers to Scuba without conversion. Scuba's SI-prefix auto-formatter
then displays 1,524,000 (MB) as "1.524 M", which reads as 1.524 megabytes
when it's actually 1.524 TB (1,524,000 MB). This makes investigating node
memory availability confusing and error-prone.
What's New
mb_to_bytesconvert_memory_to_bytesmaybe_int_mb_to_bytessinfo_node.pysqueue.pyscontrol.pyKey Design Decisions
is shared between actual memory amounts (TRES totals, allocated) and billing weight
coefficients (TresBillingWeightMEM). Converting at the field level avoids corrupting
billing weights, which are dimensionless coefficients, not memory amounts.
powers of 10 (K=10^3, M=10^6, G=10^9, T=10^12).
After Fix
Reviewed By: luccabb, yonglimeta
Differential Revision: D94605913