Conversation
skalyan left a comment:
What is the motivation for this change? Did we have any user requests?
Is the idea to simply hide Slurm env variables and add a wrapper?
@skalyan yeah, this is to enable: facebookresearch/matrix#105 (comment)
it gives the info for Slurm or local nodes
Diff context:

```python
            return int(os.environ.get("SLURM_CPUS_ON_NODE", 1))
        return int(max(os.cpu_count() or 0, 1))

    @lru_cache(maxsize=1)
    def get_gpus(self) -> int:
        if self.is_slurm_job():
            return int(os.environ.get("SLURM_GPUS_ON_NODE", 1))
        return sum(
```
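The final `return sum(` is cut off by the diff view, and the elided expression stays elided here. As a hedged illustration only, a non-Slurm GPU count could be computed by counting NVIDIA device files; this is an assumption about the missing body, not the PR's actual code:

```python
from pathlib import Path

def _local_gpu_count() -> int:
    # Hypothetical completion of the truncated sum(...): count
    # /dev/nvidia0, /dev/nvidia1, ... device files as a proxy for
    # the GPUs present on this machine.
    return sum(1 for _ in Path("/dev").glob("nvidia[0-9]*"))
```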
What is the intent of the ask?
Is it to know how many GPUs/CPUs are "present" on the node or "allocated" to this job?
allocated to the job if it's a Slurm job, otherwise what's present on the node
If I may, I think we should stick to a clear and limited contract for an API: if a given API returns the GPUs allocated to a job, let's stick to that. I don't see the benefit of both the allocated and the provisioned GPU count coming via the same API.
@skalyan I think this is fine for now; it matches how the other methods in this class behave. I'm open to changing position here as we see how it gets used.
Summary
Exposing a way to get the number of CPUs and GPUs from job information. Defaults to the local host if not running on a Slurm cluster.
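For readers without the full diff, a minimal self-contained sketch of the contract under discussion. Only `get_gpus`, `is_slurm_job`, and the `SLURM_*_ON_NODE` reads mirror the visible diff; `ClusterInfo`, the `SLURM_JOB_ID` check, and the `/dev/nvidia*` counting are placeholder assumptions:

```python
import os
from functools import lru_cache
from pathlib import Path


class ClusterInfo:  # placeholder name; the PR's real class is not shown here
    def is_slurm_job(self) -> bool:
        # Assumption: Slurm exports SLURM_JOB_ID for launched jobs.
        return "SLURM_JOB_ID" in os.environ

    @lru_cache(maxsize=1)
    def get_cpus(self) -> int:
        # CPUs allocated to this job under Slurm, else CPUs on the local host.
        if self.is_slurm_job():
            return int(os.environ.get("SLURM_CPUS_ON_NODE", 1))
        return int(max(os.cpu_count() or 0, 1))

    @lru_cache(maxsize=1)
    def get_gpus(self) -> int:
        # GPUs allocated to this job under Slurm, else GPUs on the local host.
        if self.is_slurm_job():
            return int(os.environ.get("SLURM_GPUS_ON_NODE", 1))
        # Assumption: the truncated sum(...) counts NVIDIA device files.
        return sum(1 for _ in Path("/dev").glob("nvidia[0-9]*"))
```

One design note on the visible diff: `functools.lru_cache` on an instance method holds a reference to `self`, which with `maxsize=1` is typically harmless for a long-lived singleton, presumably how this class is used.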
Test Plan
Works on Slurm and locally.
Local node with GPUs:
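(The captured output is truncated above.) A check along these lines, run once inside a Slurm allocation and once on a local GPU machine, would exercise both branches; `ClusterInfo` is the placeholder name from the sketch above:

```python
info = ClusterInfo()
print("slurm job:", info.is_slurm_job())
print("cpus:", info.get_cpus())  # Slurm: SLURM_CPUS_ON_NODE; local: os.cpu_count()
print("gpus:", info.get_gpus())  # Slurm: SLURM_GPUS_ON_NODE; local: device count
```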