
report local resources to identify reason for job failures #15

@sillitoe

Description


Jobs may fail for a bunch of reasons:

  • running out of disk space /dev/shm
  • running out of memory (eg tmem=7G)
  • running out of time (eg h_rt=1:0:0)

It would be useful if we could identify which of these cases has happened so the job can be rerun with different resources (eg request more memory, time, etc).

Ideally, SGE would record why and when it kills a job on hitting a hard limit. It looks like this is a slightly murky area (qacct seems to do this, but it's generally turned off; myriad has jobhist, but that's a custom solution). You can see what resources a job is currently using with qstat -j <job_id>; however, this info disappears once the job has finished.

One possibility is to have the running job regularly append its resource usage to a log file:

# log the hard resource limits once, then poll usage every 10 seconds
echo $(date) ': ' $(qstat -j $JOB_ID | grep 'resource_list')
while true
do
  echo $(date) ': ' $(qstat -j $JOB_ID | grep 'usage')
  sleep 10
done
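In a submission script, that loop would need to run in the background while the real work proceeds. A minimal sketch of the pattern (the qstat pipeline is stubbed with a bare date call here so the snippet is self-contained, and the intervals are illustrative):

```shell
#!/bin/bash
# Sketch: run the usage monitor in the background alongside the job payload.
# The qstat pipeline is stubbed with `date` so this runs anywhere; in a real
# job script the loop body would be the echo/qstat line above.
LOG="usage.log"
rm -f "$LOG"

(
  while true; do
    date >> "$LOG"     # real script: echo $(date) ': ' $(qstat -j $JOB_ID | grep 'usage')
    sleep 1
  done
) &
MONITOR_PID=$!

sleep 3                # stands in for the actual job payload

kill "$MONITOR_PID"
wait "$MONITOR_PID" 2>/dev/null || true
lines=$(wc -l < "$LOG")
echo "captured $lines usage snapshots"
```

Killing the monitor at the end matters; otherwise the loop outlives the job and keeps polling until the scheduler cleans it up.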

This outputs something sensible on the CS cluster:

Fri Aug 3 16:31:15 BST 2018 :  hard resource_list: h_vmem=2G,tmem=2G,h_rt=600
Fri Aug 3 16:31:23 BST 2018 :  usage 1: cpu=00:00:00, mem=0.00633 GBs, io=0.00498, vmem=108.043M, maxvmem=108.043M
Fri Aug 3 16:32:40 BST 2018 :  usage 1: cpu=00:00:00, mem=0.00949 GBs, io=0.00770, vmem=108.043M, maxvmem=108.043M
Fri Aug 3 16:32:51 BST 2018 :  usage 1: cpu=00:00:00, mem=0.01054 GBs, io=0.00909, vmem=221.484M, maxvmem=221.484M
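Given a log like this, a post-mortem check could compare the last recorded maxvmem against the requested h_vmem to guess whether the job died for memory. A rough sketch (the sample log lines and the 90% threshold are illustrative, not part of any SGE interface):

```shell
#!/bin/bash
# Sketch: guess from a usage log whether a job was probably killed for memory.
# The heredoc stands in for the log file written by the monitor loop.
verdict=$(awk '
  # convert 108.043M / 1.901G / 2G style values to bytes
  function to_bytes(s,  n, u) {
    n = s + 0; u = substr(s, length(s), 1)
    if (u == "G") n *= 1073741824
    else if (u == "M") n *= 1048576
    else if (u == "K") n *= 1024
    return n
  }
  /resource_list/ { if (match($0, /h_vmem=[^,]+/))  limit = to_bytes(substr($0, RSTART + 7, RLENGTH - 7)) }
  /maxvmem=/      { if (match($0, /maxvmem=[^,]+/)) peak  = to_bytes(substr($0, RSTART + 8, RLENGTH - 8)) }
  END {
    if (limit > 0 && peak > 0.9 * limit) print "likely killed for memory"
    else                                 print "memory usage within limits"
  }
' <<'EOF'
hard resource_list: h_vmem=2G,tmem=2G,h_rt=600
usage 1: cpu=00:00:00, mem=0.00633 GBs, io=0.00498, vmem=108.043M, maxvmem=108.043M
usage 1: cpu=00:00:10, mem=0.01054 GBs, io=0.00909, vmem=1.901G, maxvmem=1.901G
EOF
)
echo "$verdict"
```

The same idea would work for h_rt (compare the last timestamp against the submission time), though a job killed for time simply stops logging, so the signal is weaker.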

Less so on myriad:

Fri 3 Aug 16:02:24 BST 2018 :  hard resource_list: maxversion=2,interactive=true,penalty=0,jch=2,bonus=0,memory=7G,jcj=2,h_rt=7200,jci=1
Fri 3 Aug 16:02:24 BST 2018 :  usage 1: cpu=01:10:45, mem=0.00000 GB s, io=0.63630 GB, vmem=N/A, maxvmem=N/A
Fri 3 Aug 16:02:34 BST 2018 :  usage 1: cpu=01:10:45, mem=0.00000 GB s, io=0.63630 GB, vmem=N/A, maxvmem=N/A

Apparently myriad uses cgroups to limit memory usage. The relevant figures seem to be available here:

$ cat /sys/fs/cgroup/memory/UCL/$JOB_ID.undefined/memory.limit_in_bytes
7516192768
$ cat /sys/fs/cgroup/memory/UCL/$JOB_ID.undefined/memory.usage_in_bytes
103129088
$ cat /sys/fs/cgroup/memory/UCL/$JOB_ID.undefined/memory.usage_in_bytes
103161856
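Where those files do exist, a quick check could report usage as a fraction of the limit. A sketch using the sample readings above in place of the live cat calls:

```shell
#!/bin/bash
# Sketch: report cgroup memory usage as a percentage of the cgroup limit.
# On myriad these values would be read from the memory.* files above; the
# numbers here are the sample readings so the snippet is self-contained.
limit=7516192768   # cat .../memory.limit_in_bytes
usage=103129088    # cat .../memory.usage_in_bytes

pct=$(awk -v u="$usage" -v l="$limit" 'BEGIN { printf "%.1f", 100 * u / l }')
echo "memory usage: ${pct}% of limit"
```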

However, these files aren't available on the CS cluster(!). Urgh.
