
report local resources to identify reason for job failures #15

@sillitoe

Description


Jobs may fail for a bunch of reasons:

  • running out of disk space /dev/shm
  • running out of memory (eg tmem=7G)
  • running out of time (eg h_rt=1:0:0)

It would be useful if we could identify which of these cases has happened so the job can be rerun with different resources (eg request more memory, time, etc).

Ideally, SGE would record why and when it kills a job on hitting a hard limit. It looks like this is a slightly murky area (qacct seems to do this, but it's generally turned off; myriad has jobhist, but that's a custom solution). You can see what resources a job is currently using with qstat -j <job_id>; however, this info disappears once the job has finished.

One possibility is to have the running job regularly append its resource usage to a log file:

# log the hard resource limits once, then poll usage every 10 seconds
echo $(date) ': ' $(qstat -j $JOB_ID | grep 'resource_list')
while true
do
  echo $(date) ': ' $(qstat -j $JOB_ID | grep 'usage')
  sleep 10
done
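In a submission script, that loop would need to run in the background while the real work proceeds. A minimal sketch of the pattern (the qstat pipeline is stubbed with a bare date call here so the snippet is self-contained, and the intervals are illustrative):

```shell
#!/bin/bash
# Sketch: run the usage monitor in the background alongside the job payload.
# The qstat pipeline is stubbed with `date` so this runs anywhere; in a real
# job script the loop body would be the echo/qstat line above.
LOG="usage.log"
rm -f "$LOG"

(
  while true; do
    date >> "$LOG"     # real script: echo $(date) ': ' $(qstat -j $JOB_ID | grep 'usage')
    sleep 1
  done
) &
MONITOR_PID=$!

sleep 3                # stands in for the actual job payload

kill "$MONITOR_PID"
wait "$MONITOR_PID" 2>/dev/null || true
lines=$(wc -l < "$LOG")
echo "captured $lines usage snapshots"
```

Killing the monitor at the end matters; otherwise the loop outlives the job and keeps polling until the scheduler cleans it up.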

This outputs something sensible on the CS cluster:

Fri Aug 3 16:31:15 BST 2018 :  hard resource_list: h_vmem=2G,tmem=2G,h_rt=600
Fri Aug 3 16:31:23 BST 2018 :  usage 1: cpu=00:00:00, mem=0.00633 GBs, io=0.00498, vmem=108.043M, maxvmem=108.043M
Fri Aug 3 16:32:40 BST 2018 :  usage 1: cpu=00:00:00, mem=0.00949 GBs, io=0.00770, vmem=108.043M, maxvmem=108.043M
Fri Aug 3 16:32:51 BST 2018 :  usage 1: cpu=00:00:00, mem=0.01054 GBs, io=0.00909, vmem=221.484M, maxvmem=221.484M
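Given a log like this, a post-mortem check could compare the last recorded maxvmem against the requested h_vmem to guess whether the job died for memory. A rough sketch (the sample log lines and the 90% threshold are illustrative, not part of any SGE interface):

```shell
#!/bin/bash
# Sketch: guess from a usage log whether a job was probably killed for memory.
# The heredoc stands in for the log file written by the monitor loop.
verdict=$(awk '
  # convert 108.043M / 1.901G / 2G style values to bytes
  function to_bytes(s,  n, u) {
    n = s + 0; u = substr(s, length(s), 1)
    if (u == "G") n *= 1073741824
    else if (u == "M") n *= 1048576
    else if (u == "K") n *= 1024
    return n
  }
  /resource_list/ { if (match($0, /h_vmem=[^,]+/))  limit = to_bytes(substr($0, RSTART + 7, RLENGTH - 7)) }
  /maxvmem=/      { if (match($0, /maxvmem=[^,]+/)) peak  = to_bytes(substr($0, RSTART + 8, RLENGTH - 8)) }
  END {
    if (limit > 0 && peak > 0.9 * limit) print "likely killed for memory"
    else                                 print "memory usage within limits"
  }
' <<'EOF'
hard resource_list: h_vmem=2G,tmem=2G,h_rt=600
usage 1: cpu=00:00:00, mem=0.00633 GBs, io=0.00498, vmem=108.043M, maxvmem=108.043M
usage 1: cpu=00:00:10, mem=0.01054 GBs, io=0.00909, vmem=1.901G, maxvmem=1.901G
EOF
)
echo "$verdict"
```

The same idea would work for h_rt (compare the last timestamp against the submission time), though a job killed for time simply stops logging, so the signal is weaker.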

Less so on myriad:

Fri 3 Aug 16:02:24 BST 2018 :  hard resource_list: maxversion=2,interactive=true,penalty=0,jch=2,bonus=0,memory=7G,jcj=2,h_rt=7200,jci=1
Fri 3 Aug 16:02:24 BST 2018 :  usage 1: cpu=01:10:45, mem=0.00000 GB s, io=0.63630 GB, vmem=N/A, maxvmem=N/A
Fri 3 Aug 16:02:34 BST 2018 :  usage 1: cpu=01:10:45, mem=0.00000 GB s, io=0.63630 GB, vmem=N/A, maxvmem=N/A

Apparently myriad uses cgroups to limit memory usage. The relevant figures seem to be available here:

$ cat /sys/fs/cgroup/memory/UCL/$JOB_ID.undefined/memory.limit_in_bytes
7516192768
$ cat /sys/fs/cgroup/memory/UCL/$JOB_ID.undefined/memory.usage_in_bytes
103129088
$ cat /sys/fs/cgroup/memory/UCL/$JOB_ID.undefined/memory.usage_in_bytes
103161856
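Where those files do exist, a quick check could report usage as a fraction of the limit. A sketch using the sample readings above in place of the live cat calls:

```shell
#!/bin/bash
# Sketch: report cgroup memory usage as a percentage of the cgroup limit.
# On myriad these values would be read from the memory.* files above; the
# numbers here are the sample readings so the snippet is self-contained.
limit=7516192768   # cat .../memory.limit_in_bytes
usage=103129088    # cat .../memory.usage_in_bytes

pct=$(awk -v u="$usage" -v l="$limit" 'BEGIN { printf "%.1f", 100 * u / l }')
echo "memory usage: ${pct}% of limit"
```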

However, these files aren't available on the CS cluster(!). Urgh.
