Has this been shown to work?
Haven't had any issues so far since it was applied.
with_items:
  - { name: kernel.panic, value: 5 }
  - name: kernel.hung_task_timeout_secs
    value: 0
This is definitely going to make the message go away, since it stops the kernel checking for hung tasks. Without the message, we're not going to know if we fixed the underlying issue.
Hmm, well, we can always put it back if we want to debug. It's just quite annoying, as this is part of the reason why nodes won't reboot automatically. By default, we don't want things to hang; this should be opt-in.
But this option only controls whether the kernel reports that tasks have hung; it doesn't affect whether tasks will hang.
I think this is what is preventing automatic reboot though?
It's just for reporting, see https://www.kernel.org/doc/Documentation/sysctl/kernel.txt.
This article suggests you can panic on hung tasks by setting kernel.hung_task_panic=1, from kernel 2.6.35. Then, to make the kernel reboot on panic, you set kernel.panic= (as you have done).
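A rough sketch of what that could look like in the list from this diff. The surrounding task isn't shown here, so I'm assuming each item is passed to a sysctl task as a name/value pair. The hung_task_panic entry is the suggested addition, and 120 seconds is what I understand to be the usual kernel default for the timeout, so treat both as assumptions to check on the nodes:

```yaml
with_items:
  # Already in this PR: reboot 5 seconds after a kernel panic
  - { name: kernel.panic, value: 5 }
  # Assumption: turn a detected hung task into a panic instead of only a log message
  - { name: kernel.hung_task_panic, value: 1 }
  # Assumption: leave hung-task detection enabled (0 disables the check entirely)
  - { name: kernel.hung_task_timeout_secs, value: 120 }
```

With detection left on, a hung task would trigger a panic and kernel.panic would then reboot the node, rather than the warning just being silenced.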
5f7faa8 to b1b77f3
We are repeatedly seeing this kernel panic, and this is probably a solution. It is difficult to replicate, as it happens randomly on different nodes when running Slurm jobs. Let's see if this fixes the problem (this has a good explanation of why this could work: https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/)
Log snippet: