===========================KNOWN ISSUES===========================
28/3/2018: Jobs can collide when trying to access the waitlist, so that
           no job realizes it is its turn; the colliding jobs then idle
           until their walltime runs out, at which point no cleanup is
           performed and the crawler eventually dies.
           Urgency: CRITICAL
           Proposed Solutions: Use random wait times to stagger the
           waitlist.crwl access attempts, create a waitlist folder and a
           second executable to handle batch cleanup, or build cleanup
           into the triage module.
           Status: MOSTLY FIXED
                   (A waitlist folder with unique tokens has been
                   implemented, and codes that submit very large numbers
                   of jobs (SBDART) are encouraged to use a master-slave
                   hierarchy in which only some jobs attempt to execute
                   the main program. Collisions nonetheless still seem to
                   occur occasionally when huge numbers of jobs are
                   running. This is not an issue with larger codes like
                   PlaSim. A sketch of the token approach follows this
                   entry.)
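
           A minimal sketch of the unique-token waitlist idea, assuming
           Python and hypothetical names (WAITLIST_DIR, join_waitlist,
           wait_for_turn); this is an illustration, not the crawler's
           actual code:

               import os
               import random
               import time
               import uuid

               WAITLIST_DIR = "waitlist"   # assumed shared waitlist folder

               def join_waitlist(job_name):
                   """Register this job by writing a uniquely named token file."""
                   os.makedirs(WAITLIST_DIR, exist_ok=True)
                   token = "%s_%s.crwl" % (job_name, uuid.uuid4().hex)
                   with open(os.path.join(WAITLIST_DIR, token), "w") as f:
                       f.write(job_name + "\n")
                   return token

               def wait_for_turn(token):
                   """Poll until this job's token is the oldest in the waitlist."""
                   while True:
                       # Random stagger so many jobs do not hit the folder at once.
                       time.sleep(random.uniform(1.0, 10.0))
                       try:
                           tokens = sorted(
                               os.listdir(WAITLIST_DIR),
                               key=lambda t: os.path.getmtime(
                                   os.path.join(WAITLIST_DIR, t)))
                       except FileNotFoundError:
                           continue  # a token vanished mid-scan; just retry
                       if tokens and tokens[0] == token:
                           return  # our turn; the caller removes the token when done

           Because each token file has a unique name, two jobs can never
           overwrite each other's waitlist entry, and the random sleep
           spreads out directory scans so simultaneous submissions are
           less likely to race.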
28/3/2018: For reasons not yet understood, running_<model>.crwl files
           sometimes get wiped, resulting in collisions within job
           folders as jobs are allocated the same workspace.
           Urgency: IMPORTANT
           Proposed Solutions: Find out what is causing this and fix it,
           or replace the running_*.crwl files with a routine that
           actually checks the running jobs and reads their job.crwl or
           job.npy files to see which model allocation they belong to.
           This would probably also be part of the triage module.
           Status: FIXED
                   (running_<model>.crwl files removed; the program now
                   directly checks which jobs are running/queued and
                   determines which folders they are using from
                   system-provided information. There should now be no
                   possibility of collision. A sketch of this kind of
                   check follows this entry.)
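
           A minimal sketch of querying the scheduler directly instead of
           trusting running_<model>.crwl files, assuming a SLURM system
           (squeue) and hypothetical helper names (occupied_workdirs,
           pick_free_folder); the real crawler may use a different queue
           system and different calls:

               import getpass
               import subprocess

               def occupied_workdirs():
                   """Return workdirs used by this user's running/queued jobs."""
                   user = getpass.getuser()
                   # %i = job id, %Z = working directory (squeue format codes).
                   out = subprocess.run(
                       ["squeue", "-h", "-u", user, "-o", "%i %Z"],
                       capture_output=True, text=True, check=True).stdout
                   dirs = set()
                   for line in out.splitlines():
                       parts = line.split()
                       if len(parts) >= 2:
                           dirs.add(parts[1])
                   return dirs

               def pick_free_folder(candidates):
                   """Pick a job folder no running or queued job already occupies."""
                   taken = occupied_workdirs()
                   for folder in candidates:
                       if folder not in taken:
                           return folder
                   return None

           Because the occupied folders are read from the scheduler itself
           rather than from a bookkeeping file the jobs maintain, a wiped
           file can no longer cause two jobs to be handed the same
           workspace.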