===========================KNOWN ISSUES===========================
28/3/2018: Jobs can collide when trying to access the waitlist, so that
           no job realizes it is its turn; the colliding jobs then idle
           until their walltime runs out, at which point no cleanup is
           performed and the crawler eventually dies.
           Urgency: CRITICAL
           Proposed Solutions: Use random wait times to stagger the
           waitlist.crwl access attempts, create a waitlist folder and a
           second executable to handle batch cleanup, or build cleanup
           into the triage module.
           Status: MOSTLY FIXED
                   (A waitlist folder with unique tokens has been
                   implemented, and codes that submit very large numbers
                   of jobs (SBDART) are encouraged to use a master-slave
                   hierarchy in which only some jobs attempt to execute
                   the main program. Collisions nonetheless still seem to
                   occur occasionally when huge numbers of jobs are
                   running. This is not an issue with larger codes like
                   PlaSim. A sketch of the token approach follows this
                   entry.)
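
           A minimal sketch of the unique-token waitlist idea, assuming
           Python and hypothetical names (WAITLIST_DIR, join_waitlist,
           wait_for_turn); this is an illustration, not the crawler's
           actual code:

               import os
               import random
               import time
               import uuid

               WAITLIST_DIR = "waitlist"   # assumed shared waitlist folder

               def join_waitlist(job_name):
                   """Register this job by writing a uniquely named token file."""
                   os.makedirs(WAITLIST_DIR, exist_ok=True)
                   token = "%s_%s.crwl" % (job_name, uuid.uuid4().hex)
                   with open(os.path.join(WAITLIST_DIR, token), "w") as f:
                       f.write(job_name + "\n")
                   return token

               def wait_for_turn(token):
                   """Poll until this job's token is the oldest in the waitlist."""
                   while True:
                       # Random stagger so many jobs do not hit the folder at once.
                       time.sleep(random.uniform(1.0, 10.0))
                       try:
                           tokens = sorted(
                               os.listdir(WAITLIST_DIR),
                               key=lambda t: os.path.getmtime(
                                   os.path.join(WAITLIST_DIR, t)))
                       except FileNotFoundError:
                           continue  # a token vanished mid-scan; just retry
                       if tokens and tokens[0] == token:
                           return  # our turn; the caller removes the token when done

           Because each token file has a unique name, two jobs can never
           overwrite each other's waitlist entry, and the random sleep
           spreads out directory scans so simultaneous submissions are
           less likely to race.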
28/3/2018: For reasons not yet understood, running_<model>.crwl files
           sometimes get wiped, resulting in collisions within job
           folders as jobs are allocated the same workspace.
           Urgency: IMPORTANT
           Proposed Solutions: Find out what is causing this and fix it,
           or replace the running_*.crwl files with a routine that
           actually checks the running jobs and reads their job.crwl or
           job.npy files to see which model allocation they belong to.
           This would probably also be part of the triage module.
           Status: FIXED
                   (running_<model>.crwl files removed; the program now
                   directly checks which jobs are running/queued and
                   determines which folders they are using from
                   system-provided information. There should now be no
                   possibility of collision. A sketch of this kind of
                   check follows this entry.)
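
           A minimal sketch of querying the scheduler directly instead of
           trusting running_<model>.crwl files, assuming a SLURM system
           (squeue) and hypothetical helper names (occupied_workdirs,
           pick_free_folder); the real crawler may use a different queue
           system and different calls:

               import getpass
               import subprocess

               def occupied_workdirs():
                   """Return workdirs used by this user's running/queued jobs."""
                   user = getpass.getuser()
                   # %i = job id, %Z = working directory (squeue format codes).
                   out = subprocess.run(
                       ["squeue", "-h", "-u", user, "-o", "%i %Z"],
                       capture_output=True, text=True, check=True).stdout
                   dirs = set()
                   for line in out.splitlines():
                       parts = line.split()
                       if len(parts) >= 2:
                           dirs.add(parts[1])
                   return dirs

               def pick_free_folder(candidates):
                   """Pick a job folder no running or queued job already occupies."""
                   taken = occupied_workdirs()
                   for folder in candidates:
                       if folder not in taken:
                           return folder
                   return None

           Because the occupied folders are read from the scheduler itself
           rather than from a bookkeeping file the jobs maintain, a wiped
           file can no longer cause two jobs to be handed the same
           workspace.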