rrdesi_mpi crashes if given too many ranks #340

@sbailey

Description

Example:

$> cd /global/cfs/cdirs/desi/spectro/redux/loa/tiles/cumulative/1000/20210517
$> srun -n 128 rrdesi_mpi -i coadd-0-1000-thru20210517.fits -o $SCRATCH/redrock-test.fits
Running with 128 MPI ranks
Loading targets...
...
Read and broadcast of 11 templates: 0.3 seconds
Creating GPU context: 0.0 seconds
--- Process 114 raised an exception ---
Proc 114: Traceback (most recent call last):
Proc 114:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/redrock/main/py/redrock/external/desi.py", line 1042, in rrdesi
    dtemplates = load_dist_templates(dwave, templates=args.templates,
Proc 114:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/redrock/main/py/redrock/templates.py", line 773, in load_dist_templates
    dtemplate = DistTemplate(t, dwave, mp_procs=mp_procs, comm=comm, use_gpu=use_gpu, gpu_mode=gpu_mode)
Proc 114:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/redrock/main/py/redrock/templates.py", line 584, in __init__
    data = rebin_template(self._template, myz, self._dwave, use_gpu=use_gpu)
Proc 114:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/redrock/main/py/redrock/rebin.py", line 491, in rebin_template
    xmin = template.minwave*(1+myz.max())
Proc 114:   File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/numpy/core/_methods.py", line 40, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial, where)
Proc 114: ValueError: zero-size array to reduction operation maximum which has no identity

MPICH Notice [Rank 114] [job id 41054598.2] [Fri Jul 25 16:47:04 2025] [nid004182] - Abort(0) (rank 114 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 114

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 114
srun: error: nid004182: task 114: Exited with exit code 255
srun: Terminating StepId=41054598.2
slurmstepd: error: *** STEP 41054598.2 ON nid004182 CANCELLED AT 2025-07-25T23:47:04 ***
srun: error: nid004182: tasks 0-113,115-127: Terminated
srun: Force Terminated StepId=41054598.2

Running the same command with srun -n 64 ... works fine, though.
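The traceback ends in myz.max() being called on a zero-size array. One plausible reading is that rank 114 was handed an empty slice of the redshift grid. The sketch below reproduces that failure mode in isolation with plain NumPy. It is not redrock code: split_redshifts and safe_xmin are hypothetical helpers, the 100-point grid and the minwave value are made up, and np.array_split only stands in for however redrock actually distributes redshifts across ranks.

```python
import numpy as np

def split_redshifts(redshifts, nranks):
    """Hypothetical helper: one chunk of the redshift grid per rank."""
    return np.array_split(redshifts, nranks)

def safe_xmin(minwave, myz):
    """Hypothetical guard: skip the reduction when this rank holds no redshifts."""
    if myz.size == 0:
        return None
    return minwave * (1 + myz.max())

# Made-up grid with fewer redshifts (100) than ranks (128).
redshifts = np.linspace(0.0, 1.0, 100)
chunks = split_redshifts(redshifts, 128)

# Ranks that received nothing to work on.
empty_ranks = [rank for rank, chunk in enumerate(chunks) if chunk.size == 0]

# Calling .max() on an empty chunk raises the same ValueError as in the log.
try:
    chunks[114].max()
    message = ""
except ValueError as err:
    message = str(err)

print(len(empty_ranks), "ranks received no redshifts")
print("rank 114:", message)
print("guarded xmin for rank 114:", safe_xmin(3500.0, chunks[114]))
```

If this is what is happening, possible fixes would be a size check like the one in safe_xmin before the reduction, or capping the number of ranks that take part in the template rebinning. Either would need to be checked against the actual distribution logic in templates.py.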