$> cd /global/cfs/cdirs/desi/spectro/redux/loa/tiles/cumulative/1000/20210517
$> srun -n 128 rrdesi_mpi -i coadd-0-1000-thru20210517.fits -o $SCRATCH/redrock-test.fits
Running with 128 MPI ranks
Loading targets...
...
Read and broadcast of 11 templates: 0.3 seconds
Creating GPU context: 0.0 seconds
--- Process 114 raised an exception ---
Proc 114: Traceback (most recent call last):
Proc 114: File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/redrock/main/py/redrock/external/desi.py", line 1042, in rrdesi
dtemplates = load_dist_templates(dwave, templates=args.templates,
Proc 114: File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/redrock/main/py/redrock/templates.py", line 773, in load_dist_templates
dtemplate = DistTemplate(t, dwave, mp_procs=mp_procs, comm=comm, use_gpu=use_gpu, gpu_mode=gpu_mode)
Proc 114: File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/redrock/main/py/redrock/templates.py", line 584, in __init__
data = rebin_template(self._template, myz, self._dwave, use_gpu=use_gpu)
Proc 114: File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/code/redrock/main/py/redrock/rebin.py", line 491, in rebin_template
xmin = template.minwave*(1+myz.max())
Proc 114: File "/global/common/software/desi/perlmutter/desiconda/20240425-2.2.0/conda/lib/python3.10/site-packages/numpy/core/_methods.py", line 40, in _amax
return umr_maximum(a, axis, None, out, keepdims, initial, where)
Proc 114: ValueError: zero-size array to reduction operation maximum which has no identity
MPICH Notice [Rank 114] [job id 41054598.2] [Fri Jul 25 16:47:04 2025] [nid004182] - Abort(0) (rank 114 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 114
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 114
srun: error: nid004182: task 114: Exited with exit code 255
srun: Terminating StepId=41054598.2
slurmstepd: error: *** STEP 41054598.2 ON nid004182 CANCELLED AT 2025-07-25T23:47:04 ***
srun: error: nid004182: tasks 0-113,115-127: Terminated
srun: Force Terminated StepId=41054598.2
Running the same command with srun -n 64 ... works fine, though.
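For what it's worth, the traceback is consistent with rank 114 ending up with an empty `myz` array, since `myz.max()` on a zero-size array raises exactly this `ValueError`. Below is a minimal sketch of that failure mode, not the actual redrock code. It assumes the redshift grid is split across ranks in a way similar to `np.array_split`; the 100-point grid and the helper `local_max_redshift` are hypothetical and only there for illustration.

```python
import numpy as np

def local_max_redshift(redshifts, nranks, rank):
    """Hypothetical helper: max redshift in this rank's share, or None if the share is empty."""
    # Assumption: work is divided roughly like np.array_split, so when
    # nranks exceeds the number of redshifts the trailing ranks get nothing.
    myz = np.array_split(redshifts, nranks)[rank]
    if myz.size == 0:
        return None  # guard instead of calling .max() on an empty array
    return myz.max()

# Hypothetical grid with fewer points than the 128 ranks in the failing run.
redshifts = np.linspace(0.0, 1.0, 100)

# Without a guard, an empty share reproduces the error from the log.
empty_share = np.array_split(redshifts, 128)[114]
try:
    empty_share.max()
except ValueError as err:
    error_message = str(err)

print(error_message)
print(local_max_redshift(redshifts, 128, 114))  # empty share with 128 ranks
print(local_max_redshift(redshifts, 64, 63))    # non-empty share with 64 ranks
```

If this is what is happening, it would also explain why 64 ranks work: with fewer ranks than items, every rank gets a non-empty share.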