Replies: 3 comments 1 reply
Yeah, configuring MPI might be a bit of a hassle.

using MPI
MPI.Init()
# using CUDA # For CUDA-Aware MPI
rank = MPI.Comm_rank(MPI.COMM_WORLD)
size = MPI.Comm_size(MPI.COMM_WORLD)
a = zeros(Int, size)
a[rank + 1] = rank + 1 # MPI ranks are 0-based, Julia arrays are 1-based
# CUDA.device!(rank)
# a = CuArray(a) # for CUDA-Aware MPI
MPI.Allreduce!(a, +, MPI.COMM_WORLD)
@info rank, a # should show a = [1, 2, 3, 4] on every rank when run with 4 ranks

I would first test that the CPU version works, then uncomment the GPU parts to make sure GPU-GPU communication is enabled.
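To make the expected result concrete, here is a minimal pure-Julia sketch of what the script above checks. It does not use MPI.jl at all: `local_buffer` and `simulate_allreduce` are hypothetical helper names introduced only for illustration, and the element-wise sum stands in for what `Allreduce!` with `+` should leave on every rank.

```julia
# Pure-Julia stand-in for the MPI test; no MPI.jl required.
# `local_buffer` and `simulate_allreduce` are hypothetical helpers, not MPI.jl API.

# Buffer that a given 0-based rank fills before the reduction.
function local_buffer(rank::Int, nranks::Int)
    a = zeros(Int, nranks)
    a[rank + 1] = rank + 1  # shift by one: MPI ranks start at 0, Julia indices at 1
    return a
end

# Element-wise sum over all ranks, i.e. what a `+` all-reduce leaves on every rank.
simulate_allreduce(nranks::Int) =
    reduce(+, [local_buffer(r, nranks) for r in 0:nranks-1])

println(simulate_allreduce(4))  # prints [1, 2, 3, 4]
```

If the real MPI run prints anything other than `1:nranks` on some rank, the ranks are probably not communicating through the same communicator.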
Thanks @simone-silvestri, CPU works fine. In your code I had to change

[ljg48@a1122u02n01.bouchet numerical-earth]$ srun julia --project=${PROJECT} message-pass.jl
┌ Warning: CUDA runtime library `libcudart.so.12` was loaded from a system path, `/apps/software/system/software/CUDA/12.6.0/targets/x86_64-linux/lib/libcudart.so.12`.
│ This may cause errors.
│
│ If you're running under a profiler, this situation is expected. Otherwise,
│ ensure that your library path environment variable (e.g., `PATH` on Windows
│ or `LD_LIBRARY_PATH` on Linux) does not include CUDA library paths.
│
│ In any other case, please file an issue.
└ @ CUDA /nfs/roberts/scratch/pi_me586/ljg48/julia_depot/packages/CUDA/Il00B/src/initialization.jl:198
┌ Warning: CUDA runtime library `libcublasLt.so.12` was loaded from a system path, `/apps/software/system/software/CUDA/12.6.0/targets/x86_64-linux/lib/libcublasLt.so.12`.
│ This may cause errors.
│
│ If you're running under a profiler, this situation is expected. Otherwise,
│ ensure that your library path environment variable (e.g., `PATH` on Windows
│ or `LD_LIBRARY_PATH` on Linux) does not include CUDA library paths.
│
│ In any other case, please file an issue.
└ @ CUDA /nfs/roberts/scratch/pi_me586/ljg48/julia_depot/packages/CUDA/Il00B/src/initialization.jl:198
┌ Warning: CUDA runtime library `libnvJitLink.so.12` was loaded from a system path, `/apps/software/system/software/CUDA/12.6.0/targets/x86_64-linux/lib/libnvJitLink.so.12`.
│ This may cause errors.
│
│ If you're running under a profiler, this situation is expected. Otherwise,
│ ensure that your library path environment variable (e.g., `PATH` on Windows
│ or `LD_LIBRARY_PATH` on Linux) does not include CUDA library paths.
│
│ In any other case, please file an issue.
└ @ CUDA /nfs/roberts/scratch/pi_me586/ljg48/julia_depot/packages/CUDA/Il00B/src/initialization.jl:198
┌ Warning: CUDA runtime library `libcusparse.so.12` was loaded from a system path, `/apps/software/system/software/CUDA/12.6.0/targets/x86_64-linux/lib/libcusparse.so.12`.
│ This may cause errors.
│
│ If you're running under a profiler, this situation is expected. Otherwise,
│ ensure that your library path environment variable (e.g., `PATH` on Windows
│ or `LD_LIBRARY_PATH` on Linux) does not include CUDA library paths.
│
│ In any other case, please file an issue.
└ @ CUDA /nfs/roberts/scratch/pi_me586/ljg48/julia_depot/packages/CUDA/Il00B/src/initialization.jl:198
┌ Warning: CUDA runtime library `libcudart.so.12` was loaded from a system path, `/apps/software/system/software/CUDA/12.6.0/targets/x86_64-linux/lib/libcudart.so.12`.
│ This may cause errors.
│
│ If you're running under a profiler, this situation is expected. Otherwise,
│ ensure that your library path environment variable (e.g., `PATH` on Windows
│ or `LD_LIBRARY_PATH` on Linux) does not include CUDA library paths.
│
│ In any other case, please file an issue.
└ @ CUDA /nfs/roberts/scratch/pi_me586/ljg48/julia_depot/packages/CUDA/Il00B/src/initialization.jl:198
┌ Warning: CUDA runtime library `libcublasLt.so.12` was loaded from a system path, `/apps/software/system/software/CUDA/12.6.0/targets/x86_64-linux/lib/libcublasLt.so.12`.
│ This may cause errors.
│
│ If you're running under a profiler, this situation is expected. Otherwise,
│ ensure that your library path environment variable (e.g., `PATH` on Windows
│ or `LD_LIBRARY_PATH` on Linux) does not include CUDA library paths.
│
│ In any other case, please file an issue.
└ @ CUDA /nfs/roberts/scratch/pi_me586/ljg48/julia_depot/packages/CUDA/Il00B/src/initialization.jl:198
┌ Warning: CUDA runtime library `libnvJitLink.so.12` was loaded from a system path, `/apps/software/system/software/CUDA/12.6.0/targets/x86_64-linux/lib/libnvJitLink.so.12`.
│ This may cause errors.
│
│ If you're running under a profiler, this situation is expected. Otherwise,
│ ensure that your library path environment variable (e.g., `PATH` on Windows
│ or `LD_LIBRARY_PATH` on Linux) does not include CUDA library paths.
│
│ In any other case, please file an issue.
└ @ CUDA /nfs/roberts/scratch/pi_me586/ljg48/julia_depot/packages/CUDA/Il00B/src/initialization.jl:198
┌ Warning: CUDA runtime library `libcusparse.so.12` was loaded from a system path, `/apps/software/system/software/CUDA/12.6.0/targets/x86_64-linux/lib/libcusparse.so.12`.
│ This may cause errors.
│
│ If you're running under a profiler, this situation is expected. Otherwise,
│ ensure that your library path environment variable (e.g., `PATH` on Windows
│ or `LD_LIBRARY_PATH` on Linux) does not include CUDA library paths.
│
│ In any other case, please file an issue.
└ @ CUDA /nfs/roberts/scratch/pi_me586/ljg48/julia_depot/packages/CUDA/Il00B/src/initialization.jl:198
ERROR: LoadError: CUDA error: invalid device ordinal (code 101, ERROR_INVALID_DEVICE)
Stacktrace:
[1] throw_api_error(res::CUDA.cudaError_enum)
@ CUDA /nfs/roberts/scratch/pi_me586/ljg48/julia_depot/packages/CUDA/Il00B/lib/cudadrv/libcuda.jl:30
[2] check(f::CUDA.var"#cuDeviceGet##0#cuDeviceGet##1"{Base.RefValue{Int32}, Int64})
@ CUDA /nfs/roberts/scratch/pi_me586/ljg48/julia_depot/packages/CUDA/Il00B/lib/cudadrv/libcuda.jl:37
[3] cuDeviceGet(device::Base.RefValue{Int32}, ordinal::Int64)
@ CUDA /nfs/roberts/scratch/pi_me586/ljg48/julia_depot/packages/GPUToolbox/x8hVh/src/ccalls.jl:33
[4] CuDevice
@ /nfs/roberts/scratch/pi_me586/ljg48/julia_depot/packages/CUDA/Il00B/lib/cudadrv/devices.jl:17 [inlined]
[5] device!
@ /nfs/roberts/scratch/pi_me586/ljg48/julia_depot/packages/CUDA/Il00B/lib/cudadrv/state.jl:324 [inlined]
[6] device!(dev::Int64)
@ CUDA /nfs/roberts/scratch/pi_me586/ljg48/julia_depot/packages/CUDA/Il00B/lib/cudadrv/state.jl:324
[7] top-level scope
@ /nfs/roberts/project/pi_me586/ljg48/numerical-earth/message-pass.jl:11
[8] include(mod::Module, _path::String)
@ Base ./Base.jl:306
[9] exec_options(opts::Base.JLOptions)
@ Base ./client.jl:317
[10] _start()
@ Base ./client.jl:550
in expression starting at /nfs/roberts/project/pi_me586/ljg48/numerical-earth/message-pass.jl:11
srun: error: a1122u02n01: task 1: Exited with exit code 1
slurmstepd: error: mpi/pmix_v5: _errhandler: a1122u02n01 [0]: pmixp_client_v2.c:212: Error handler invoked: status = -61, source = [slurm.pmix.9327659.5:1]
slurmstepd: error: *** STEP 9327659.5 ON a1122u02n01 CANCELLED AT 2026-04-23T17:06:42 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: a1122u02n01: task 0: Killed

It didn't like the

Looping in Tom Langford @tlangfor, who has been helping me with this too. @simone-silvestri, any thoughts?
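The stack trace points at `device!` on line 11 of the script, and only task 1 failed. That would be consistent with each Slurm task seeing fewer GPUs than there are ranks (for example one GPU per task), so that ordinal 1 does not exist for that process. A minimal sketch under that assumption: wrap the rank onto the devices that are actually visible. `visible_device` is a hypothetical helper, not part of CUDA.jl, and in the real script the device count would come from something like `length(CUDA.devices())`.

```julia
# Hypothetical helper: pick a 0-based CUDA device ordinal that is
# guaranteed to exist for this process.
function visible_device(rank::Integer, ndevices::Integer)
    ndevices > 0 || throw(ArgumentError("no CUDA devices are visible to this process"))
    return mod(rank, ndevices)  # wrap the rank onto the visible devices
end

# In the MPI script this could be used roughly as follows (needs CUDA.jl,
# so it is left as a comment here):
#   CUDA.device!(visible_device(rank, length(CUDA.devices())))

println(visible_device(1, 1))  # rank 1, one visible GPU  -> prints 0
println(visible_device(1, 2))  # rank 1, two visible GPUs -> prints 1
```

On multi-node jobs the node-local rank would be a safer input than the global rank, since the global rank keeps growing across nodes while device ordinals restart at 0 on each node.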
This PR in Oceananigans.jl solved this issue.
I am trying to start running simulations on multiple GPUs on Yale's Bouchet cluster. I was able to successfully run the one-degree global sea-ice simulation on a single NVIDIA H200 GPU, and I then adapted my code to run on two H200 GPUs. When I start the two-GPU simulation, it just hangs. Any thoughts on why this may happen?
Here are the modules and exports I use:
and here is how I set up the architecture:
One of the issues I was having is that packages couldn't find MPI. This was solved with the help of this discussion: I had to explicitly set the paths in `LocalPreferences.toml`. This feels like a clunky solution ... but it works.

Please let me know if you need more information. Any help would be greatly appreciated. Thanks! @simone-silvestri
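For readers hitting the same problem, here is a rough sketch of what an explicit MPI entry in `LocalPreferences.toml` can look like, together with a small sanity check. The section is the kind normally written by `MPIPreferences.use_system_binary()`. The paths and the `abi` value are placeholders and assumptions, not the ones used on Bouchet, and `uses_system_mpi` is a hypothetical helper rather than part of any package.

```julia
using TOML  # Julia standard library

# Example contents only: paths and ABI are placeholders, not Bouchet's values.
example_prefs = """
[MPIPreferences]
_format = "1.0"
abi = "OpenMPI"
binary = "system"
libmpi = "/path/to/openmpi/lib/libmpi.so"
mpiexec = "/path/to/openmpi/bin/mpiexec"
"""

# Hypothetical check: does the file point MPI.jl at a system MPI
# through explicit absolute paths?
function uses_system_mpi(prefs_text::AbstractString)
    prefs = TOML.parse(prefs_text)
    mpi = get(prefs, "MPIPreferences", Dict{String,Any}())
    return get(mpi, "binary", "") == "system" &&
           isabspath(get(mpi, "libmpi", "")) &&
           isabspath(get(mpi, "mpiexec", ""))
end

println(uses_system_mpi(example_prefs))  # prints true for the example above
```

A check like this only confirms that the preferences file is well-formed. It does not confirm that the library it points to is CUDA-aware.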