Lock conflicts in synchronization cause massive slowdown with SLURM or thread pinning

### H100 node on a SLURM cluster

```julia
julia> CUDA.versioninfo()
CUDA toolchain: 
- runtime 13.0, artifact installation
- driver 590.48.1 for 13.1
- compiler 13.1

CUDA libraries: 
- CUBLAS: 13.1.0
- CURAND: 10.4.0
- CUFFT: 12.0.0
- CUSOLVER: 12.0.4
- CUSPARSE: 12.6.3
- CUPTI: 2025.3.1 (API 13.0.1)
- NVML: 13.0.0+590.48.1

Julia packages: 
- CUDA: 5.9.6
- GPUArrays: 11.4.1
- GPUCompiler: 1.8.2
- KernelAbstractions: 0.9.40
- CUDA_Driver_jll: 13.1.0+2
- CUDA_Compiler_jll: 0.4.1+1
- CUDA_Runtime_jll: 0.19.2+0

Toolchain:
- Julia: 1.12.5
- LLVM: 18.1.7

Preferences:
- nonblocking_synchronization: true

1 device:
  0: NVIDIA H100 (sm_90, 93.083 GiB / 93.584 GiB available)
```

Here is a simple MWE. It is fast in the minimum, but non-deterministically slow sometimes, significantly increading the mean.
This is consistent with our observations from real kernels. Especially small kernels are very slow in the mean, so they took a similar time in our code as the large ones, even though they should be multitudes faster.
```julia
julia> x = CUDA.ones(1_000_000_000);

julia> @benchmark sum($x)
BenchmarkTools.Trial: 419 samples with 1 evaluation per sample.
 Range (min … max):   2.493 ms … 37.205 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.993 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.910 ms ±  9.594 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                 ▃█                      ▁               ▃  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇██▆▁█▇▄▁▁▁▅▅▄▁▁▁▁▁▁▁▁▁▄▁█▁▄▁▁▁▁▇▁▆▁▄▁▄▁▆█ ▆
  2.49 ms      Histogram: log(frequency) by time        35 ms <

 Memory estimate: 5.89 KiB, allocs estimate: 325.
```
It gets even slower when I request 4 cores in SLURM (the above is requesting only one core):
```julia
julia> @benchmark sum($x)
BenchmarkTools.Trial: 205 samples with 1 evaluation per sample.
 Range (min … max):   8.331 ms … 28.104 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     25.843 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.491 ms ±  4.620 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                          █    
  ▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇█▃▄ ▂
  8.33 ms         Histogram: frequency by time        26.7 ms <

 Memory estimate: 430.80 KiB, allocs estimate: 8504.
```
I noticed that `@time` reports lock conflicts here on the slow runs.

```julia
julia> function test(x)
           for _ in 1:20
               @time sum(x)
           end
       end;

julia> test(x)
  0.011377 seconds (325 allocations: 5.891 KiB, 1 lock conflict)
  0.002555 seconds (325 allocations: 5.891 KiB)
  0.002528 seconds (325 allocations: 5.891 KiB)
  0.002524 seconds (325 allocations: 5.891 KiB)
  0.002526 seconds (326 allocations: 6.172 KiB)
  0.002523 seconds (325 allocations: 5.891 KiB)
  0.002536 seconds (325 allocations: 5.891 KiB)
  0.002524 seconds (325 allocations: 5.891 KiB)
  0.012198 seconds (325 allocations: 5.891 KiB, 1 lock conflict)
  0.026833 seconds (325 allocations: 5.891 KiB, 2 lock conflicts)
  0.034979 seconds (325 allocations: 5.891 KiB, 2 lock conflicts)
  0.012982 seconds (325 allocations: 5.891 KiB, 1 lock conflict)
  0.012985 seconds (325 allocations: 5.891 KiB, 1 lock conflict)
  0.012260 seconds (325 allocations: 5.891 KiB, 1 lock conflict)
  0.025695 seconds (325 allocations: 5.891 KiB, 2 lock conflicts)
  0.035985 seconds (325 allocations: 5.891 KiB, 2 lock conflicts)
  0.012985 seconds (325 allocations: 5.891 KiB, 1 lock conflict)
  0.012983 seconds (325 allocations: 5.891 KiB, 1 lock conflict)
  0.012381 seconds (325 allocations: 5.891 KiB, 1 lock conflict)
  0.002549 seconds (325 allocations: 5.891 KiB)
```
I could trace back to CUDA.synchronize:
```
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] (::Base.var"#slowlock#1199")(rl::ReentrantLock)
    @ Base ./REPL[10]:12
  [3] lock
    @ ./REPL[10]:2 [inlined]
  [4] relockall
    @ ./lock.jl:208 [inlined]
  [5] wait(c::Base.GenericCondition{ReentrantLock}; first::Bool)
    @ Base ./condition.jl:135
  [6] wait
    @ ./condition.jl:125 [inlined]
  [7] put!(c::CUDA.BidirectionalChannel{Union{CuContext, CuEvent, CuStream}, CUDA.cudaError_enum}, v::CuStream)
    @ CUDA ~/.julia/packages/CUDA/FJf6p/lib/cudadrv/synchronization.jl:40
  [8] nonblocking_synchronize(val::CuStream)
    @ CUDA ~/.julia/packages/CUDA/FJf6p/lib/cudadrv/synchronization.jl:168
  [9] synchronize(stream::CuStream; blocking::Bool, spin::Bool)
    @ CUDA ~/.julia/packages/CUDA/FJf6p/lib/cudadrv/synchronization.jl:200
 [10] synchronize (repeats 2 times)
    @ ~/.julia/packages/CUDA/FJf6p/lib/cudadrv/synchronization.jl:194 [inlined]
 [11] synchronize
    @ ~/.julia/packages/CUDA/FJf6p/src/CUDAKernels.jl:32 [inlined]
```
Now with blocking synchronization,
```toml
[CUDA]
nonblocking_synchronization = false
```
this doesn't happen, and I get
```julia
julia> @benchmark sum($x)
BenchmarkTools.Trial: 1131 samples with 1 evaluation per sample.
 Range (min … max):  2.379 ms … 26.466 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.390 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.388 ms ±  5.062 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                            ▃                              
  █▃▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▁▁▃▃▁▁▁▅▅▃▇▆█▆▅▅▆▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▃▁▅▁▁▇▁▆ ▇
  2.38 ms      Histogram: log(frequency) by time       25 ms <

 Memory estimate: 1.84 KiB, allocs estimate: 66.
```
and no lock conflicts reported by `@time`, but still compilation time for some reason:
```julia
julia> test(x)
  0.004839 seconds (66 allocations: 1.844 KiB)
  0.004786 seconds (66 allocations: 1.844 KiB)
  0.004770 seconds (66 allocations: 1.844 KiB)
  0.015916 seconds (21.34 k allocations: 1.015 MiB)
  0.004844 seconds (66 allocations: 1.844 KiB)
  0.017654 seconds (26.20 k allocations: 1.252 MiB, 334.92% compilation time)
  0.004857 seconds (66 allocations: 1.844 KiB)
  0.004764 seconds (66 allocations: 1.844 KiB)
  0.017268 seconds (13.66 k allocations: 667.633 KiB)
  0.004870 seconds (66 allocations: 1.844 KiB)
  0.017050 seconds (22.13 k allocations: 1.053 MiB)
  0.004870 seconds (66 allocations: 1.844 KiB)
  0.004781 seconds (67 allocations: 2.125 KiB)
  0.017214 seconds (26.20 k allocations: 1.253 MiB, 431.06% compilation time)
  0.004871 seconds (66 allocations: 1.844 KiB)
  0.017035 seconds (12.01 k allocations: 593.398 KiB)
  0.004880 seconds (66 allocations: 1.844 KiB)
  0.004765 seconds (66 allocations: 1.844 KiB)
  0.017215 seconds (14.75 k allocations: 753.328 KiB)
  0.004772 seconds (66 allocations: 1.844 KiB)
```
Note that this is running Julia **on only one thread**.

### GH200 non-SLURM machine

```julia
julia> CUDA.versioninfo()
CUDA toolchain: 
- runtime 12.9, artifact installation
- driver 545.23.8 for 12.3
- compiler 12.9

CUDA libraries: 
- CUBLAS: 12.3.4
- CURAND: 10.3.10
- CUFFT: 11.4.1
- CUSOLVER: 11.7.5
- CUSPARSE: 12.5.10
- CUPTI: 2025.2.1 (API 12.9.1)
- NVML: 12.0.0+545.23.8

Julia packages: 
- CUDA: 5.9.7
- GPUArrays: 11.4.1
- GPUCompiler: 1.8.2
- KernelAbstractions: 0.9.40
- CUDA_Driver_jll: 13.1.0+2
- CUDA_Compiler_jll: 0.4.1+1
- CUDA_Runtime_jll: 0.19.2+0

Toolchain:
- Julia: 1.11.9
- LLVM: 16.0.6

1 device:
  0: GH200 480GB (sm_90, 40.552 GiB / 95.577 GiB available)
```
On this machine, still using only one thread, we can observe a similar behavior when pinning threads before loading CUDA.jl:
```julia
julia> using ThreadPinning; pinthreads(:cores);

julia> using CUDA
```
Then we get the same issue. Interestingly, here the mean is not that much slower than the min, but the min is also very slow. We can see in the `@time` that we basically always get lock conflicts, sometimes 1, sometimes 2, which is probably the reason for the bad min time.
```julia
julia> @benchmark sum($x)
BenchmarkTools.Trial: 208 samples with 1 evaluation per sample.
 Range (min … max):  11.987 ms … 44.017 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.009 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.994 ms ±  9.538 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃      █                            ▇       ▄                
  █▄▄▁▁▆▁█▁▁▁▁▁▁▁▁▁▁▁▄▁▁▄▁▁▁▁▁▁▅▁▄▁▄▁▁█▁▄▁▄▁▁▁█▁▁▁▁▁▁▇▁▁▁▁▁▁▆ ▅
  12 ms        Histogram: log(frequency) by time        44 ms <

 Memory estimate: 3.42 KiB, allocs estimate: 153.

julia> test(x)
  0.014886 seconds (200 allocations: 9.500 KiB, 1 lock conflict)
  0.031869 seconds (153 allocations: 3.422 KiB, 2 lock conflicts)
  0.015984 seconds (153 allocations: 3.422 KiB, 1 lock conflict)
  0.031967 seconds (153 allocations: 3.422 KiB, 2 lock conflicts)
  0.043973 seconds (153 allocations: 3.422 KiB, 2 lock conflicts)
  0.016000 seconds (154 allocations: 3.703 KiB, 1 lock conflict)
  0.031970 seconds (153 allocations: 3.422 KiB, 2 lock conflicts)
  0.015990 seconds (153 allocations: 3.422 KiB, 1 lock conflict)
  0.011994 seconds (153 allocations: 3.422 KiB, 1 lock conflict)
  0.031958 seconds (153 allocations: 3.422 KiB, 2 lock conflicts)
  0.015993 seconds (153 allocations: 3.422 KiB, 1 lock conflict)
  0.031971 seconds (153 allocations: 3.422 KiB, 2 lock conflicts)
  0.015993 seconds (153 allocations: 3.422 KiB, 1 lock conflict)
  0.011986 seconds (153 allocations: 3.422 KiB, 1 lock conflict)
  0.031960 seconds (153 allocations: 3.422 KiB, 2 lock conflicts)
  0.015993 seconds (153 allocations: 3.422 KiB, 1 lock conflict)
  0.031971 seconds (153 allocations: 3.422 KiB, 2 lock conflicts)
  0.015995 seconds (153 allocations: 3.422 KiB, 1 lock conflict)
  0.011990 seconds (153 allocations: 3.422 KiB, 1 lock conflict)
  0.031959 seconds (153 allocations: 3.422 KiB, 2 lock conflicts)
```
Without pinning threads:
```julia
julia> @benchmark sum($x)
BenchmarkTools.Trial: 2350 samples with 1 evaluation per sample.
 Range (min … max):  2.107 ms … 2.274 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.116 ms             ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.117 ms ± 4.547 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                    ▂ ▂▄▄▇▇▆▆▅███▆█▇▅▄▅▄▂▄▂▁▁                
  ▂▁▁▁▂▂▃▂▃▃▃▃▅▄▅▆▆██████████████████████████▇▆█▆▆▄▄▄▄▅▃▃▃▃ ▅
  2.11 ms        Histogram: frequency by time       2.12 ms <

 Memory estimate: 3.42 KiB, allocs estimate: 153.

julia> test(x)
  0.002333 seconds (200 allocations: 9.500 KiB)
  0.002144 seconds (153 allocations: 3.422 KiB)
  0.002128 seconds (153 allocations: 3.422 KiB)
  0.002133 seconds (153 allocations: 3.422 KiB)
  0.002130 seconds (153 allocations: 3.422 KiB)
  0.002121 seconds (153 allocations: 3.422 KiB)
  0.002125 seconds (153 allocations: 3.422 KiB)
  0.002124 seconds (153 allocations: 3.422 KiB)
  0.002130 seconds (153 allocations: 3.422 KiB)
  0.002125 seconds (153 allocations: 3.422 KiB)
  0.002129 seconds (153 allocations: 3.422 KiB)
  0.002127 seconds (153 allocations: 3.422 KiB)
  0.002128 seconds (153 allocations: 3.422 KiB)
  0.002132 seconds (153 allocations: 3.422 KiB)
  0.002123 seconds (153 allocations: 3.422 KiB)
  0.002127 seconds (153 allocations: 3.422 KiB)
  0.002129 seconds (153 allocations: 3.422 KiB)
  0.002127 seconds (153 allocations: 3.422 KiB)
  0.002128 seconds (153 allocations: 3.422 KiB)
  0.002127 seconds (153 allocations: 3.422 KiB)
```
If we now pin threads after this benchmark without restarting the session, it does not get slower again. We have to pin before running the MWE in order to trigger the bug.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lock conflicts in synchronization cause massive slowdown with SLURM or thread pinning #3060

H100 node on a SLURM cluster

GH200 non-SLURM machine

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Lock conflicts in synchronization cause massive slowdown with SLURM or thread pinning #3060

Description

H100 node on a SLURM cluster

GH200 non-SLURM machine

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions