Skip to content

Lock conflicts in synchronization cause massive slowdown with SLURM or thread pinning #3060

@efaulhaber

Description

@efaulhaber

H100 node on a SLURM cluster

julia> CUDA.versioninfo()
CUDA toolchain: 
- runtime 13.0, artifact installation
- driver 590.48.1 for 13.1
- compiler 13.1

CUDA libraries: 
- CUBLAS: 13.1.0
- CURAND: 10.4.0
- CUFFT: 12.0.0
- CUSOLVER: 12.0.4
- CUSPARSE: 12.6.3
- CUPTI: 2025.3.1 (API 13.0.1)
- NVML: 13.0.0+590.48.1

Julia packages: 
- CUDA: 5.9.6
- GPUArrays: 11.4.1
- GPUCompiler: 1.8.2
- KernelAbstractions: 0.9.40
- CUDA_Driver_jll: 13.1.0+2
- CUDA_Compiler_jll: 0.4.1+1
- CUDA_Runtime_jll: 0.19.2+0

Toolchain:
- Julia: 1.12.5
- LLVM: 18.1.7

Preferences:
- nonblocking_synchronization: true

1 device:
  0: NVIDIA H100 (sm_90, 93.083 GiB / 93.584 GiB available)

Here is a simple MWE. It is fast in the minimum, but non-deterministically slow sometimes, significantly increading the mean.
This is consistent with our observations from real kernels. Especially small kernels are very slow in the mean, so they took a similar time in our code as the large ones, even though they should be multitudes faster.

julia> x = CUDA.ones(1_000_000_000);

julia> @benchmark sum($x)
BenchmarkTools.Trial: 419 samples with 1 evaluation per sample.
 Range (min  max):   2.493 ms  37.205 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     12.993 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.910 ms ±  9.594 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                 ▃█                      ▁               ▃  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇██▆▁█▇▄▁▁▁▅▅▄▁▁▁▁▁▁▁▁▁▄▁█▁▄▁▁▁▁▇▁▆▁▄▁▄▁▆█ ▆
  2.49 ms      Histogram: log(frequency) by time        35 ms <

 Memory estimate: 5.89 KiB, allocs estimate: 325.

It gets even slower when I request 4 cores in SLURM (the above is requesting only one core):

julia> @benchmark sum($x)
BenchmarkTools.Trial: 205 samples with 1 evaluation per sample.
 Range (min  max):   8.331 ms  28.104 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     25.843 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.491 ms ±  4.620 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                          █    
  ▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇█▃▄ ▂
  8.33 ms         Histogram: frequency by time        26.7 ms <

 Memory estimate: 430.80 KiB, allocs estimate: 8504.

I noticed that @time reports lock conflicts here on the slow runs.

julia> function test(x)
           for _ in 1:20
               @time sum(x)
           end
       end;

julia> test(x)
  0.011377 seconds (325 allocations: 5.891 KiB, 1 lock conflict)
  0.002555 seconds (325 allocations: 5.891 KiB)
  0.002528 seconds (325 allocations: 5.891 KiB)
  0.002524 seconds (325 allocations: 5.891 KiB)
  0.002526 seconds (326 allocations: 6.172 KiB)
  0.002523 seconds (325 allocations: 5.891 KiB)
  0.002536 seconds (325 allocations: 5.891 KiB)
  0.002524 seconds (325 allocations: 5.891 KiB)
  0.012198 seconds (325 allocations: 5.891 KiB, 1 lock conflict)
  0.026833 seconds (325 allocations: 5.891 KiB, 2 lock conflicts)
  0.034979 seconds (325 allocations: 5.891 KiB, 2 lock conflicts)
  0.012982 seconds (325 allocations: 5.891 KiB, 1 lock conflict)
  0.012985 seconds (325 allocations: 5.891 KiB, 1 lock conflict)
  0.012260 seconds (325 allocations: 5.891 KiB, 1 lock conflict)
  0.025695 seconds (325 allocations: 5.891 KiB, 2 lock conflicts)
  0.035985 seconds (325 allocations: 5.891 KiB, 2 lock conflicts)
  0.012985 seconds (325 allocations: 5.891 KiB, 1 lock conflict)
  0.012983 seconds (325 allocations: 5.891 KiB, 1 lock conflict)
  0.012381 seconds (325 allocations: 5.891 KiB, 1 lock conflict)
  0.002549 seconds (325 allocations: 5.891 KiB)

I could trace back to CUDA.synchronize:

Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] (::Base.var"#slowlock#1199")(rl::ReentrantLock)
    @ Base ./REPL[10]:12
  [3] lock
    @ ./REPL[10]:2 [inlined]
  [4] relockall
    @ ./lock.jl:208 [inlined]
  [5] wait(c::Base.GenericCondition{ReentrantLock}; first::Bool)
    @ Base ./condition.jl:135
  [6] wait
    @ ./condition.jl:125 [inlined]
  [7] put!(c::CUDA.BidirectionalChannel{Union{CuContext, CuEvent, CuStream}, CUDA.cudaError_enum}, v::CuStream)
    @ CUDA ~/.julia/packages/CUDA/FJf6p/lib/cudadrv/synchronization.jl:40
  [8] nonblocking_synchronize(val::CuStream)
    @ CUDA ~/.julia/packages/CUDA/FJf6p/lib/cudadrv/synchronization.jl:168
  [9] synchronize(stream::CuStream; blocking::Bool, spin::Bool)
    @ CUDA ~/.julia/packages/CUDA/FJf6p/lib/cudadrv/synchronization.jl:200
 [10] synchronize (repeats 2 times)
    @ ~/.julia/packages/CUDA/FJf6p/lib/cudadrv/synchronization.jl:194 [inlined]
 [11] synchronize
    @ ~/.julia/packages/CUDA/FJf6p/src/CUDAKernels.jl:32 [inlined]

Now with blocking synchronization,

[CUDA]
nonblocking_synchronization = false

this doesn't happen, and I get

julia> @benchmark sum($x)
BenchmarkTools.Trial: 1131 samples with 1 evaluation per sample.
 Range (min  max):  2.379 ms  26.466 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.390 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.388 ms ±  5.062 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                            ▃                              
  █▃▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▁▁▃▃▁▁▁▅▅▃▇▆█▆▅▅▆▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▃▁▅▁▁▇▁▆ ▇
  2.38 ms      Histogram: log(frequency) by time       25 ms <

 Memory estimate: 1.84 KiB, allocs estimate: 66.

and no lock conflicts reported by @time, but still compilation time for some reason:

julia> test(x)
  0.004839 seconds (66 allocations: 1.844 KiB)
  0.004786 seconds (66 allocations: 1.844 KiB)
  0.004770 seconds (66 allocations: 1.844 KiB)
  0.015916 seconds (21.34 k allocations: 1.015 MiB)
  0.004844 seconds (66 allocations: 1.844 KiB)
  0.017654 seconds (26.20 k allocations: 1.252 MiB, 334.92% compilation time)
  0.004857 seconds (66 allocations: 1.844 KiB)
  0.004764 seconds (66 allocations: 1.844 KiB)
  0.017268 seconds (13.66 k allocations: 667.633 KiB)
  0.004870 seconds (66 allocations: 1.844 KiB)
  0.017050 seconds (22.13 k allocations: 1.053 MiB)
  0.004870 seconds (66 allocations: 1.844 KiB)
  0.004781 seconds (67 allocations: 2.125 KiB)
  0.017214 seconds (26.20 k allocations: 1.253 MiB, 431.06% compilation time)
  0.004871 seconds (66 allocations: 1.844 KiB)
  0.017035 seconds (12.01 k allocations: 593.398 KiB)
  0.004880 seconds (66 allocations: 1.844 KiB)
  0.004765 seconds (66 allocations: 1.844 KiB)
  0.017215 seconds (14.75 k allocations: 753.328 KiB)
  0.004772 seconds (66 allocations: 1.844 KiB)

Note that this is running Julia on only one thread.

GH200 non-SLURM machine

julia> CUDA.versioninfo()
CUDA toolchain: 
- runtime 12.9, artifact installation
- driver 545.23.8 for 12.3
- compiler 12.9

CUDA libraries: 
- CUBLAS: 12.3.4
- CURAND: 10.3.10
- CUFFT: 11.4.1
- CUSOLVER: 11.7.5
- CUSPARSE: 12.5.10
- CUPTI: 2025.2.1 (API 12.9.1)
- NVML: 12.0.0+545.23.8

Julia packages: 
- CUDA: 5.9.7
- GPUArrays: 11.4.1
- GPUCompiler: 1.8.2
- KernelAbstractions: 0.9.40
- CUDA_Driver_jll: 13.1.0+2
- CUDA_Compiler_jll: 0.4.1+1
- CUDA_Runtime_jll: 0.19.2+0

Toolchain:
- Julia: 1.11.9
- LLVM: 16.0.6

1 device:
  0: GH200 480GB (sm_90, 40.552 GiB / 95.577 GiB available)

On this machine, still using only one thread, we can observe a similar behavior when pinning threads before loading CUDA.jl:

julia> using ThreadPinning; pinthreads(:cores);

julia> using CUDA

Then we get the same issue. Interestingly, here the mean is not that much slower than the min, but the min is also very slow. We can see in the @time that we basically always get lock conflicts, sometimes 1, sometimes 2, which is probably the reason for the bad min time.

julia> @benchmark sum($x)
BenchmarkTools.Trial: 208 samples with 1 evaluation per sample.
 Range (min  max):  11.987 ms  44.017 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     16.009 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.994 ms ±  9.538 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃      █                            ▇       ▄                
  █▄▄▁▁▆▁█▁▁▁▁▁▁▁▁▁▁▁▄▁▁▄▁▁▁▁▁▁▅▁▄▁▄▁▁█▁▄▁▄▁▁▁█▁▁▁▁▁▁▇▁▁▁▁▁▁▆ ▅
  12 ms        Histogram: log(frequency) by time        44 ms <

 Memory estimate: 3.42 KiB, allocs estimate: 153.

julia> test(x)
  0.014886 seconds (200 allocations: 9.500 KiB, 1 lock conflict)
  0.031869 seconds (153 allocations: 3.422 KiB, 2 lock conflicts)
  0.015984 seconds (153 allocations: 3.422 KiB, 1 lock conflict)
  0.031967 seconds (153 allocations: 3.422 KiB, 2 lock conflicts)
  0.043973 seconds (153 allocations: 3.422 KiB, 2 lock conflicts)
  0.016000 seconds (154 allocations: 3.703 KiB, 1 lock conflict)
  0.031970 seconds (153 allocations: 3.422 KiB, 2 lock conflicts)
  0.015990 seconds (153 allocations: 3.422 KiB, 1 lock conflict)
  0.011994 seconds (153 allocations: 3.422 KiB, 1 lock conflict)
  0.031958 seconds (153 allocations: 3.422 KiB, 2 lock conflicts)
  0.015993 seconds (153 allocations: 3.422 KiB, 1 lock conflict)
  0.031971 seconds (153 allocations: 3.422 KiB, 2 lock conflicts)
  0.015993 seconds (153 allocations: 3.422 KiB, 1 lock conflict)
  0.011986 seconds (153 allocations: 3.422 KiB, 1 lock conflict)
  0.031960 seconds (153 allocations: 3.422 KiB, 2 lock conflicts)
  0.015993 seconds (153 allocations: 3.422 KiB, 1 lock conflict)
  0.031971 seconds (153 allocations: 3.422 KiB, 2 lock conflicts)
  0.015995 seconds (153 allocations: 3.422 KiB, 1 lock conflict)
  0.011990 seconds (153 allocations: 3.422 KiB, 1 lock conflict)
  0.031959 seconds (153 allocations: 3.422 KiB, 2 lock conflicts)

Without pinning threads:

julia> @benchmark sum($x)
BenchmarkTools.Trial: 2350 samples with 1 evaluation per sample.
 Range (min  max):  2.107 ms  2.274 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.116 ms             ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.117 ms ± 4.547 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                    ▂ ▂▄▄▇▇▆▆▅███▆█▇▅▄▅▄▂▄▂▁▁                
  ▂▁▁▁▂▂▃▂▃▃▃▃▅▄▅▆▆██████████████████████████▇▆█▆▆▄▄▄▄▅▃▃▃▃ ▅
  2.11 ms        Histogram: frequency by time       2.12 ms <

 Memory estimate: 3.42 KiB, allocs estimate: 153.

julia> test(x)
  0.002333 seconds (200 allocations: 9.500 KiB)
  0.002144 seconds (153 allocations: 3.422 KiB)
  0.002128 seconds (153 allocations: 3.422 KiB)
  0.002133 seconds (153 allocations: 3.422 KiB)
  0.002130 seconds (153 allocations: 3.422 KiB)
  0.002121 seconds (153 allocations: 3.422 KiB)
  0.002125 seconds (153 allocations: 3.422 KiB)
  0.002124 seconds (153 allocations: 3.422 KiB)
  0.002130 seconds (153 allocations: 3.422 KiB)
  0.002125 seconds (153 allocations: 3.422 KiB)
  0.002129 seconds (153 allocations: 3.422 KiB)
  0.002127 seconds (153 allocations: 3.422 KiB)
  0.002128 seconds (153 allocations: 3.422 KiB)
  0.002132 seconds (153 allocations: 3.422 KiB)
  0.002123 seconds (153 allocations: 3.422 KiB)
  0.002127 seconds (153 allocations: 3.422 KiB)
  0.002129 seconds (153 allocations: 3.422 KiB)
  0.002127 seconds (153 allocations: 3.422 KiB)
  0.002128 seconds (153 allocations: 3.422 KiB)
  0.002127 seconds (153 allocations: 3.422 KiB)

If we now pin threads after this benchmark without restarting the session, it does not get slower again. We have to pin before running the MWE in order to trigger the bug.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingperformanceHow fast can we go?

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions