I'm trying to perform a memory copy on an RTX 4090 GPU. The measured bandwidth is about 2 TB/s, which clearly exceeds the theoretical peak of the 4090 (1008 GB/s).
To reproduce
The Minimal Working Example (MWE):
using CUDA, LinearAlgebra, BenchmarkTools
device!(0)  # device 0 is one of the RTX 4090s (see the device list below)

# Case 1: 2^11 x 2^11 Float32 matrices
nx = ny = 2^11
A = CUDA.zeros(Float32, nx, ny);
B = CUDA.rand(Float32, nx, ny);
t_it1 = @belapsed CUDA.@sync copyto!($A, $B)
t_it2 = @belapsed CUDA.@sync axpy!(2f0, $A, $B)
# Effective bandwidth in GB/s
T_tot1 = 2 * 1 / 1e9 * nx * ny * sizeof(Float32) / t_it1
T_tot2 = 2 * 1 / 1e9 * nx * ny * sizeof(Float32) / t_it2

# Case 2: 2^12 x 2^12 Float32 matrices
nx = ny = 2^12
A = CUDA.zeros(Float32, nx, ny);
B = CUDA.rand(Float32, nx, ny);
t_it3 = @belapsed CUDA.@sync copyto!($A, $B)
T_tot3 = 2 * 1 / 1e9 * nx * ny * sizeof(Float32) / t_it3
The factor 2 in T_tot1 = 2 * 1 / 1e9 * nx * ny * sizeof(Float32) / t_it1 accounts for one memory read and one memory write per element.
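To make the formula easier to check, here is a minimal sketch of the same calculation as a function. The name `effective_bandwidth_gbs` and the `passes` keyword are my own for illustration and are not part of CUDA.jl. The default `passes = 2` matches the read-plus-write assumption above. For `axpy!`, which reads both arrays and writes one, `passes = 3` may be the more accurate count, but I kept 2 in the numbers I report.

```julia
# Hypothetical helper (not a CUDA.jl function): effective bandwidth in GB/s.
# `passes` is how many times an array of this size crosses the memory bus
# per call, e.g. 2 for copyto! (read the source, write the destination).
function effective_bandwidth_gbs(dims::Dims, ::Type{T}, t_seconds::Real;
                                 passes::Integer = 2) where {T}
    bytes_per_array = prod(dims) * sizeof(T)
    return passes * bytes_per_array / t_seconds / 1e9
end

# Size in bytes of one 2^11 x 2^11 Float32 array.
bytes_small = 2^11 * 2^11 * sizeof(Float32)

# Inverting the formula: the time per copy implied by the reported 2452 GB/s.
t_implied = 2 * bytes_small / 2452e9   # seconds, roughly 13.7 microseconds

# Feeding that time back in recovers the reported figure.
bw_check = effective_bandwidth_gbs((2^11, 2^11), Float32, t_implied)
```

With the timings from the example, `effective_bandwidth_gbs((nx, ny), Float32, t_it1)` should give the same value as `T_tot1`.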
Output
T_tot1 and T_tot2 come out at T_tot1=2452 and T_tot2=2281 (GB/s), while T_tot3 comes out at T_tot3=895 (GB/s).
Version info
Details on Julia:
Julia Version 1.9.0-rc1
Commit 3b2e0d8fbc1 (2023-03-07 07:51 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 48 × Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
Threads: 48 on 48 virtual cores
Environment:
JULIA_NUM_THREADS = 48
Details on CUDA:
CUDA runtime 11.8, artifact installation
CUDA driver 12.1
NVIDIA driver 530.30.2
Libraries:
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 12.0.0+530.30.2
Toolchain:
- Julia: 1.9.0-rc1
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86
4 devices:
0: NVIDIA GeForce RTX 4090 (sm_89, 22.758 GiB / 23.988 GiB available)
1: NVIDIA GeForce RTX 4090 (sm_89, 22.883 GiB / 23.988 GiB available)
2: NVIDIA TITAN V (sm_70, 11.770 GiB / 12.000 GiB available)
3: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
Additional context
I also tried the RTX 2080 Ti and the TITAN V, and on neither did the measured bandwidth exceed the theoretical peak. It seems that only the RTX 4090 shows this behaviour, and only with a 2^11 * 2^11 Float32 matrix (or a 2^10 * 2^10 Float64 matrix).
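For anyone comparing the cases, the sketch below works out how much memory each array occupies. It is plain arithmetic with no GPU involved, and `array_mib` is a name I made up for illustration.

```julia
# Hypothetical helper: size in MiB of one nx-by-ny array with element type T.
array_mib(nx, ny, ::Type{T}) where {T} = nx * ny * sizeof(T) / 2^20

small_f32 = array_mib(2^11, 2^11, Float32)  # case with the unexpectedly high bandwidth
small_f64 = array_mib(2^10, 2^10, Float64)  # the Float64 case mentioned above
large_f32 = array_mib(2^12, 2^12, Float32)  # case that stays below the peak

# copyto! touches a source and a destination, so each call covers twice that.
println("2^11 x 2^11 Float32: ", small_f32, " MiB per array, ", 2 * small_f32, " MiB per copy")
println("2^10 x 2^10 Float64: ", small_f64, " MiB per array, ", 2 * small_f64, " MiB per copy")
println("2^12 x 2^12 Float32: ", large_f32, " MiB per array, ", 2 * large_f32, " MiB per copy")
```

So the cases that exceed the peak move 32 MiB or less per copy, while the 2^12 * 2^12 case that stays below it moves 128 MiB.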