The memory copy speed seems to exceed the hardware limit

I'm trying to perform a memory copy on an RTX 4090 GPU. The measured bandwidth is about 2 TB/s, which clearly exceeds the theoretical peak of the 4090 (1008 GB/s).

**To reproduce**

The Minimal Working Example (MWE):

```julia
using CUDA, LinearAlgebra, BenchmarkTools
device!(0)  # device 0 is an RTX 4090 (see the device list below)

# Cases 1 and 2: 2^11 x 2^11 Float32 matrices
nx = ny = 2^11
A = CUDA.zeros(Float32, nx, ny);
B = CUDA.rand(Float32, nx, ny);
t_it1 = @belapsed CUDA.@sync copyto!($A, $B)      # device-to-device copy
t_it2 = @belapsed CUDA.@sync axpy!(2f0, $A, $B)   # B = 2A + B
# Effective bandwidth in GB/s, using a factor of 2 for read + write
T_tot1 = 2 * 1 / 1e9 * nx * ny * sizeof(Float32) / t_it1
T_tot2 = 2 * 1 / 1e9 * nx * ny * sizeof(Float32) / t_it2

# Case 3: 2^12 x 2^12 Float32 matrices
nx = ny = 2^12
A = CUDA.zeros(Float32, nx, ny);
B = CUDA.rand(Float32, nx, ny);
t_it3 = @belapsed CUDA.@sync copyto!($A, $B)
T_tot3 = 2 * 1 / 1e9 * nx * ny * sizeof(Float32) / t_it3
```
The factor of 2 in `T_tot1 = 2 * 1 / 1e9 * nx * ny * sizeof(Float32) / t_it1` accounts for one memory read plus one memory write per element. The result is in GB/s.
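To make the accounting explicit, here is a minimal sketch of the same formula as a function. The name `effective_bandwidth` and the `transfers` keyword are hypothetical, introduced only for illustration; they are not part of CUDA.jl or BenchmarkTools. It runs on the CPU and needs no GPU.

```julia
# Effective bandwidth in GB/s for an nx-by-ny array of element type T.
# `transfers` is the number of passes over the array that are counted
# (2 for a copy: one read of the source plus one write of the destination).
# NOTE: `effective_bandwidth` is a hypothetical helper, not a CUDA.jl API.
function effective_bandwidth(nx, ny, T::Type, t_seconds; transfers = 2)
    bytes_moved = transfers * nx * ny * sizeof(T)
    return bytes_moved / t_seconds / 1e9
end

# A made-up timing of 1 ms, used only to show the arithmetic:
effective_bandwidth(2^11, 2^11, Float32, 1e-3)
```

With the timings from the MWE this would be called as `effective_bandwidth(nx, ny, Float32, t_it1)`.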


**Output**
`T_tot1` and `T_tot2` come out as `T_tot1=2452` and `T_tot2=2281` (GB/s), whereas `T_tot3` comes out as `T_tot3=895`.

**Version info**

Details on Julia:

```
Julia Version 1.9.0-rc1
Commit 3b2e0d8fbc1 (2023-03-07 07:51 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 48 × Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 48 on 48 virtual cores
Environment:
  JULIA_NUM_THREADS = 48
```

Details on CUDA:

```
CUDA runtime 11.8, artifact installation
CUDA driver 12.1
NVIDIA driver 530.30.2

Libraries: 
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 12.0.0+530.30.2

Toolchain:
- Julia: 1.9.0-rc1
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

4 devices:
  0: NVIDIA GeForce RTX 4090 (sm_89, 22.758 GiB / 23.988 GiB available)
  1: NVIDIA GeForce RTX 4090 (sm_89, 22.883 GiB / 23.988 GiB available)
  2: NVIDIA TITAN V (sm_70, 11.770 GiB / 12.000 GiB available)
  3: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
```


**Additional context**

I also tried the RTX 2080 Ti and the TITAN V, and neither exceeds its theoretical bandwidth. The behaviour seems to occur only on the RTX 4090, with a 2^11 * 2^11 `Float32` matrix (or a 2^10 * 2^10 `Float64` matrix).
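For reference, the matrix sizes mentioned above correspond to the memory footprints computed below. This is plain arithmetic, and `array_mib` is a hypothetical helper name. How these footprints compare with the GPU's on-chip cache capacity might be worth checking, but that is only a guess and has not been verified here.

```julia
# Size in MiB of one nx-by-ny array with element type T.
# NOTE: `array_mib` is a hypothetical helper used only for this illustration.
array_mib(nx, ny, T::Type) = nx * ny * sizeof(T) / 2^20

for (n, T) in ((2^11, Float32), (2^10, Float64), (2^12, Float32))
    per_array = array_mib(n, n, T)
    # A copy touches two arrays: the source and the destination.
    println(n, " x ", n, " ", T, ": ", per_array, " MiB per array, ",
            2 * per_array, " MiB for both")
end
```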

The memory copy speed seems to exceed the hardware limit #1860

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

The memory copy speed seems to exceed the hardware limit #1860

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions