Add a dispatch for LinearAlgebra.norm2 #2302
413c397 to 0e2ef84
What about generalizing the
5d585c4 to c850163
Hi, what's the status of this PR? This issue is troublesome for some of my code, and I would like to know whether the fix will be merged into CUDA.jl.
The PR fails CI, and there's an outstanding comment of mine, so it needs work I'd say. Feel free to take it up if you want.
Rebased. Depends on JuliaGPU/GPUArrays.jl#720 now.
Generalizes the BLAS-optimized `norm`/`norm2` methods from `DenseCuArray` to `StridedCuVecOrDenseMat`, so 1D strided subarray views also dispatch to `nrm2`. Multi-dim non-contiguous views go through the sum-based fallback in GPUArrays (which now dispatches on `AnyGPUArray`). Resolves JuliaGPU#2280, replaces JuliaGPU#2302. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
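The reason a 1D strided view can go to `nrm2` at all is that it has a length and a constant element stride, which is everything a BLAS-style routine takes. The sketch below is a CPU analogy using only the `LinearAlgebra` standard library (`BLAS.nrm2(n, X, incx)`); it is not the code from this PR and it assumes the GPU wrapper behaves analogously.

```julia
using LinearAlgebra

# CPU stand-in for the GPU case: a 1D strided view into a dense vector.
x = collect(Float64, 1:10)
v = @view x[1:2:end]          # elements 1, 3, 5, 7, 9 (stride 2)

# A strided 1D view exposes a length and a constant element stride.
n    = length(v)
incx = stride(v, 1)

# BLAS path: Euclidean norm computed directly from the strided memory.
blas_norm = BLAS.nrm2(n, v, incx)

# Reduction path: a sum-based computation that works for any memory layout.
fallback_norm = sqrt(sum(abs2, v))

println(blas_norm ≈ fallback_norm)
```

Both paths give the same value here; the difference is only which one a given array type is allowed to dispatch to.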
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2302 +/- ##
=======================================
Coverage ? 16.39%
=======================================
Files ? 124
Lines ? 9827
Branches ? 0
=======================================
Hits ? 1611
Misses ? 8216
Partials ? 0
CUDA.jl Benchmarks
| Benchmark suite | Current: 156a51a | Previous: 2ace55f | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 101544 ns | 100927 ns | 1.01 |
| array/accumulate/Float32/dims=1 | 76473 ns | 77244 ns | 0.99 |
| array/accumulate/Float32/dims=1L | 1585314 ns | 1586192 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 143576 ns | 144286 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 657193.5 ns | 658394 ns | 1.00 |
| array/accumulate/Int64/1d | 118474 ns | 118643 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79908 ns | 79829 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1706190.5 ns | 1694383.5 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 156617 ns | 156179.5 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 961814 ns | 961974 ns | 1.00 |
| array/broadcast | 20524 ns | 20478 ns | 1.00 |
| array/construct | 1257.05 ns | 1277.6 ns | 0.98 |
| array/copy | 17916 ns | 18044 ns | 0.99 |
| array/copyto!/cpu_to_gpu | 213898 ns | 215225 ns | 0.99 |
| array/copyto!/gpu_to_cpu | 281245 ns | 282660 ns | 0.99 |
| array/copyto!/gpu_to_gpu | 10695 ns | 10909 ns | 0.98 |
| array/iteration/findall/bool | 134649 ns | 134548 ns | 1.00 |
| array/iteration/findall/int | 148368 ns | 149686 ns | 0.99 |
| array/iteration/findfirst/bool | 81151 ns | 81331 ns | 1.00 |
| array/iteration/findfirst/int | 83416.5 ns | 83732 ns | 1.00 |
| array/iteration/findmin/1d | 85341 ns | 86429 ns | 0.99 |
| array/iteration/findmin/2d | 114129.5 ns | 117614.5 ns | 0.97 |
| array/iteration/logical | 200491.5 ns | 199724.5 ns | 1.00 |
| array/iteration/scalar | 67792 ns | 68306 ns | 0.99 |
| array/permutedims/2d | 52319.5 ns | 52646 ns | 0.99 |
| array/permutedims/3d | 52444 ns | 52872 ns | 0.99 |
| array/permutedims/4d | 51338 ns | 51484 ns | 1.00 |
| array/random/rand/Float32 | 13335 ns | 12708 ns | 1.05 |
| array/random/rand/Int64 | 24379 ns | 25198 ns | 0.97 |
| array/random/rand!/Float32 | 9851.666666666666 ns | 8737.333333333334 ns | 1.13 |
| array/random/rand!/Int64 | 21090 ns | 22022 ns | 0.96 |
| array/random/randn/Float32 | 43186 ns | 42861 ns | 1.01 |
| array/random/randn!/Float32 | 30810 ns | 30594 ns | 1.01 |
| array/reductions/mapreduce/Float32/1d | 34712 ns | 34442 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1 | 40828 ns | 39651 ns | 1.03 |
| array/reductions/mapreduce/Float32/dims=1L | 51311 ns | 51115 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 58180 ns | 56288.5 ns | 1.03 |
| array/reductions/mapreduce/Float32/dims=2L | 67774 ns | 69317 ns | 0.98 |
| array/reductions/mapreduce/Int64/1d | 42789.5 ns | 42454.5 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1 | 51811 ns | 50760 ns | 1.02 |
| array/reductions/mapreduce/Int64/dims=1L | 87260 ns | 87043.5 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 60671.5 ns | 59370.5 ns | 1.02 |
| array/reductions/mapreduce/Int64/dims=2L | 84038 ns | 84559 ns | 0.99 |
| array/reductions/reduce/Float32/1d | 34945 ns | 34787 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1 | 39822.5 ns | 39944 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1L | 51236 ns | 51284 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 58238 ns | 56294 ns | 1.03 |
| array/reductions/reduce/Float32/dims=2L | 68142 ns | 69595 ns | 0.98 |
| array/reductions/reduce/Int64/1d | 42679 ns | 42459 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1 | 43466 ns | 50053 ns | 0.87 |
| array/reductions/reduce/Int64/dims=1L | 87194 ns | 87114 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 60654 ns | 59331.5 ns | 1.02 |
| array/reductions/reduce/Int64/dims=2L | 84499.5 ns | 84197 ns | 1.00 |
| array/reverse/1d | 17902 ns | 17694 ns | 1.01 |
| array/reverse/1dL | 68542 ns | 68202 ns | 1.00 |
| array/reverse/1dL_inplace | 65769 ns | 65756 ns | 1.00 |
| array/reverse/1d_inplace | 8541.666666666666 ns | 10156.166666666668 ns | 0.84 |
| array/reverse/2d | 20901 ns | 21001 ns | 1.00 |
| array/reverse/2dL | 73023 ns | 73024 ns | 1.00 |
| array/reverse/2dL_inplace | 65914 ns | 65634 ns | 1.00 |
| array/reverse/2d_inplace | 9973 ns | 11107 ns | 0.90 |
| array/sorting/1d | 2734742 ns | 2736491 ns | 1.00 |
| array/sorting/2d | 1068254 ns | 1070402 ns | 1.00 |
| array/sorting/by | 3303672 ns | 3304900 ns | 1.00 |
| cuda/synchronization/context/auto | 1145.3 ns | 1123.1 ns | 1.02 |
| cuda/synchronization/context/blocking | 921.1923076923077 ns | 902.5957446808511 ns | 1.02 |
| cuda/synchronization/context/nonblocking | 7130.2 ns | 7638.8 ns | 0.93 |
| cuda/synchronization/stream/auto | 992.5625 ns | 980.6666666666666 ns | 1.01 |
| cuda/synchronization/stream/blocking | 833.6666666666666 ns | 806.5 ns | 1.03 |
| cuda/synchronization/stream/nonblocking | 7230.299999999999 ns | 7187.8 ns | 1.01 |
| integration/byval/reference | 143781 ns | 143733 ns | 1.00 |
| integration/byval/slices=1 | 145763 ns | 145971 ns | 1.00 |
| integration/byval/slices=2 | 284545 ns | 284607 ns | 1.00 |
| integration/byval/slices=3 | 423071 ns | 423028 ns | 1.00 |
| integration/cudadevrt | 102317 ns | 102291 ns | 1.00 |
| integration/volumerhs | 23424198.5 ns | 23455094 ns | 1.00 |
| kernel/indexing | 13267 ns | 13164 ns | 1.01 |
| kernel/indexing_checked | 13841 ns | 13822 ns | 1.00 |
| kernel/launch | 2182.3333333333335 ns | 2137 ns | 1.02 |
| kernel/occupancy | 675.5125 ns | 674.2788461538462 ns | 1.00 |
| kernel/rand | 17207 ns | 14157 ns | 1.22 |
| latency/import | 3848898595 ns | 3799013573 ns | 1.01 |
| latency/precompile | 4628935725 ns | 4593655026.5 ns | 1.01 |
| latency/ttfp | 4456320918 ns | 4367019984.5 ns | 1.02 |
This comment was automatically generated by workflow using github-action-benchmark.
`norm(@view x[..], 2)` previously ended up calling `LinearAlgebra.generic_norm2`, which led to scalar indexing. This PR catches such CUDA subarray `norm2` calls earlier.
Inf-norm and p-norm with CUDA subarrays still lead to the following dispatches:
I am not sure if there is a better way to dispatch the above.
Should resolve #2280.
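For the Inf-norm and p-norm cases, one way to stay clear of scalar indexing is to express the norm as a whole-array reduction, in the spirit of the sum-based fallback mentioned in the commit message. The following is a minimal CPU sketch; `inf_norm_reduce` and `p_norm_reduce` are hypothetical helper names, not functions from CUDA.jl, GPUArrays, or LinearAlgebra, and the sketch omits the rescaling that `LinearAlgebra.norm` applies to avoid overflow.

```julia
using LinearAlgebra

# Hypothetical helpers: norms written purely as reductions over the array,
# so no element is read through scalar indexing.
inf_norm_reduce(v) = mapreduce(abs, max, v)
p_norm_reduce(v, p) = sum(a -> abs(a)^p, v)^(1 / p)

A = reshape(collect(Float64, 1:12), 3, 4)
v = @view A[1:2, 2:3]         # multi-dimensional, non-contiguous view

# Compare against the generic LinearAlgebra results on the CPU.
println(inf_norm_reduce(v) ≈ norm(v, Inf))
println(p_norm_reduce(v, 3) ≈ norm(v, 3))
```

Whether reductions like these would be acceptable as dispatch targets for GPU subarrays is a design question for the PR; the sketch only shows that the two norms can be written without element-wise access.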