Add a dispatch for LinearAlgebra.norm2 #2302

Merged
maleadt merged 1 commit into JuliaGPU:main from sharanry:sy/strided_norm2
May 16, 2026

Conversation

@sharanry
Contributor

norm(@view x[..], 2) previously ended up calling LinearAlgebra.generic_norm2, which triggered scalar indexing. This PR catches such norm2 calls on CUDA subarrays earlier.

Inf-norm and p-norm calls on CUDA subarrays still dispatch to the following methods:

LinearAlgebra.generic_normInf(x) = float(mapreduce(norm, max, x))
LinearAlgebra.generic_norm1(x) = mapreduce(float ∘ norm, +, x)

I am not sure if there is a better way to dispatch the above.
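
For reference, here is a minimal CPU-only sketch of what those two fallbacks compute, using a plain strided view as a stand-in for a CUDA subarray (an assumption made so the snippet runs without a GPU). The variable names are illustrative only; both expressions are written with `mapreduce` rather than an element-by-element loop.

```julia
using LinearAlgebra

# Stand-in data: a CPU vector and a strided view of it (not a CuArray).
x = collect(1.0:10.0)
v = @view x[1:2:end]          # elements 1, 3, 5, 7, 9

# Same expressions as the two generic fallbacks quoted above.
inf_norm = float(mapreduce(norm, max, v))
one_norm = mapreduce(float ∘ norm, +, v)

# They agree with the public `norm` API on this view.
println(inf_norm == norm(v, Inf))
println(one_norm == norm(v, 1))
```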

Should resolve #2280.

@maleadt
Member

maleadt commented Mar 27, 2024

What about generalizing the LinearAlgebra.norm method above to StridedCuArray? That seems cleaner than overriding an internal method.

@maleadt maleadt added needs changes Changes are needed. labels May 24, 2024
@maleadt maleadt marked this pull request as draft May 24, 2024 13:34
@maleadt maleadt added cuda array Stuff about CuArray. labels May 24, 2024
@maleadt maleadt force-pushed the master branch 15 times, most recently from 5d585c4 to c850163 Compare December 20, 2024 08:18
@Azercoco

Azercoco commented Jan 6, 2025

Hi, what's the status of this PR? This issue is troublesome for some of my code, and I would like to know whether the fix will be merged into CUDA.jl.

@maleadt
Member

maleadt commented Jan 6, 2025

The PR fails CI, and there's an outstanding comment of mine, so it needs work I'd say. Feel free to take it up if you want.

@maleadt
Member

maleadt commented May 15, 2026

Rebased. Depends on JuliaGPU/GPUArrays.jl#720 now.

maleadt added a commit to sharanry/CUDA.jl that referenced this pull request May 15, 2026
Generalizes the BLAS-optimized `norm`/`norm2` methods from `DenseCuArray`
to `StridedCuVecOrDenseMat`, so 1D strided subarray views also dispatch
to `nrm2`. Multi-dim non-contiguous views go through the sum-based
fallback in GPUArrays (which now dispatches on `AnyGPUArray`).

Resolves JuliaGPU#2280, replaces JuliaGPU#2302.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
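
The dispatch split described in this commit message can be illustrated with a CPU-only sketch. This is not the code from the PR: `norm2_sketch` and `RealBlasFloat` are hypothetical names, and Base's `StridedVector` with CPU `BLAS.nrm2` stands in for the CUDA array types and the cuBLAS `nrm2` call.

```julia
using LinearAlgebra

# Hypothetical element-type alias for this sketch only.
const RealBlasFloat = Union{Float32, Float64}

# 1D strided inputs (including strided views) go to BLAS nrm2,
# which takes the stride of the view into account.
norm2_sketch(x::StridedVector{<:RealBlasFloat}) = BLAS.nrm2(x)

# Everything else uses a sum-based fallback.
norm2_sketch(x::AbstractArray) = sqrt(sum(abs2, x))

x = collect(1.0:8.0)
v = @view x[1:2:end]                # 1D strided view: takes the nrm2 method

A = reshape(collect(1.0:16.0), 4, 4)
w = @view A[1:2:end, 1:2:end]       # 2D non-contiguous view: takes the fallback

println(norm2_sketch(v) ≈ norm(v))
println(norm2_sketch(w) ≈ norm(w))
```

Dispatching on the array type keeps the choice at the public entry point, which is the approach suggested earlier in this thread in preference to overriding an internal method.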
@maleadt maleadt force-pushed the sy/strided_norm2 branch from 8d4a91f to fa536f3 Compare May 15, 2026 17:41
@codecov

codecov Bot commented May 15, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@e264549). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| lib/cublas/src/linalg.jl | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2302   +/-   ##
=======================================
  Coverage        ?   16.39%           
=======================================
  Files           ?      124           
  Lines           ?     9827           
  Branches        ?        0           
=======================================
  Hits            ?     1611           
  Misses          ?     8216           
  Partials        ?        0           

@maleadt maleadt marked this pull request as ready for review May 15, 2026 19:30
@maleadt maleadt force-pushed the sy/strided_norm2 branch from fa536f3 to 156a51a Compare May 15, 2026 19:39
Contributor

@github-actions github-actions Bot left a comment

CUDA.jl Benchmarks

Details
Benchmark suite Current: 156a51a Previous: 2ace55f Ratio
array/accumulate/Float32/1d 101544 ns 100927 ns 1.01
array/accumulate/Float32/dims=1 76473 ns 77244 ns 0.99
array/accumulate/Float32/dims=1L 1585314 ns 1586192 ns 1.00
array/accumulate/Float32/dims=2 143576 ns 144286 ns 1.00
array/accumulate/Float32/dims=2L 657193.5 ns 658394 ns 1.00
array/accumulate/Int64/1d 118474 ns 118643 ns 1.00
array/accumulate/Int64/dims=1 79908 ns 79829 ns 1.00
array/accumulate/Int64/dims=1L 1706190.5 ns 1694383.5 ns 1.01
array/accumulate/Int64/dims=2 156617 ns 156179.5 ns 1.00
array/accumulate/Int64/dims=2L 961814 ns 961974 ns 1.00
array/broadcast 20524 ns 20478 ns 1.00
array/construct 1257.05 ns 1277.6 ns 0.98
array/copy 17916 ns 18044 ns 0.99
array/copyto!/cpu_to_gpu 213898 ns 215225 ns 0.99
array/copyto!/gpu_to_cpu 281245 ns 282660 ns 0.99
array/copyto!/gpu_to_gpu 10695 ns 10909 ns 0.98
array/iteration/findall/bool 134649 ns 134548 ns 1.00
array/iteration/findall/int 148368 ns 149686 ns 0.99
array/iteration/findfirst/bool 81151 ns 81331 ns 1.00
array/iteration/findfirst/int 83416.5 ns 83732 ns 1.00
array/iteration/findmin/1d 85341 ns 86429 ns 0.99
array/iteration/findmin/2d 114129.5 ns 117614.5 ns 0.97
array/iteration/logical 200491.5 ns 199724.5 ns 1.00
array/iteration/scalar 67792 ns 68306 ns 0.99
array/permutedims/2d 52319.5 ns 52646 ns 0.99
array/permutedims/3d 52444 ns 52872 ns 0.99
array/permutedims/4d 51338 ns 51484 ns 1.00
array/random/rand/Float32 13335 ns 12708 ns 1.05
array/random/rand/Int64 24379 ns 25198 ns 0.97
array/random/rand!/Float32 9851.666666666666 ns 8737.333333333334 ns 1.13
array/random/rand!/Int64 21090 ns 22022 ns 0.96
array/random/randn/Float32 43186 ns 42861 ns 1.01
array/random/randn!/Float32 30810 ns 30594 ns 1.01
array/reductions/mapreduce/Float32/1d 34712 ns 34442 ns 1.01
array/reductions/mapreduce/Float32/dims=1 40828 ns 39651 ns 1.03
array/reductions/mapreduce/Float32/dims=1L 51311 ns 51115 ns 1.00
array/reductions/mapreduce/Float32/dims=2 58180 ns 56288.5 ns 1.03
array/reductions/mapreduce/Float32/dims=2L 67774 ns 69317 ns 0.98
array/reductions/mapreduce/Int64/1d 42789.5 ns 42454.5 ns 1.01
array/reductions/mapreduce/Int64/dims=1 51811 ns 50760 ns 1.02
array/reductions/mapreduce/Int64/dims=1L 87260 ns 87043.5 ns 1.00
array/reductions/mapreduce/Int64/dims=2 60671.5 ns 59370.5 ns 1.02
array/reductions/mapreduce/Int64/dims=2L 84038 ns 84559 ns 0.99
array/reductions/reduce/Float32/1d 34945 ns 34787 ns 1.00
array/reductions/reduce/Float32/dims=1 39822.5 ns 39944 ns 1.00
array/reductions/reduce/Float32/dims=1L 51236 ns 51284 ns 1.00
array/reductions/reduce/Float32/dims=2 58238 ns 56294 ns 1.03
array/reductions/reduce/Float32/dims=2L 68142 ns 69595 ns 0.98
array/reductions/reduce/Int64/1d 42679 ns 42459 ns 1.01
array/reductions/reduce/Int64/dims=1 43466 ns 50053 ns 0.87
array/reductions/reduce/Int64/dims=1L 87194 ns 87114 ns 1.00
array/reductions/reduce/Int64/dims=2 60654 ns 59331.5 ns 1.02
array/reductions/reduce/Int64/dims=2L 84499.5 ns 84197 ns 1.00
array/reverse/1d 17902 ns 17694 ns 1.01
array/reverse/1dL 68542 ns 68202 ns 1.00
array/reverse/1dL_inplace 65769 ns 65756 ns 1.00
array/reverse/1d_inplace 8541.666666666666 ns 10156.166666666668 ns 0.84
array/reverse/2d 20901 ns 21001 ns 1.00
array/reverse/2dL 73023 ns 73024 ns 1.00
array/reverse/2dL_inplace 65914 ns 65634 ns 1.00
array/reverse/2d_inplace 9973 ns 11107 ns 0.90
array/sorting/1d 2734742 ns 2736491 ns 1.00
array/sorting/2d 1068254 ns 1070402 ns 1.00
array/sorting/by 3303672 ns 3304900 ns 1.00
cuda/synchronization/context/auto 1145.3 ns 1123.1 ns 1.02
cuda/synchronization/context/blocking 921.1923076923077 ns 902.5957446808511 ns 1.02
cuda/synchronization/context/nonblocking 7130.2 ns 7638.8 ns 0.93
cuda/synchronization/stream/auto 992.5625 ns 980.6666666666666 ns 1.01
cuda/synchronization/stream/blocking 833.6666666666666 ns 806.5 ns 1.03
cuda/synchronization/stream/nonblocking 7230.299999999999 ns 7187.8 ns 1.01
integration/byval/reference 143781 ns 143733 ns 1.00
integration/byval/slices=1 145763 ns 145971 ns 1.00
integration/byval/slices=2 284545 ns 284607 ns 1.00
integration/byval/slices=3 423071 ns 423028 ns 1.00
integration/cudadevrt 102317 ns 102291 ns 1.00
integration/volumerhs 23424198.5 ns 23455094 ns 1.00
kernel/indexing 13267 ns 13164 ns 1.01
kernel/indexing_checked 13841 ns 13822 ns 1.00
kernel/launch 2182.3333333333335 ns 2137 ns 1.02
kernel/occupancy 675.5125 ns 674.2788461538462 ns 1.00
kernel/rand 17207 ns 14157 ns 1.22
latency/import 3848898595 ns 3799013573 ns 1.01
latency/precompile 4628935725 ns 4593655026.5 ns 1.01
latency/ttfp 4456320918 ns 4367019984.5 ns 1.02

This comment was automatically generated by workflow using github-action-benchmark.

@maleadt maleadt merged commit 37d99a9 into JuliaGPU:main May 16, 2026
2 checks passed

Labels

cuda array: Stuff about CuArray.
good first issue: Good for newcomers
needs changes: Changes are needed.

Development

Successfully merging this pull request may close these issues.

2-norm for views of CuArray falls back to scalar indexing

3 participants