Add support for family- and architecture-specific features #3124
AntonOresten wants to merge 16 commits into
CUDA.jl Benchmarks
| Benchmark suite | Current: 1cedc63 | Previous: 0bd53ac | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 101017 ns | 100120 ns | 1.01 |
| array/accumulate/Float32/dims=1 | 76644 ns | 75778 ns | 1.01 |
| array/accumulate/Float32/dims=1L | 1586294 ns | 1577976 ns | 1.01 |
| array/accumulate/Float32/dims=2 | 144293.5 ns | 141092 ns | 1.02 |
| array/accumulate/Float32/dims=2L | 658553.5 ns | 653288 ns | 1.01 |
| array/accumulate/Int64/1d | 118892 ns | 117448 ns | 1.01 |
| array/accumulate/Int64/dims=1 | 80426 ns | 79060 ns | 1.02 |
| array/accumulate/Int64/dims=1L | 1695383 ns | 1684513 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 156810 ns | 153292 ns | 1.02 |
| array/accumulate/Int64/dims=2L | 962994 ns | 959566 ns | 1.00 |
| array/broadcast | 20664 ns | 20150 ns | 1.03 |
| array/construct | 1284.3 ns | 1237.1 ns | 1.04 |
| array/copy | 18199 ns | 17027 ns | 1.07 |
| array/copyto!/cpu_to_gpu | 217475 ns | 215009 ns | 1.01 |
| array/copyto!/gpu_to_cpu | 287145 ns | 282762 ns | 1.02 |
| array/copyto!/gpu_to_gpu | 11012 ns | 10609 ns | 1.04 |
| array/iteration/findall/bool | 135241 ns | 132391 ns | 1.02 |
| array/iteration/findall/int | 149099 ns | 146933 ns | 1.01 |
| array/iteration/findfirst/bool | 82094 ns | 80656 ns | 1.02 |
| array/iteration/findfirst/int | 84643 ns | 81676 ns | 1.04 |
| array/iteration/findmin/1d | 85238.5 ns | 67518 ns | 1.26 |
| array/iteration/findmin/2d | 114471 ns | 112075 ns | 1.02 |
| array/iteration/logical | 203286.5 ns | 193238 ns | 1.05 |
| array/iteration/scalar | 68624 ns | 64768 ns | 1.06 |
| array/permutedims/2d | 52323.5 ns | 49922 ns | 1.05 |
| array/permutedims/3d | 52569.5 ns | 50310 ns | 1.04 |
| array/permutedims/4d | 51751 ns | 49840 ns | 1.04 |
| array/random/rand/Float32 | 12736 ns | 11702 ns | 1.09 |
| array/random/rand/Int64 | 24752 ns | 22663 ns | 1.09 |
| array/random/rand!/Float32 | 9981 ns | 7893.333333333333 ns | 1.26 |
| array/random/rand!/Int64 | 21662 ns | 18003 ns | 1.20 |
| array/random/randn/Float32 | 38916.5 ns | 36437 ns | 1.07 |
| array/random/randn!/Float32 | 27603.5 ns | 24279 ns | 1.14 |
| array/reductions/mapreduce/Float32/1d | 34573 ns | 34273 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1 | 40101.5 ns | 38242 ns | 1.05 |
| array/reductions/mapreduce/Float32/dims=1L | 51208 ns | 50540 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=2 | 58049 ns | 55899 ns | 1.04 |
| array/reductions/mapreduce/Float32/dims=2L | 67815 ns | 67329 ns | 1.01 |
| array/reductions/mapreduce/Int64/1d | 42907 ns | 40432 ns | 1.06 |
| array/reductions/mapreduce/Int64/dims=1 | 42645 ns | 41297 ns | 1.03 |
| array/reductions/mapreduce/Int64/dims=1L | 87220 ns | 86612 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=2 | 60970 ns | 58190 ns | 1.05 |
| array/reductions/mapreduce/Int64/dims=2L | 84526.5 ns | 82637 ns | 1.02 |
| array/reductions/reduce/Float32/1d | 34763 ns | 34228 ns | 1.02 |
| array/reductions/reduce/Float32/dims=1 | 49574 ns | 38235 ns | 1.30 |
| array/reductions/reduce/Float32/dims=1L | 51443 ns | 50426 ns | 1.02 |
| array/reductions/reduce/Float32/dims=2 | 58566 ns | 55757 ns | 1.05 |
| array/reductions/reduce/Float32/dims=2L | 68653 ns | 67689 ns | 1.01 |
| array/reductions/reduce/Int64/1d | 42918 ns | 40556 ns | 1.06 |
| array/reductions/reduce/Int64/dims=1 | 52148 ns | 41261 ns | 1.26 |
| array/reductions/reduce/Int64/dims=1L | 87245 ns | 86664 ns | 1.01 |
| array/reductions/reduce/Int64/dims=2 | 60850 ns | 58234 ns | 1.04 |
| array/reductions/reduce/Int64/dims=2L | 84046 ns | 82724 ns | 1.02 |
| array/reverse/1d | 17936.5 ns | 15928 ns | 1.13 |
| array/reverse/1dL | 68528 ns | 67783 ns | 1.01 |
| array/reverse/1dL_inplace | 65843 ns | 65337 ns | 1.01 |
| array/reverse/1d_inplace | 10385.333333333334 ns | 8321 ns | 1.25 |
| array/reverse/2d | 20883 ns | 20259 ns | 1.03 |
| array/reverse/2dL | 72880 ns | 72140 ns | 1.01 |
| array/reverse/2dL_inplace | 65901 ns | 65256 ns | 1.01 |
| array/reverse/2d_inplace | 10298 ns | 9782 ns | 1.05 |
| array/sorting/1d | 2736103 ns | 2723355 ns | 1.00 |
| array/sorting/2d | 1070242.5 ns | 1067801 ns | 1.00 |
| array/sorting/by | 3305342 ns | 3303323 ns | 1.00 |
| cuda/synchronization/context/auto | 1183.2 ns | 1174.7 ns | 1.01 |
| cuda/synchronization/context/blocking | 921.2162162162163 ns | 933.71875 ns | 0.99 |
| cuda/synchronization/context/nonblocking | 7714 ns | 6032 ns | 1.28 |
| cuda/synchronization/stream/auto | 1045.4 ns | 1029.5 ns | 1.02 |
| cuda/synchronization/stream/blocking | 839.8048780487804 ns | 834.7297297297297 ns | 1.01 |
| cuda/synchronization/stream/nonblocking | 7304.4 ns | 5867.2 ns | 1.24 |
| integration/byval/reference | 143899 ns | 143373 ns | 1.00 |
| integration/byval/slices=1 | 145685 ns | 145502 ns | 1.00 |
| integration/byval/slices=2 | 284492.5 ns | 283967 ns | 1.00 |
| integration/byval/slices=3 | 422937 ns | 422551 ns | 1.00 |
| integration/cudadevrt | 102390 ns | 101953 ns | 1.00 |
| integration/volumerhs | 11296274.5 ns | 9741344 ns | 1.16 |
| kernel/indexing | 13435 ns | 12949 ns | 1.04 |
| kernel/indexing_checked | 14130 ns | 13465 ns | 1.05 |
| kernel/launch | 2198.4444444444443 ns | 2139.222222222222 ns | 1.03 |
| kernel/occupancy | 701.972972972973 ns | 693.993670886076 ns | 1.01 |
| kernel/rand | 16090 ns | 16122 ns | 1.00 |
| latency/import | 3847676258.5 ns | 3844133780 ns | 1.00 |
| latency/precompile | 4643091314 ns | 4620836705 ns | 1.00 |
| latency/ttfp | 4517908107.5 ns | 4425510838 ns | 1.02 |

This comment was automatically generated by workflow using github-action-benchmark.
|
While I am convinced EDIT: Shortened |
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

Coverage Diff

| | main | #3124 | +/- |
|---|---|---|---|
| Coverage | 16.40% | 16.40% | |
| Files | 124 | 124 | |
| Lines | 9827 | 9827 | |
| Hits | 1612 | 1612 | |
| Misses | 8215 | 8215 | |

☔ View full report in Codecov by Sentry.
|
|
It's unfortunate that this splits the knowledge about architectures in multiple places; before it was only in the databases of
I'm not sure this is reasonable. NVPTX does actually have subtarget-specific emission paths, see e.g. |
Right, I see your concern:
So this PR unblocks the inline-PTX use case, but to get the LLVM-side benefits we'd want to thread the feature set into I started at the CUDA.jl level because my GPUCompiler.jl mental model isn't deep enough yet to confidently extend Happy to scope this PR to the inline-PTX case and file a follow-up against GPUCompiler.jl for the
The only suffix-aware code that ended up elsewhere is |
|
Here's the GPUCompiler.jl bit: JuliaGPU/GPUCompiler.jl#798 |
feature_set kwarg for selecting PTX target suffix|
Reworked significantly. On the back-end side things are now properly passed down to GPUCompiler.jl, but I also tried to make things more idiomatic on the front-end side. There's a |
I initially chose to be conservative, but you're sure this won't compromise any downstream assumptions / caches if it becomes the default? |
|
I don't think so. |
|
One other thing we need to consider is how to use this for conditional code paths (if we even want to support this conditionality at the host level). For example, You can query it from within the kernel, branching on |
So that'd be a device function like
If the user is branching on compute capability, they're controlling the kernel-launching? So they should be able to control the feature set as well. Since the feature set is now tied to the compute capability in the |
Yes, and it's already in here (supported by the GPUCompiler.jl change).
Yes, I'm not thinking of people doing |
|
Are there any cases in which we don't want architecture-specific codegen, anyway? Linking my earlier comment |
|
When the toolchain doesn't support it? |
|
Actually, I guess we can already do:

```julia
if runs_on(sm"120a", capability(device()))
    @cuda arch=sm"120a" specialized_kernel(x)
else
    @cuda fallback_kernel(x)
end
```

So won't add another abstraction for now. |
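
The host-side branch above depends on knowing which devices a given target can run on. A minimal sketch of those rules, as this PR's description states them (baseline is forward-compatible, family-specific is portable within the same major family, architecture-specific is locked to one exact compute capability), might look like the following. `Target`, `can_run`, and the symbols `:family` and `:architecture` are hypothetical names for illustration and are not the PR's `runs_on`; modeling a "family" as "same major version, same or newer minor" is a simplifying assumption.

```julia
# Hypothetical model of the compatibility rules; NOT the PR's `runs_on`.
struct Target
    cap::VersionNumber    # compute capability the code was compiled for
    feature_set::Symbol   # :baseline, :family, or :architecture (assumed names)
end

function can_run(target::Target, device_cap::VersionNumber)
    if target.feature_set === :baseline
        # Forward-compatible: any device at or above the target capability.
        return device_cap >= target.cap
    elseif target.feature_set === :family
        # Assumption: a family is modeled as the same major version,
        # with the same or a newer minor version.
        return device_cap.major == target.cap.major && device_cap >= target.cap
    elseif target.feature_set === :architecture
        # Locked to exactly one compute capability.
        return device_cap == target.cap
    else
        throw(ArgumentError("unknown feature set $(repr(target.feature_set))"))
    end
end

# Usage: decide on the host which kernel variant is safe to launch.
can_run(Target(v"12.0", :architecture), v"12.0")  # exact match required
can_run(Target(v"9.0", :baseline), v"12.0")       # baseline code runs on newer devices
```

Under this model, an architecture-specific build is never a safe default for an unknown device, which is why the snippet above keeps a baseline fallback kernel.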
Adds support for family-specific (`sm_NNNf`) and architecture-specific (`sm_NNNa`) PTX targets, enabling access to a wider set of low-level instructions. Builds on #3120; `PTXCompilerTarget` has no field for the suffix, so we rewrite `.target` and pass the matching `--gpu-name` to `ptxas`, sidestepping LLVM's NVPTX coverage.

NVIDIA defines three feature sets:

- Baseline (no suffix, e.g. `sm_90`): forward-compatible.
- Family-specific (`f` suffix, e.g. `sm_100f`): same-major-family-portable. Requires CC ≥ 10.0 and PTX ≥ 8.8.
- Architecture-specific (`a` suffix, e.g. `sm_90a`): locked to one exact CC. Requires CC ≥ 9.0 and PTX ≥ 8.0.

with hierarchy baseline ⊆ family ⊆ architecture.

The previous behavior remains the default through `:baseline`. To unblock `wgmma`, `tcgen05`, and friends, explicit opt-in is required through a new `feature_set` kwarg on `cufunction`, `@cuda`, etc. that takes a `Symbol`.

Relevant docs:

Supersedes #3122 (renamed branch; seems beyond rescue)
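
As a rough illustration of the naming scheme and version requirements listed in this description, the sketch below builds a target name such as `sm_90a` from a compute capability, a PTX ISA version, and a feature-set symbol. It is not CUDA.jl's implementation: `target_name`, `SUFFIXES`, `MIN_REQUIREMENTS`, and the symbols `:family` and `:architecture` are hypothetical names chosen for the example (only `:baseline` appears in the description), and the baseline entry deliberately models no extra requirement.

```julia
# Hypothetical helper, NOT CUDA.jl's API: maps a feature set to a PTX target name.
const SUFFIXES = Dict(:baseline => "", :family => "f", :architecture => "a")

# Minimum compute capability and PTX ISA version per feature set,
# taken from the PR description; the baseline row models no extra requirement.
const MIN_REQUIREMENTS = Dict(
    :baseline     => (cap = v"0.0",  ptx = v"0.0"),
    :family       => (cap = v"10.0", ptx = v"8.8"),
    :architecture => (cap = v"9.0",  ptx = v"8.0"),
)

function target_name(cap::VersionNumber, ptx::VersionNumber,
                     feature_set::Symbol = :baseline)
    haskey(SUFFIXES, feature_set) ||
        throw(ArgumentError("unknown feature set $(repr(feature_set))"))
    req = MIN_REQUIREMENTS[feature_set]
    cap >= req.cap ||
        throw(ArgumentError("$feature_set targets need compute capability ≥ $(req.cap), got $cap"))
    ptx >= req.ptx ||
        throw(ArgumentError("$feature_set targets need PTX ISA ≥ $(req.ptx), got $ptx"))
    # e.g. v"9.0" with :architecture gives "sm_90a"
    return "sm_$(cap.major)$(cap.minor)$(SUFFIXES[feature_set])"
end

# Usage
target_name(v"9.0", v"8.0", :architecture)   # "sm_90a"
target_name(v"10.0", v"8.8", :family)        # "sm_100f"
target_name(v"9.0", v"8.0")                  # "sm_90" (default :baseline)
```

Validating up front gives a readable error on the host instead of a failure later in `ptxas`, for example when a family-specific target is requested for a CC 9.0 device.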