Skip to content

Add support for family- and architecture-specific features#3124

Open
AntonOresten wants to merge 16 commits into
JuliaGPU:mainfrom
AntonOresten:feature-set
Open

Add support for family- and architecture-specific features#3124
AntonOresten wants to merge 16 commits into
JuliaGPU:mainfrom
AntonOresten:feature-set

Conversation

@AntonOresten
Copy link
Copy Markdown
Contributor

@AntonOresten AntonOresten commented Apr 30, 2026

Adds support for family-specific (sm_NNNf) and architecture-specific (sm_NNNa) PTX targets, enabling access to a wider set of low-level instructions. Builds on #3120; PTXCompilerTarget has no field for the suffix, so we rewrite .target and pass the matching --gpu-name to ptxas, sidestepping LLVM's NVPTX coverage.

NVIDIA defines three feature sets:

  • baseline (no suffix, e.g. sm_90): forward-compatible.
  • family (f suffix, e.g. sm_100f): same-major-family-portable. Requires CC ≥ 10.0 and PTX ≥ 8.8.
  • architecture (a suffix, e.g. sm_90a): locked to one exact CC. Requires CC ≥ 9.0 and PTX ≥ 8.0.

with hierarchy baseline ⊆ family ⊆ architecture.

The previous behavior remains as the default through :baseline. To unblock wgmma, tcgen05, and friends, explicit opt-in is required through a new feature_set kwarg on cufunction, @cuda, etc. that takes a Symbol.

Relevant docs:

Supercedes #3122 (renamed branch; seems beyond rescue)

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CUDA.jl Benchmarks

Details
Benchmark suite Current: 1cedc63 Previous: 0bd53ac Ratio
array/accumulate/Float32/1d 101017 ns 100120 ns 1.01
array/accumulate/Float32/dims=1 76644 ns 75778 ns 1.01
array/accumulate/Float32/dims=1L 1586294 ns 1577976 ns 1.01
array/accumulate/Float32/dims=2 144293.5 ns 141092 ns 1.02
array/accumulate/Float32/dims=2L 658553.5 ns 653288 ns 1.01
array/accumulate/Int64/1d 118892 ns 117448 ns 1.01
array/accumulate/Int64/dims=1 80426 ns 79060 ns 1.02
array/accumulate/Int64/dims=1L 1695383 ns 1684513 ns 1.01
array/accumulate/Int64/dims=2 156810 ns 153292 ns 1.02
array/accumulate/Int64/dims=2L 962994 ns 959566 ns 1.00
array/broadcast 20664 ns 20150 ns 1.03
array/construct 1284.3 ns 1237.1 ns 1.04
array/copy 18199 ns 17027 ns 1.07
array/copyto!/cpu_to_gpu 217475 ns 215009 ns 1.01
array/copyto!/gpu_to_cpu 287145 ns 282762 ns 1.02
array/copyto!/gpu_to_gpu 11012 ns 10609 ns 1.04
array/iteration/findall/bool 135241 ns 132391 ns 1.02
array/iteration/findall/int 149099 ns 146933 ns 1.01
array/iteration/findfirst/bool 82094 ns 80656 ns 1.02
array/iteration/findfirst/int 84643 ns 81676 ns 1.04
array/iteration/findmin/1d 85238.5 ns 67518 ns 1.26
array/iteration/findmin/2d 114471 ns 112075 ns 1.02
array/iteration/logical 203286.5 ns 193238 ns 1.05
array/iteration/scalar 68624 ns 64768 ns 1.06
array/permutedims/2d 52323.5 ns 49922 ns 1.05
array/permutedims/3d 52569.5 ns 50310 ns 1.04
array/permutedims/4d 51751 ns 49840 ns 1.04
array/random/rand/Float32 12736 ns 11702 ns 1.09
array/random/rand/Int64 24752 ns 22663 ns 1.09
array/random/rand!/Float32 9981 ns 7893.333333333333 ns 1.26
array/random/rand!/Int64 21662 ns 18003 ns 1.20
array/random/randn/Float32 38916.5 ns 36437 ns 1.07
array/random/randn!/Float32 27603.5 ns 24279 ns 1.14
array/reductions/mapreduce/Float32/1d 34573 ns 34273 ns 1.01
array/reductions/mapreduce/Float32/dims=1 40101.5 ns 38242 ns 1.05
array/reductions/mapreduce/Float32/dims=1L 51208 ns 50540 ns 1.01
array/reductions/mapreduce/Float32/dims=2 58049 ns 55899 ns 1.04
array/reductions/mapreduce/Float32/dims=2L 67815 ns 67329 ns 1.01
array/reductions/mapreduce/Int64/1d 42907 ns 40432 ns 1.06
array/reductions/mapreduce/Int64/dims=1 42645 ns 41297 ns 1.03
array/reductions/mapreduce/Int64/dims=1L 87220 ns 86612 ns 1.01
array/reductions/mapreduce/Int64/dims=2 60970 ns 58190 ns 1.05
array/reductions/mapreduce/Int64/dims=2L 84526.5 ns 82637 ns 1.02
array/reductions/reduce/Float32/1d 34763 ns 34228 ns 1.02
array/reductions/reduce/Float32/dims=1 49574 ns 38235 ns 1.30
array/reductions/reduce/Float32/dims=1L 51443 ns 50426 ns 1.02
array/reductions/reduce/Float32/dims=2 58566 ns 55757 ns 1.05
array/reductions/reduce/Float32/dims=2L 68653 ns 67689 ns 1.01
array/reductions/reduce/Int64/1d 42918 ns 40556 ns 1.06
array/reductions/reduce/Int64/dims=1 52148 ns 41261 ns 1.26
array/reductions/reduce/Int64/dims=1L 87245 ns 86664 ns 1.01
array/reductions/reduce/Int64/dims=2 60850 ns 58234 ns 1.04
array/reductions/reduce/Int64/dims=2L 84046 ns 82724 ns 1.02
array/reverse/1d 17936.5 ns 15928 ns 1.13
array/reverse/1dL 68528 ns 67783 ns 1.01
array/reverse/1dL_inplace 65843 ns 65337 ns 1.01
array/reverse/1d_inplace 10385.333333333334 ns 8321 ns 1.25
array/reverse/2d 20883 ns 20259 ns 1.03
array/reverse/2dL 72880 ns 72140 ns 1.01
array/reverse/2dL_inplace 65901 ns 65256 ns 1.01
array/reverse/2d_inplace 10298 ns 9782 ns 1.05
array/sorting/1d 2736103 ns 2723355 ns 1.00
array/sorting/2d 1070242.5 ns 1067801 ns 1.00
array/sorting/by 3305342 ns 3303323 ns 1.00
cuda/synchronization/context/auto 1183.2 ns 1174.7 ns 1.01
cuda/synchronization/context/blocking 921.2162162162163 ns 933.71875 ns 0.99
cuda/synchronization/context/nonblocking 7714 ns 6032 ns 1.28
cuda/synchronization/stream/auto 1045.4 ns 1029.5 ns 1.02
cuda/synchronization/stream/blocking 839.8048780487804 ns 834.7297297297297 ns 1.01
cuda/synchronization/stream/nonblocking 7304.4 ns 5867.2 ns 1.24
integration/byval/reference 143899 ns 143373 ns 1.00
integration/byval/slices=1 145685 ns 145502 ns 1.00
integration/byval/slices=2 284492.5 ns 283967 ns 1.00
integration/byval/slices=3 422937 ns 422551 ns 1.00
integration/cudadevrt 102390 ns 101953 ns 1.00
integration/volumerhs 11296274.5 ns 9741344 ns 1.16
kernel/indexing 13435 ns 12949 ns 1.04
kernel/indexing_checked 14130 ns 13465 ns 1.05
kernel/launch 2198.4444444444443 ns 2139.222222222222 ns 1.03
kernel/occupancy 701.972972972973 ns 693.993670886076 ns 1.01
kernel/rand 16090 ns 16122 ns 1.00
latency/import 3847676258.5 ns 3844133780 ns 1.00
latency/precompile 4643091314 ns 4620836705 ns 1.00
latency/ttfp 4517908107.5 ns 4425510838 ns 1.02

This comment was automatically generated by workflow using github-action-benchmark.

@AntonOresten
Copy link
Copy Markdown
Contributor Author

AntonOresten commented May 1, 2026

While I am convinced :baseline as the default is correct for backward-compatibility (in CUDA.jl) and forward-compatiblity (in PTX?), @cuda feature_set=:architecture is quite verbose, especially considering @cuda only takes arguments on a single line. Perhaps it can be made to take blocks (see JuliaGPU/cuTile.jl#40 (comment)), or shortening :architecture -> :arch.

EDIT: Shortened :architecture to :arch.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 16.40%. Comparing base (8288a3a) to head (1faeb7e).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3124   +/-   ##
=======================================
  Coverage   16.40%   16.40%           
=======================================
  Files         124      124           
  Lines        9827     9827           
=======================================
  Hits         1612     1612           
  Misses       8215     8215           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@maleadt
Copy link
Copy Markdown
Member

maleadt commented May 13, 2026

It's unfortunate that this splits the knowledge about architectures in multiple places; before it was only in the databases of compatibility.jl.

Builds on #3120; PTXCompilerTarget has no field for the suffix, so we rewrite .target and pass the matching --gpu-name to ptxas, sidestepping LLVM's NVPTX coverage.

I'm not sure this is reasonable. NVPTX does actually have subtarget-specific emission paths, see e.g. hasConvertWithStochasticRounding: https://github.com/llvm/llvm-project/blob/477c59bb404af185976a23bc262dbeb09f788bff/llvm/lib/Target/NVPTX/NVPTXSubtarget.h#L305-L307

@AntonOresten
Copy link
Copy Markdown
Contributor Author

NVPTX does actually have subtarget-specific emission paths

Right, I see your concern:

  • LLVM is currently told e.g. sm_103, and emits PTX using only the baseline subtarget features. That PTX is, by superset, valid for sm_103a too.
  • .target rewrite + --gpu-name sm_103a: tells ptxas "treat this stream as sm_103a". ptxas then accepts arch-specific instructions (but they can only appear via inline asm, since LLVM didn't emit any) and assembles for that exact CC.

So this PR unblocks the inline-PTX use case, but to get the LLVM-side benefits we'd want to thread the feature set into PTXCompilerTarget so the TargetMachine is constructed with sm_103a, which flips hasArchAccelFeatures() and lets the guarded patterns match.

I started at the CUDA.jl level because my GPUCompiler.jl mental model isn't deep enough yet to confidently extend PTXCompilerTarget.

Happy to scope this PR to the inline-PTX case and file a follow-up against GPUCompiler.jl for the PTXCompilerTarget change.

It's unfortunate that this splits the knowledge about architectures in multiple places; before it was only in the databases of compatibility.jl.

The only suffix-aware code that ended up elsewhere is format_target (suffix -> sm_NNN[a|f] string) in compilation.jl. Could move that over too if it helps.

@maleadt
Copy link
Copy Markdown
Member

maleadt commented May 19, 2026

Here's the GPUCompiler.jl bit: JuliaGPU/GPUCompiler.jl#798

@maleadt maleadt self-assigned this May 19, 2026
@maleadt maleadt changed the title Add feature_set kwarg for selecting PTX target suffix Add support for family- and architecture-specific compute capabilities May 19, 2026
@maleadt
Copy link
Copy Markdown
Member

maleadt commented May 19, 2026

Reworked significantly. On the back-end side things are now properly passed down to GPUCompiler.jl, but I also tried to make things more idiomatic on the front-end side. There's a sm"..." string-macro now to create version numbers with a feature set attached to them, and we should correctly select architecture-specific capabilities automatically (which includes handling a bunch of corner cases).

Comment thread CUDACore/src/compiler/sm.jl Outdated
Comment thread CUDACore/src/compiler/sm.jl Outdated
Comment thread CUDACore/src/compiler/execution.jl Outdated
@AntonOresten
Copy link
Copy Markdown
Contributor Author

and we should correctly select architecture-specific capabilities automatically

I initially chose to be conservative, but you're sure this won't compromise any downstream assumptions / caches if it becomes the default?

@maleadt
Copy link
Copy Markdown
Member

maleadt commented May 19, 2026

I don't think so.

@AntonOresten AntonOresten changed the title Add support for family- and architecture-specific compute capabilities Add support for family- and architecture-specific features May 19, 2026
@maleadt
Copy link
Copy Markdown
Member

maleadt commented May 19, 2026

One other thing we need to consider is how to use this for conditional code paths (if we even want to support this conditionality at the host level). For example, e5m2 etc are supported by sm_120a, and not the baseline architecture. Right now we'd be doing if capability(device()) >= v"12.0" do ... end, which doesn't extend to this design because the a suffix is a codegen modifier, not a device property.

You can query it from within the kernel, branching on target_feature_set(), but for type support we can't delay the decision that long.

@AntonOresten
Copy link
Copy Markdown
Contributor Author

You can query it from within the kernel

So that'd be a device function like CUDACore.compute_capability?

if we ever want to support this conditionality at the host level

If the user is branching on compute capability, they're controlling the kernel-launching? So they should be able to control the feature set as well. Since the feature set is now tied to the compute capability in the cap or arch argument, how would I change the target to e.g. family-specific? arch=SMVersion(capability(device()), :family)?

@maleadt
Copy link
Copy Markdown
Member

maleadt commented May 19, 2026

You can query it from within the kernel

So that'd be a device function like CUDACore.compute_capability?

Yes, and it's already in here (supported by the GPUCompiler.jl change).

if we ever want to support this conditionality at the host level

If the user is branching on compute capability, they're controlling the kernel-launching? So they should be able to control the feature set as well. Since the feature set is now tied to the compute capability in the cap or arch argument, how would I change the target to e.g. family-specific? arch=SMVersion(capability(device()), :family)?

Yes, I'm not thinking of people doing @cuda, but us implementing abstractions in CUDA.jl. That's the case where you want to write code that switches kernels based on what's possible to emit. I guess we could add an accessor for the current compiler config, which would return a SMVersion as opposed to a simple VersionNumber representing the capability.

@AntonOresten
Copy link
Copy Markdown
Contributor Author

Are there any cases in which we don't want architecture-specific codegen, anyway? Linking my earlier comment

@maleadt
Copy link
Copy Markdown
Member

maleadt commented May 19, 2026

When the toolchain doesn't support it?

@maleadt
Copy link
Copy Markdown
Member

maleadt commented May 19, 2026

Actually, I guess we can already do:

  if runs_on(sm"120a", capability(device()))
      @cuda arch=sm"120a" specialized_kernel(x)
  else
      @cuda fallback_kernel(x)
  end

So won't add another abstraction for now.

[only docs]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants