Add support for family- and architecture-specific features by AntonOresten · Pull Request #3124 · JuliaGPU/CUDA.jl

AntonOresten · 2026-04-30T20:06:05Z

Adds support for family-specific (sm_NNNf) and architecture-specific (sm_NNNa) PTX targets, enabling access to a wider set of low-level instructions. Builds on #3120; PTXCompilerTarget has no field for the suffix, so we rewrite .target and pass the matching --gpu-name to ptxas, sidestepping LLVM's NVPTX coverage.

NVIDIA defines three feature sets:

baseline (no suffix, e.g. sm_90): forward-compatible.
family (f suffix, e.g. sm_100f): same-major-family-portable. Requires CC ≥ 10.0 and PTX ≥ 8.8.
architecture (a suffix, e.g. sm_90a): locked to one exact CC. Requires CC ≥ 9.0 and PTX ≥ 8.0.

with hierarchy baseline ⊆ family ⊆ architecture.

The previous behavior remains as the default through :baseline. To unblock wgmma, tcgen05, and friends, explicit opt-in is required through a new feature_set kwarg on cufunction, @cuda, etc. that takes a Symbol.

Relevant docs:

Supercedes #3122 (renamed branch; seems beyond rescue)

github-actions

CUDA.jl Benchmarks

Details

Benchmark suite	Current: `1cedc63`	Previous: `0bd53ac`	Ratio
`array/accumulate/Float32/1d`	`101017` ns	`100120` ns	`1.01`
`array/accumulate/Float32/dims=1`	`76644` ns	`75778` ns	`1.01`
`array/accumulate/Float32/dims=1L`	`1586294` ns	`1577976` ns	`1.01`
`array/accumulate/Float32/dims=2`	`144293.5` ns	`141092` ns	`1.02`
`array/accumulate/Float32/dims=2L`	`658553.5` ns	`653288` ns	`1.01`
`array/accumulate/Int64/1d`	`118892` ns	`117448` ns	`1.01`
`array/accumulate/Int64/dims=1`	`80426` ns	`79060` ns	`1.02`
`array/accumulate/Int64/dims=1L`	`1695383` ns	`1684513` ns	`1.01`
`array/accumulate/Int64/dims=2`	`156810` ns	`153292` ns	`1.02`
`array/accumulate/Int64/dims=2L`	`962994` ns	`959566` ns	`1.00`
`array/broadcast`	`20664` ns	`20150` ns	`1.03`
`array/construct`	`1284.3` ns	`1237.1` ns	`1.04`
`array/copy`	`18199` ns	`17027` ns	`1.07`
`array/copyto!/cpu_to_gpu`	`217475` ns	`215009` ns	`1.01`
`array/copyto!/gpu_to_cpu`	`287145` ns	`282762` ns	`1.02`
`array/copyto!/gpu_to_gpu`	`11012` ns	`10609` ns	`1.04`
`array/iteration/findall/bool`	`135241` ns	`132391` ns	`1.02`
`array/iteration/findall/int`	`149099` ns	`146933` ns	`1.01`
`array/iteration/findfirst/bool`	`82094` ns	`80656` ns	`1.02`
`array/iteration/findfirst/int`	`84643` ns	`81676` ns	`1.04`
`array/iteration/findmin/1d`	`85238.5` ns	`67518` ns	`1.26`
`array/iteration/findmin/2d`	`114471` ns	`112075` ns	`1.02`
`array/iteration/logical`	`203286.5` ns	`193238` ns	`1.05`
`array/iteration/scalar`	`68624` ns	`64768` ns	`1.06`
`array/permutedims/2d`	`52323.5` ns	`49922` ns	`1.05`
`array/permutedims/3d`	`52569.5` ns	`50310` ns	`1.04`
`array/permutedims/4d`	`51751` ns	`49840` ns	`1.04`
`array/random/rand/Float32`	`12736` ns	`11702` ns	`1.09`
`array/random/rand/Int64`	`24752` ns	`22663` ns	`1.09`
`array/random/rand!/Float32`	`9981` ns	`7893.333333333333` ns	`1.26`
`array/random/rand!/Int64`	`21662` ns	`18003` ns	`1.20`
`array/random/randn/Float32`	`38916.5` ns	`36437` ns	`1.07`
`array/random/randn!/Float32`	`27603.5` ns	`24279` ns	`1.14`
`array/reductions/mapreduce/Float32/1d`	`34573` ns	`34273` ns	`1.01`
`array/reductions/mapreduce/Float32/dims=1`	`40101.5` ns	`38242` ns	`1.05`
`array/reductions/mapreduce/Float32/dims=1L`	`51208` ns	`50540` ns	`1.01`
`array/reductions/mapreduce/Float32/dims=2`	`58049` ns	`55899` ns	`1.04`
`array/reductions/mapreduce/Float32/dims=2L`	`67815` ns	`67329` ns	`1.01`
`array/reductions/mapreduce/Int64/1d`	`42907` ns	`40432` ns	`1.06`
`array/reductions/mapreduce/Int64/dims=1`	`42645` ns	`41297` ns	`1.03`
`array/reductions/mapreduce/Int64/dims=1L`	`87220` ns	`86612` ns	`1.01`
`array/reductions/mapreduce/Int64/dims=2`	`60970` ns	`58190` ns	`1.05`
`array/reductions/mapreduce/Int64/dims=2L`	`84526.5` ns	`82637` ns	`1.02`
`array/reductions/reduce/Float32/1d`	`34763` ns	`34228` ns	`1.02`
`array/reductions/reduce/Float32/dims=1`	`49574` ns	`38235` ns	`1.30`
`array/reductions/reduce/Float32/dims=1L`	`51443` ns	`50426` ns	`1.02`
`array/reductions/reduce/Float32/dims=2`	`58566` ns	`55757` ns	`1.05`
`array/reductions/reduce/Float32/dims=2L`	`68653` ns	`67689` ns	`1.01`
`array/reductions/reduce/Int64/1d`	`42918` ns	`40556` ns	`1.06`
`array/reductions/reduce/Int64/dims=1`	`52148` ns	`41261` ns	`1.26`
`array/reductions/reduce/Int64/dims=1L`	`87245` ns	`86664` ns	`1.01`
`array/reductions/reduce/Int64/dims=2`	`60850` ns	`58234` ns	`1.04`
`array/reductions/reduce/Int64/dims=2L`	`84046` ns	`82724` ns	`1.02`
`array/reverse/1d`	`17936.5` ns	`15928` ns	`1.13`
`array/reverse/1dL`	`68528` ns	`67783` ns	`1.01`
`array/reverse/1dL_inplace`	`65843` ns	`65337` ns	`1.01`
`array/reverse/1d_inplace`	`10385.333333333334` ns	`8321` ns	`1.25`
`array/reverse/2d`	`20883` ns	`20259` ns	`1.03`
`array/reverse/2dL`	`72880` ns	`72140` ns	`1.01`
`array/reverse/2dL_inplace`	`65901` ns	`65256` ns	`1.01`
`array/reverse/2d_inplace`	`10298` ns	`9782` ns	`1.05`
`array/sorting/1d`	`2736103` ns	`2723355` ns	`1.00`
`array/sorting/2d`	`1070242.5` ns	`1067801` ns	`1.00`
`array/sorting/by`	`3305342` ns	`3303323` ns	`1.00`
`cuda/synchronization/context/auto`	`1183.2` ns	`1174.7` ns	`1.01`
`cuda/synchronization/context/blocking`	`921.2162162162163` ns	`933.71875` ns	`0.99`
`cuda/synchronization/context/nonblocking`	`7714` ns	`6032` ns	`1.28`
`cuda/synchronization/stream/auto`	`1045.4` ns	`1029.5` ns	`1.02`
`cuda/synchronization/stream/blocking`	`839.8048780487804` ns	`834.7297297297297` ns	`1.01`
`cuda/synchronization/stream/nonblocking`	`7304.4` ns	`5867.2` ns	`1.24`
`integration/byval/reference`	`143899` ns	`143373` ns	`1.00`
`integration/byval/slices=1`	`145685` ns	`145502` ns	`1.00`
`integration/byval/slices=2`	`284492.5` ns	`283967` ns	`1.00`
`integration/byval/slices=3`	`422937` ns	`422551` ns	`1.00`
`integration/cudadevrt`	`102390` ns	`101953` ns	`1.00`
`integration/volumerhs`	`11296274.5` ns	`9741344` ns	`1.16`
`kernel/indexing`	`13435` ns	`12949` ns	`1.04`
`kernel/indexing_checked`	`14130` ns	`13465` ns	`1.05`
`kernel/launch`	`2198.4444444444443` ns	`2139.222222222222` ns	`1.03`
`kernel/occupancy`	`701.972972972973` ns	`693.993670886076` ns	`1.01`
`kernel/rand`	`16090` ns	`16122` ns	`1.00`
`latency/import`	`3847676258.5` ns	`3844133780` ns	`1.00`
`latency/precompile`	`4643091314` ns	`4620836705` ns	`1.00`
`latency/ttfp`	`4517908107.5` ns	`4425510838` ns	`1.02`

This comment was automatically generated by workflow using github-action-benchmark.

AntonOresten · 2026-05-01T11:32:00Z

While I am convinced :baseline as the default is correct for backward-compatibility (in CUDA.jl) and forward-compatiblity (in PTX?), @cuda feature_set=:architecture is quite verbose, especially considering @cuda only takes arguments on a single line. Perhaps it can be made to take blocks (see JuliaGPU/cuTile.jl#40 (comment)), or shortening :architecture -> :arch.

EDIT: Shortened :architecture to :arch.

codecov · 2026-05-01T17:26:43Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 16.40%. Comparing base (8288a3a) to head (1faeb7e).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3124   +/-   ##
=======================================
  Coverage   16.40%   16.40%           
=======================================
  Files         124      124           
  Lines        9827     9827           
=======================================
  Hits         1612     1612           
  Misses       8215     8215

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

maleadt · 2026-05-13T15:38:26Z

It's unfortunate that this splits the knowledge about architectures in multiple places; before it was only in the databases of compatibility.jl.

Builds on #3120; PTXCompilerTarget has no field for the suffix, so we rewrite .target and pass the matching --gpu-name to ptxas, sidestepping LLVM's NVPTX coverage.

I'm not sure this is reasonable. NVPTX does actually have subtarget-specific emission paths, see e.g. hasConvertWithStochasticRounding: https://github.com/llvm/llvm-project/blob/477c59bb404af185976a23bc262dbeb09f788bff/llvm/lib/Target/NVPTX/NVPTXSubtarget.h#L305-L307

AntonOresten · 2026-05-13T17:02:50Z

NVPTX does actually have subtarget-specific emission paths

Right, I see your concern:

LLVM is currently told e.g. sm_103, and emits PTX using only the baseline subtarget features. That PTX is, by superset, valid for sm_103a too.
.target rewrite + --gpu-name sm_103a: tells ptxas "treat this stream as sm_103a". ptxas then accepts arch-specific instructions (but they can only appear via inline asm, since LLVM didn't emit any) and assembles for that exact CC.

So this PR unblocks the inline-PTX use case, but to get the LLVM-side benefits we'd want to thread the feature set into PTXCompilerTarget so the TargetMachine is constructed with sm_103a, which flips hasArchAccelFeatures() and lets the guarded patterns match.

I started at the CUDA.jl level because my GPUCompiler.jl mental model isn't deep enough yet to confidently extend PTXCompilerTarget.

Happy to scope this PR to the inline-PTX case and file a follow-up against GPUCompiler.jl for the PTXCompilerTarget change.

It's unfortunate that this splits the knowledge about architectures in multiple places; before it was only in the databases of compatibility.jl.

The only suffix-aware code that ended up elsewhere is format_target (suffix -> sm_NNN[a|f] string) in compilation.jl. Could move that over too if it helps.

…type

maleadt · 2026-05-19T05:03:34Z

Here's the GPUCompiler.jl bit: JuliaGPU/GPUCompiler.jl#798

maleadt · 2026-05-19T11:09:50Z

Reworked significantly. On the back-end side things are now properly passed down to GPUCompiler.jl, but I also tried to make things more idiomatic on the front-end side. There's a sm"..." string-macro now to create version numbers with a feature set attached to them, and we should correctly select architecture-specific capabilities automatically (which includes handling a bunch of corner cases).

AntonOresten · 2026-05-19T12:11:58Z

and we should correctly select architecture-specific capabilities automatically

I initially chose to be conservative, but you're sure this won't compromise any downstream assumptions / caches if it becomes the default?

maleadt · 2026-05-19T12:41:48Z

I don't think so.

maleadt · 2026-05-19T15:42:56Z

One other thing we need to consider is how to use this for conditional code paths (if we even want to support this conditionality at the host level). For example, e5m2 etc are supported by sm_120a, and not the baseline architecture. Right now we'd be doing if capability(device()) >= v"12.0" do ... end, which doesn't extend to this design because the a suffix is a codegen modifier, not a device property.

You can query it from within the kernel, branching on target_feature_set(), but for type support we can't delay the decision that long.

AntonOresten · 2026-05-19T16:25:42Z

You can query it from within the kernel

So that'd be a device function like CUDACore.compute_capability?

if we ever want to support this conditionality at the host level

If the user is branching on compute capability, they're controlling the kernel-launching? So they should be able to control the feature set as well. Since the feature set is now tied to the compute capability in the cap or arch argument, how would I change the target to e.g. family-specific? arch=SMVersion(capability(device()), :family)?

maleadt · 2026-05-19T16:30:40Z

You can query it from within the kernel

So that'd be a device function like CUDACore.compute_capability?

Yes, and it's already in here (supported by the GPUCompiler.jl change).

if we ever want to support this conditionality at the host level

If the user is branching on compute capability, they're controlling the kernel-launching? So they should be able to control the feature set as well. Since the feature set is now tied to the compute capability in the cap or arch argument, how would I change the target to e.g. family-specific? arch=SMVersion(capability(device()), :family)?

Yes, I'm not thinking of people doing @cuda, but us implementing abstractions in CUDA.jl. That's the case where you want to write code that switches kernels based on what's possible to emit. I guess we could add an accessor for the current compiler config, which would return a SMVersion as opposed to a simple VersionNumber representing the capability.

AntonOresten · 2026-05-19T16:39:16Z

Are there any cases in which we don't want architecture-specific codegen, anyway? Linking my earlier comment

maleadt · 2026-05-19T17:31:28Z

When the toolchain doesn't support it?

maleadt · 2026-05-19T19:58:35Z

Actually, I guess we can already do:

  if runs_on(sm"120a", capability(device()))
      @cuda arch=sm"120a" specialized_kernel(x)
  else
      @cuda fallback_kernel(x)
  end

So won't add another abstraction for now.

[only docs]

github-actions Bot reviewed Apr 30, 2026

View reviewed changes

AntonOresten force-pushed the feature-set branch from 2f3cea9 to a5bbd52 Compare May 1, 2026 15:42

AntonOresten force-pushed the feature-set branch from a5bbd52 to bcf825b Compare May 11, 2026 15:58

maleadt force-pushed the main branch from ce29a43 to d3fe605 Compare May 14, 2026 10:31

AntonOresten added 5 commits May 16, 2026 15:32

Add architecture- and family-specific PTX target support

8d65f63

Remove kind compiler config kwarg

9f254a2

Add feature_set compiler config kwarg; use symbols instead of enum …

440708d

…type

Require opt-in through feature_set to avoid breaking changes

74e8ddd

Shorten :architecture to :arch

1faeb7e

AntonOresten force-pushed the feature-set branch from bcf825b to 1faeb7e Compare May 16, 2026 13:32

maleadt added 3 commits May 19, 2026 08:38

Unify version databases again using a sm string.

fbfb253

Rework device compatibility handling.

9450ef7

Report on detected compiler support.

5e38582

maleadt self-assigned this May 19, 2026

maleadt added 2 commits May 19, 2026 12:25

Bump GPUCompiler.

2158f42

Improve test.

fba88cd

maleadt force-pushed the feature-set branch from 15841f9 to fba88cd Compare May 19, 2026 10:58

maleadt changed the title ~~Add feature_set kwarg for selecting PTX target suffix~~ Add support for family- and architecture-specific compute capabilities May 19, 2026

AntonOresten commented May 19, 2026

View reviewed changes

Comment thread CUDACore/src/compiler/sm.jl Outdated

Comment thread CUDACore/src/compiler/sm.jl Outdated

Comment thread CUDACore/src/compiler/execution.jl Outdated

maleadt added 3 commits May 19, 2026 14:53

Add docs.

8250be8

Switch to a more NVIDIA-like SM string.

aacd3b4

More cleanly separate device capability and codegen architecture.

17bd4c8

AntonOresten changed the title ~~Add support for family- and architecture-specific compute capabilities~~ Add support for family- and architecture-specific features May 19, 2026

maleadt added 2 commits May 19, 2026 16:13

Also support string-based version numbers with sm_ prefixes.

aa9ae2e

Drop overly specific tests.

1cedc63

Fix docs.

cf5b4b6

[only docs]

Conversation

AntonOresten commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

CUDA.jl Benchmarks

Uh oh!

AntonOresten commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

maleadt commented May 13, 2026

Uh oh!

AntonOresten commented May 13, 2026

Uh oh!

maleadt commented May 19, 2026

Uh oh!

maleadt commented May 19, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

AntonOresten commented May 19, 2026

Uh oh!

maleadt commented May 19, 2026

Uh oh!

maleadt commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AntonOresten commented May 19, 2026

Uh oh!

maleadt commented May 19, 2026

Uh oh!

AntonOresten commented May 19, 2026

Uh oh!

maleadt commented May 19, 2026

Uh oh!

maleadt commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

AntonOresten commented Apr 30, 2026 •

edited

Loading

github-actions Bot left a comment •

edited

Loading

AntonOresten commented May 1, 2026 •

edited

Loading

codecov Bot commented May 1, 2026 •

edited

Loading

maleadt commented May 19, 2026 •

edited

Loading