WebGPU GPU compiler backend improvements (and small CUDA) #565
Conversation
NOTE: I just rebased onto current master, which means this PR is now out of date to the […]. Once the first PR is merged, I'll rebase onto master again to cut that down. Don't want to rebase the other PR needlessly now.
/gemini summary |
Summary of Changes

This pull request substantially advances the WebGPU compiler backend, enhancing its functionality and addressing specification compliance. Concurrently, it refines the CUDA backend to mirror these architectural improvements, particularly in how code and types are managed and included. These changes aim to foster greater flexibility in writing code that can execute seamlessly across both CPU and GPU environments.
/gemini review
Code Review
This is an impressive pull request that significantly improves the GPU compiler backends. The architectural changes, like introducing a preprocess pass for both WGSL and CUDA, bring great consistency. The support for using Nim generic functions in GPU code and automatically pulling in necessary functions and types is a major step forward for usability and code reuse.
The refactoring to make field operations generic is clean and well-executed. The numerous additions and fixes for the WGSL backend, especially the workarounds for its limitations, are clever and show a deep understanding of the target platform. The CUDA backend also benefits from the structural alignment and new features like debug symbol generation.
I have one suggestion regarding the use of constexpr vs __constant__ in the CUDA backend to fully realize the performance benefits of constant memory.
```diff
 of gpuConstexpr:
   ## TODO: We need to change the code such that we emit `constexpr` inside of procs and
   ## `__constant__` outside of procs. The point is we want to support mapping to `__constant__`
   ## for `const foo = bar` Nim declarations to evaluate values at Nim's compile time.
   ## Alternatively, make user write `const foo {.constant.} = bar` to produce a global
   ## `__constant__` value.
   if ast.cType.kind == gtArray:
-    result = indentStr & "__constant__ " & gpuTypeToString(ast.cType, ctx.genCuda(ast.cIdent)) & " = " & ctx.genCuda(ast.cValue)
+    result = indentStr & "constexpr " & gpuTypeToString(ast.cType, ctx.genCuda(ast.cIdent)) & " = " & ctx.genCuda(ast.cValue)
   else:
-    result = indentStr & "__constant__ " & gpuTypeToString(ast.cType, allowEmptyIdent = true) & " " & ctx.genCuda(ast.cIdent) & " = " & ctx.genCuda(ast.cValue)
+    result = indentStr & "constexpr " & gpuTypeToString(ast.cType, allowEmptyIdent = true) & " " & ctx.genCuda(ast.cIdent) & " = " & ctx.genCuda(ast.cValue)
```
While the change from __constant__ to constexpr allows constants inside functions, it currently doesn't leverage __constant__ memory for global constants, which can offer performance benefits. The TODO comment indicates you're aware of this.
To implement this, you could introduce a flag to genCuda, for example inGlobalScope: bool, to distinguish between global and function scopes. The top-level codegen function could then control this flag:
```nim
# in codegen*
for blk in ctx.globalBlocks:
  result.add ctx.genCuda(blk, inGlobalScope = true) & ";\n\n"
# ...
for fnIdent, fn in ctx.fnTab:
  result.add ctx.genCuda(fn, inGlobalScope = false) & "\n\n"
```

Inside `genCuda` for `gpuProc`, you would then propagate `inGlobalScope = false` when generating the function body. This would allow `gpuConstexpr` to switch between `__constant__` and `constexpr` based on the scope.
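To make the intended mapping concrete, here is a small sketch (the Nim declarations and the `{.device.}` pragma follow the style used elsewhere in this PR; the CUDA spellings in the comments are only the intended emission under such a scope flag, not what the current code produces):

```nim
# Global scope: would be emitted as `__constant__` (device constant memory).
const roundOffsets = [1'u32, 2'u32, 3'u32]

proc shifted(i: int): uint32 {.device.} =
  # Proc scope: would be emitted as `constexpr`.
  const bias = 10'u32
  roundOffsets[i] + bias
```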
This needs a rebase following #576
The type of the identifier is now always `gtVoid`, so the previous check `not idTyp.isNil` does not work any longer.
This is only for the optional helper used on the WebGPU backend, where we try to determine the type in an infix expression. However, for some arguments to the infix this is not uniquely possible. I.e. we might encounter `SomeInteger`, which is not a unique type. In this case we just fall back to not assigning a known type.
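For illustration, a hypothetical generic proc (not from this PR) where one infix argument only carries a typeclass rather than a concrete type:

```nim
proc scaleBy[T: SomeUnsignedInt](x: T, shift: SomeInteger): T {.device.} =
  # `shift` is only known to be `SomeInteger` here, which is not a unique type,
  # so the WebGPU type helper falls back to "no known type" for that argument.
  x shl shift
```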
The `gtUA` type enum element is new and was not correctly handled yet on the CUDA backend.
I.e. `x += 5` becomes `x = x + 5` etc. For any operator `foo=`, in `x foo= y` we generate `x = x foo y`.
We need that type to determine information about what type we actually assign in a `gpuObjConstr`.
Otherwise we don't have up-to-date type / symbol kind information in globals.
As structs only support 'constructible types' in their fields anyway, we can just default-initialize all fields the user leaves out in an object constructor.
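A small illustration of what this allows (hypothetical type, written against the `{.device.}` style used in this PR):

```nim
type Point3 = object
  x, y, z: float32

proc shiftX(dx: float32): Point3 {.device.} =
  # Only `x` is given; `y` and `z` are default-initialized by the backend,
  # mirroring what Nim does on the CPU.
  Point3(x: dx)
```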
Those are intended for runtime constants, i.e. further storage buffers in the context of WGSL. In CUDA we'd use `copyToSymbol` to copy the data to them before execution.
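Roughly, such a runtime constant looks like the following sketch (the `var foo {.constant.}` form is the one referred to in the review thread below; the array type and names are made up for illustration):

```nim
# On WGSL this becomes an additional storage buffer filled before execution;
# on CUDA one would `copyToSymbol` the data into it before launching.
var roundConstants {.constant.}: array[8, uint32]

proc mixIn(acc: uint32, i: int): uint32 {.device.} =
  acc xor roundConstants[i]
```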
This is not perfect yet (I think). Needs some tests for different situations. Works in practice for what I've used it for at least.
E.g. `[]` is not a sensible name on GPU backends. We rename it to `get` for example. Note that the binary operators likely never actually appear there. We need to handle those via `gpuBinOp` (and call `maybePatchFnName` from there).
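For instance (hypothetical type; the `[]` -> `get` renaming is what the commit describes):

```nim
type Digits = object
  limbs: array[4, uint32]

# `[]` is not a valid function name on the GPU backends, so a proc like this
# is emitted under a patched name such as `get`.
proc `[]`(d: Digits, i: int): uint32 {.device.} =
  d.limbs[i]
```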
NOTE: We'll change the `maybeAddType` code in the future to be done in `nimToGpuType` directly. However, at the moment that produces a bit too much required change with having to add `GpuContext` to a whole bunch of functions that all call `nimToGpuType` internally.
Previously, because we parse procs whenever they are called, we would infinitely recurse in the parsing logic for recursive procs. We now record the function signature and function identifier so that we can avoid that. We need the function signature to get information about the return type before we actually start parsing a function. Otherwise, _inside_ of the recursive function we wouldn't be able to determine the return type at the callsite of the recursive call (the initial parse hasn't been completed at that point yet, which is what fills the proc into `allFnTab`).
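A minimal example of the kind of proc that used to recurse forever in the parser (hypothetical function, using the `{.device.}` pragma from this PR):

```nim
proc sumTo(n: uint32): uint32 {.device.} =
  # The recursive call below requires the return type of `sumTo` to be known
  # already, even though parsing of `sumTo` itself has not finished. Recording
  # the signature up front makes that possible.
  if n == 0: 0'u32
  else: n + sumTo(n - 1)
```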
In a debug build (`-d:debugCuda`) modular addition produced off-by-one errors. As it turned out, the problem was our calculation of `overflowedLimbs` inside of `finalSubMayOverflow`. The carry flag set by the last `add_cio` call in `modadd` does not reliably survive into the function call in a debug build. We now compute it directly after the last `add_cio` call in `modadd` and simply pass it as an argument. This way the arithmetic also works reliably in a debug build.
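A single-limb sketch of the pattern (hypothetical helpers, not the real `modadd` / `finalSubMayOverflow` code): the caller computes the overflow right after the final addition and passes it explicitly, rather than the callee re-reading a carry flag that may not survive the call in a debug build.

```nim
proc finalSub(r: var uint32, m: uint32, overflowed: uint32) {.device.} =
  # Uses the explicitly passed overflow instead of a global carry flag.
  if overflowed != 0 or r >= m:
    r = r - m

proc modaddSketch(a, b, m: uint32): uint32 {.device.} =
  var r = a + b                   # wraps around on overflow
  let overflowed = uint32(r < a)  # computed directly after the addition
  finalSub(r, m, overflowed)      # handed over as an argument
  r
```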
Nim float literals, when converted to strings, already come with an `f` suffix.
The idea is that if the arguments are not basic types, we will need a
custom function to perform the infix operation.
Most backends, however, do not support custom operators, so actual
`gpuBinOp` nodes are not valid for custom types in general. Hence, we
rewrite them as `gpuCall` nodes with non-symbol based naming.
NOTE: Currently this does not handle the case where we might use an
inbuilt type like `vec3` and its implementation of infix operators
that _may_ be defined after all.
We need to find an elegant solution for that: either by checking if
the arguments are of basic types (like in this commit) or if they are of a
type annotated with `{.builtin.}`.
Alternatively, we could force the user to define operators for such
inbuilt types (i.e. wrap them) and then if there is a wrapper that is
marked `{.builtin.}` we don't replace infix by `gpuCall` either.
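As a concrete (hypothetical) illustration of the rewrite described above:

```nim
type Vec3[T] = object
  limbs: array[3, T]

proc `+`[T](a, b: Vec3[T]): Vec3[T] {.device.} =
  for i in 0 ..< 3:
    result.limbs[i] = a.limbs[i] + b.limbs[i]

proc addBoth(a, b: Vec3[float32]): Vec3[float32] {.device.} =
  # `a + b` is an infix on a non-basic type: instead of emitting a `gpuBinOp`,
  # the backend rewrites it into a `gpuCall` to the (renamed) `+` proc.
  a + b
```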
I.e. in addition to just:
```nim
Vec[float32]()
```
being:
```
ObjConstr
BracketExpr
Sym "Vec3"
Sym "float32"
```
Also handle the case of:
```
Vec3[float32](limbs: [1'f32, 1'f32, 1'f32])
```
being:
```
ObjConstr
BracketExpr
Sym "Vec3"
Sym "float32"
ExprColonExpr
Sym "limbs"
Bracket
Float32Lit 1.0
Float32Lit 1.0
Float32Lit 1.0
```
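In Nim source, those two AST shapes correspond roughly to (with a hypothetical `Vec3` definition for illustration):

```nim
type Vec3[T] = object
  limbs: array[3, T]

let a = Vec3[float32]()                              # ObjConstr of a BracketExpr, no fields
let b = Vec3[float32](limbs: [1'f32, 1'f32, 1'f32])  # plus an ExprColonExpr per given field
```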
Because we still had `farmTopLevel` adding a (now empty) element to the `globalBlocks`, our array access to `ctx.globalBlocks[0]` to generate the types didn't actually emit anything.
But we only strip if it is not a pointer to an array type!
This allows us to make a better choice about when to replace and when not to replace.
I.e. variables that correspond to tuple unpacking in Nim.
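For reference, the kind of Nim code that introduces such variables (a hedged sketch):

```nim
proc divmodU(a, b: uint32): (uint32, uint32) {.device.} =
  (a div b, a mod b)

proc useBoth(a, b: uint32): uint32 {.device.} =
  # The unpacking goes through a hidden temporary for the returned tuple;
  # those are the variables referred to above.
  let (q, r) = divmodU(a, b)
  q + r
```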
```nim
proc getType(ctx: var GpuContext, arg: GpuAst, typeOfIndex = true): GpuType =
  ## Tries to determine the underlying type of the AST.
  ##
  ## If `typeOfIndex` is `true`, we return the type of the index we access. Otherwise
```
This part is a bit confusingly worded.
Let's say we have `let foo = [float32 1, 2, 4, 8]` and we do `ctx.getType(foo[1])`, do we get `float32` or `int`?
- In the first case, we get the type of the item.
- In the second case we get the type of the index, but in which scenario would we need that proc with those inputs? It seems very unnatural.
It's a bit confusing, yes. I'll rephrase it.
By "type of the index", I mean it returns the type of an element of the gpuIndex operation (that is a [] accessor). The default is typeOfIndex = true, because that is more in line of what you expect when you write:
let foo = [float32 1, 2, 4, 8]
ctx.getType(foo[1])
which would return float32.
However, if typeOfIndex = false we instead return array[4, float32].
Currently we only use the case of typeOfIndex = false, because I only use this helper to determine the type of a gpuIndex of a gpuDeref expression, to check if we need to remove the gpuDeref layer.
But having a getType that works intuitively on GpuAst nodes (similar to getType on NimNode) could be useful. Indeed I had multiple cases where I almost implemented it before, but then realized I didn't quite need it.
I updated the explanation.
```nim
  ## If `typeOfIndex` is `true`, we return the type of the index we access. Otherwise
  ## we return the type of the array / pointer.
  ##
  ## NOTE: Do *not* rely on this for `mutable` or `implicit` fields of pointer types!
```
We might want to auto-handle HiddenDeref
What do you have in mind by auto-handle? We are auto handling HiddenDeref. What we might want to do in the future is to copy Nim's AST by explicitly differentiating between gpuDeref and gpuHiddenDeref (which doesn't exist yet).
Mutable types are passed as `nnkHiddenDeref`, but it doesn't seem like `getInnerPointerType` handles it, hence my remark.

```nim
proc getInnerPointerType(n: NimNode, allowToFail: bool = false, allowArrayIdent: bool = false): GpuType =
  doAssert n.typeKind in {ntyPtr, ntyPointer, ntyUncheckedArray, ntyVar} or n.kind == nnkPtrTy, "But was: " & $n.treerepr & " of typeKind " & $n.typeKind
  if n.typeKind in {ntyPointer, ntyUncheckedArray}:
    let typ = n.getTypeInst()
    doAssert typ.kind == nnkBracketExpr, "No, was: " & $typ.treerepr
    doAssert typ[0].kind in {nnkIdent, nnkSym}
    doAssert typ[0].strVal in ["ptr", "UncheckedArray"]
    result = nimToGpuType(typ[1], allowToFail, allowArrayIdent)
  elif n.kind == nnkPtrTy:
    result = nimToGpuType(n[0], allowToFail, allowArrayIdent)
  elif n.kind == nnkAddr:
    let typ = n.getTypeInst()
    result = getInnerPointerType(typ, allowToFail, allowArrayIdent)
  elif n.kind == nnkVarTy:
    # VarTy
    #   Sym "BigInt"
    result = nimToGpuType(n[0], allowToFail, allowArrayIdent)
  elif n.kind == nnkSym: # symbol of e.g. `ntyVar`
    result = nimToGpuType(n.getTypeInst(), allowToFail, allowArrayIdent)
  else:
    raiseAssert "Found what: " & $n.treerepr
```
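For context, a sketch of the kind of signature the comment refers to (the `BigInt` definition here is made up for illustration):

```nim
type BigInt = object
  limbs: array[8, uint32]

# A mutable parameter: in the typed AST, uses of `r` in the body are wrapped
# in `nnkHiddenDeref`, which is the case discussed above.
proc setZero(r: var BigInt) {.device.} =
  for i in 0 ..< r.limbs.len:
    r.limbs[i] = 0'u32
```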
```nim
  ## Addresses other AST patterns that need to be rewritten on CUDA. Aspects
  ## that are rewritten include:
  ##
  ## - `Index` of `Deref` of `Ident` needs to be rewritten to `Index` of `Ident` if the
```
Concrete examples would help maintenance, debugging or generalizing in the future
I have test cases for (I think) all of these (incl the WebGPU ones) as part of the Lita repository. I can move those over here, I suppose.
```nim
  ## `storage` buffers, which are filled before the kernel is executed.
  ##
  ## XXX: Document current not ideal behavior that one needs to be careful to pass data into
  ## `wgsl.fakeExecute` precisely in the order in which the `var foo {.constant.}` are defined
```
Is that something fixable?
Yes, it's fixable. Mostly I wasn't sure about the cleanest solution and given that fakeExecute is not even part of this PR at the moment, I left that for the future to save time.
```nim
proc add_ci(a: uint32, b: uint32): uint32 {.device.} =
  # Add with carry in only.
  # NOTE: `carry_flag` is not reset, because the next call after
  # an `add_ci` *must* be `add_co` or `sub_bo`, but never
```
A better explanation is that `add_ci` consumes the carry flag, so the next operation must be independent of it. I.e. a plain add is fine if it is for completely separate operands/purposes.
My point in the comment was that resetting the carry flag explicitly would be costly, and it is only an issue if the operations are used in the wrong order.
Ideally we'd just set `carry_flag = 0` at the end of this proc.
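To make the intended ordering explicit, a small sketch using the helper names from this thread (assuming `add_co` / `add_cio` share the `(uint32, uint32) -> uint32` shape of `add_ci` above; the real multi-limb code differs):

```nim
proc add3Limbs(a0, a1, a2, b0, b1, b2: uint32): (uint32, uint32, uint32) {.device.} =
  # `add_co` starts the carry chain, `add_cio` propagates it, `add_ci` consumes it.
  # After `add_ci` the carry flag is *not* reset, so the next carry-consuming
  # call must be an `add_co` / `sub_bo` that sets it again.
  let l0 = add_co(a0, b0)
  let l1 = add_cio(a1, b1)
  let l2 = add_ci(a2, b2)
  (l0, l1, l2)
```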
The PR is in good shape so merging it to unblock Lita. Leftover TODOs:
Mutable types are passed as `nnkHiddenDeref` […]. It is true though that depending on what node you were to pass into […]
This improves the initial WebGPU backend added in #564 to make it practically useful. Aside from adding a lot of features and bug fixes for the WGSL spec, it also makes changes towards making the entire `cuda` macro more useful:

- Generics instead of `static int` arguments: we now could replace the templates that receive a static parameter by a regular generic, e.g. for the `BigInt` type, and generate the correct functions and types based on what is actually instantiated. For the moment we don't update the code in `gpu_field_ops.nim` though.
- Functions do not have to be defined inside the `cuda` block anymore. We will pull any code used inside of the GPU block in, if it is called (see the sketch at the end of this description). This is a good step towards making code naturally run on CPU and GPU for example. Only the code called from the host still needs to be in the `cuda` macro. Note that keeping GPU code in `templates` can still be a good idea to not pollute the namespace and if one wants a clearer separation between GPU and CPU code.

As a result the CUDA backend was also updated slightly to follow a similar structure as the WGSL backend. That is, have a `preprocess` pass before the actual codegen pass to scan the global functions for functions and types actually used.

Other notable additions / changes:

- `-d:debugCuda` now compiles the CUDA code with debug symbols. Useful if one wants to e.g. run `compute-sanitizer` with `memcheck` or `cuda-gdb`.
- The `cuModuleGetGlobal` binding was changed to use `v2` as well.
- In-place operators are rewritten, e.g. `x += y` -> `x = x + y`.
- Accesses to a pointer field `foo.ptrField` are replaced by whatever we assign in the constructor to that field in the code. Only storage buffers (which are global) are allowed to be assigned as such (local pointers are extremely limited anyway). As they are global, the replacement works everywhere. We error for invalid pointer handling.
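A minimal sketch of the "functions outside the block" point above (the macro invocation, pragmas, and names here are illustrative, not necessarily this repo's exact API):

```nim
# Defined outside any GPU block: also usable from regular CPU code.
proc square(x: float32): float32 = x * x

cuda:
  proc applySquare(data: ptr UncheckedArray[float32], i: int) {.device.} =
    # `square` is not declared inside the `cuda` block; because it is called
    # here, the preprocess pass pulls its code (and any types it needs) in.
    data[i] = square(data[i])
```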
-d:debugCudanow compiles the CUDA code with debug symbols. Useful if one wants to e.g. runcompute-sanitizerwithmemcheckorcuda-gdb.cuModuleGetGlobalbinding to usev2as wellx += y->x = x + yfoo.ptrFieldby whatever we assign in the constructor to that field in the code. Only storage buffers (which are global) are allowed to be assigned as such (local pointers are extremely limited anyway). As they are global, the replacement works everywhere. We error for invalid pointer handling.