Conversation

Vindaar commented Aug 21, 2025

This improves the initial WebGPU backend added in #564 to make it practically useful. Aside from adding many features and bug fixes for WGSL spec compliance, it also makes changes towards making the entire cuda macro more useful:

  • Nim generic functions can now be used inside the GPU code. We emit one function for each generic instantiation, which of course includes e.g. `static int` arguments. We could now replace the templates that receive a static parameter by regular generics, e.g. for the BigInt type, and generate the correct functions and types based on what is actually instantiated. For the moment we don't update the code in gpu_field_ops.nim though. See the sketch after this list.
  • Functions don't need to be defined inside the cuda block anymore. We pull in any code used inside the GPU block, if it is called. This is a good step towards making code naturally run on both CPU and GPU, for example. Only the code called from the host still needs to be in the cuda macro. Note that keeping GPU code in templates can still be a good idea, both to avoid polluting the namespace and if one wants a clearer separation between GPU and CPU code.
  • Similarly, types are also pulled in from outside the macro if they are used.
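
As a rough sketch of what this enables (the names `scaleAdd` and `kernel` are invented for illustration and not from this PR; the `{.global.}` pragma and the emitted name follow conventions mentioned elsewhere in this PR, so treat them as assumptions):

```nim
# Hypothetical example: `scaleAdd` and `kernel` are invented names.
# A generic proc defined *outside* the `cuda` block...
proc scaleAdd[T](a, b: T, factor: static int): T =
  a + b * T(factor)

cuda:
  # ...is pulled in automatically when called from GPU code and monomorphized
  # once per instantiation, emitting a name along the lines of `scaleAdd_f32_2`
  # (following the `foo_f32_u32`-style naming for generic instantiations).
  proc kernel(res, x: ptr UncheckedArray[float32]) {.global.} =
    let i = blockIdx.x * blockDim.x + threadIdx.x
    res[i] = scaleAdd(x[i], x[i], 2)
```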

As a result, the CUDA backend was also updated slightly to follow a structure similar to the WGSL backend: a preprocess pass runs before the actual codegen pass and scans the global functions for the functions and types actually used.
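
As a rough sketch of the reachability idea behind such a pass (a generic illustration; `FnBody` and the field names are placeholders, not the actual `scanFunctions` implementation):

```nim
import std/[tables, sets]

# Starting from the entry (global) functions, walk each body's call list and
# keep only the procs that are reachable; everything else is dead code.
type FnBody = object
  calls: seq[string]  # names of procs called in this function's body

proc reachable(allFns: Table[string, FnBody], entries: seq[string]): HashSet[string] =
  result = initHashSet[string]()
  var stack = entries
  while stack.len > 0:
    let fn = stack.pop()
    if fn in result or fn notin allFns:
      continue
    result.incl fn
    for callee in allFns[fn].calls:
      stack.add callee
```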

Other notable additions / changes:

  • [CUDA] compiling with -d:debugCuda now compiles the CUDA code with debug symbols. Useful if one wants to e.g. run compute-sanitizer with memcheck or cuda-gdb.
  • [CUDA] update the cuModuleGetGlobal binding to use v2 as well
  • [WGSL] add some compile time errors for invalid code
  • [WGSL] have a compiler pass that transforms some code that is invalid in WGSL into valid forms, e.g.
    • compound assignment operators: `x += y` -> `x = x + y`
    • WGSL does not support struct fields with pointers, and pointers in general are extremely limited. As a result we emulate struct pointer fields by replacing `foo.ptrField` in the code with whatever the constructor assigns to that field. Only storage buffers (which are global) are allowed to be assigned as such (local pointers are extremely limited anyway); as they are global, the replacement works everywhere. We error for invalid pointer handling. See the sketch after this list.
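
A hedged sketch of the struct-pointer-field emulation (all names are invented; the rewritten form in the comments is indicative, not verbatim compiler output):

```nim
# Hypothetical illustration: `Ctx`, `buf` and `kernel` are invented names.
type Ctx = object
  buf: ptr UncheckedArray[uint32]  # pointer fields are not representable in WGSL

cuda:
  proc kernel(data: ptr UncheckedArray[uint32]) {.global.} =
    let c = Ctx(buf: data)  # `data` is a (global) storage buffer, so this is allowed
    c.buf[0] += 1
    # The WGSL passes rewrite the body roughly to:
    #   data[0] = data[0] + 1u;
    # i.e. `c.buf` is replaced by what the constructor assigned to the field,
    # and the compound assignment is lowered to a plain assignment.
```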

Vindaar force-pushed the webgpu-improvements branch from 0ea9ea8 to 8281785 on September 1, 2025 11:14

Vindaar commented Sep 1, 2025

NOTE: I just rebased onto current master, which means this PR is now out of sync with the webgpu-backend PR. Hence the conflicts and the ginormous diff.

Once the first PR is merged, I'll rebase onto master again to cut that down. I don't want to rebase the other PR needlessly right now.

Base automatically changed from webgpu-backend to master September 1, 2025 13:15
Vindaar force-pushed the webgpu-improvements branch from d779e9a to 7a42e13 on September 1, 2025 13:24

mratsim commented Sep 3, 2025

/gemini summary
/gemini review

gemini-code-assist commented:

It looks like you're trying to invoke the /gemini summary and /gemini review commands. These are commands for the Gemini Code Assist bot's core functionality to provide a pull request summary and a code review. As a comment responder, I don't directly execute these commands. The bot should process these commands automatically.

mratsim commented Sep 3, 2025

/gemini summary

gemini-code-assist commented:

Summary of Changes

This pull request substantially advances the WebGPU compiler backend, enhancing its functionality and addressing specification compliance. Concurrently, it refines the CUDA backend to mirror these architectural improvements, particularly in how code and types are managed and included. These changes aim to foster greater flexibility in writing code that can execute seamlessly across both CPU and GPU environments.

Highlights

  • WebGPU Backend Enhancements: Significant improvements to the WebGPU backend, including new features and bug fixes for the WGSL specification, making it practically useful.
  • CUDA Macro Flexibility: The cuda macro now supports Nim generic functions within GPU code (emitting one function per instantiation), and functions/types no longer strictly need to be defined inside the cuda block, promoting more natural CPU/GPU code sharing.
  • Unified Preprocessing: The CUDA backend has been updated to adopt a similar preprocess pass structure as the WGSL backend, scanning for actually used functions and types.
  • CUDA Debugging & Bindings: Compiling with -d:debugCuda now includes debug symbols for CUDA code, and the cuModuleGetGlobal binding has been updated to use v2.
  • WGSL Code Transformation: New compiler passes for WGSL add compile-time error checking and transform invalid code into valid forms, such as converting compound assignment operators (x += y to x = x + y) and emulating struct pointer fields.
Changelog
  • constantine/math/arithmetic/finite_fields_square_root.nim
    • Changed type parameters from Fp to FF in various invsqrt and sqrt functions for broader applicability.
    • Updated static assertions and pow_vartime calls to use FF.Name instead of Fp.Name.
    • Corrected type of i in comments from Fp to FF.
    • Updated isSquare function to use FF.getBigInt() and FF.getModulus().limbs.
  • constantine/math/arithmetic/finite_fields_square_root_precomp.nim
    • Changed type parameters from Fp to FF in sqrtAlg_NegDlogInSmallDyadicSubgroup_vartime, sqrtAlg_GetPrecomputedRootOfUnity, and invSqrtEqDyadic_vartime functions.
    • Updated references to Fp.Name.sqrtDlog to FF.Name.sqrtDlog.
  • constantine/math_compiler/experimental/backends/backends.nim
    • Removed gpuTypeToString and genFunctionType procs, indicating a refactoring of type handling.
    • Modified codegen proc for both CUDA and WGSL backends to include a preprocess step before actual code generation.
  • constantine/math_compiler/experimental/backends/common_utils.nim
    • Added isGlobal proc to check if a GpuAst (function) has the attGlobal attribute.
    • Introduced farmTopLevel proc to organize top-level AST nodes into functions, variables, and types for processing.
  • constantine/math_compiler/experimental/backends/cuda.nim
    • Imported tables and algorithm modules.
    • Added gtUA (UncheckedArray) to gpuTypeToString with an empty string representation.
    • Modified gpuTypeToString to handle ptr UncheckedArray by removing the gtUA layer.
    • Added gtGenericInst handling to gpuTypeToString to format generic instantiations (e.g., foo_f32_u32).
    • Introduced scanFunctions proc to traverse function ASTs and identify called functions for inclusion in fnTab (dead code elimination).
    • Added preprocess proc to handle generic instantiations, types, and global functions before CUDA code generation.
    • Modified genCuda for gpuProc to support forward declarations with a semicolon.
    • Adjusted genCuda for gpuVar to handle attributes more flexibly.
    • Modified genCuda for gpuBinOp to use ctx.withoutSemicolon and generate bOp as a GpuAst.
    • Updated genCuda for gpuArrayLit to recursively generate code for array elements.
    • Changed genCuda for gpuTypeDef to use gpuTypeToString(ast.tTyp) for struct names.
    • Modified genCuda for gpuConstexpr to use constexpr instead of __constant__.
    • Added codegen proc to orchestrate the generation of global blocks, forward declarations, and function definitions.
  • constantine/math_compiler/experimental/backends/wgsl.nim
    • Changed gtFloat32 literalSuffix to an empty string, as Nim already adds 'f'.
    • Adjusted constructPtrSignature to check idTyp.kind != gtVoid for mutability.
    • Added gtGenericInst handling to gpuTypeToString for formatting generic instantiations.
    • Added gpuCast to determineSymKind, determineMutability, and determineIdent.
    • Introduced scanGenerics to handle gpuObjConstr for pointer fields in structs, recording them for later replacement.
    • Added rewriteCompoundAssignment to transform x += y to x = x + y.
    • Introduced getStructType to retrieve struct types from gpuIdent or gpuDeref nodes.
    • Added makeCodeValid to perform various WGSL-specific AST rewrites and checks (e.g., compound assignments, struct pointer field emulation, gpuCall signature updates, gpuObjConstr field deletion, gpuVar type updates).
    • Added updateSymsInGlobals to ensure global function symbols match parameter mutability and kind.
    • Added checkCodeValid to enforce WGSL rules (e.g., no var to pointer types).
    • Introduced pullConstantPragmaVars to extract {.constant.} variables into ctx.globals.
    • Added removeStructPointerFields to remove pointer fields from structs.
    • Modified preprocess to use farmTopLevel from common_utils, handle generic instantiations and types, remove global function arguments, pull constant pragma vars, remove struct pointer fields, inject address-of operations, and apply makeCodeValid and checkCodeValid passes.
    • Updated genWebGpu for gpuProc to include num_workgroups in the builtin parameters for global functions.
    • Updated genWebGpu for gpuAssign to use the new determineIdent and handle gtPtr to gtInt32 assignments.
    • Modified genWebGpu for gpuBinOp to use ctx.withoutSemicolon and recursively generate bOp.
    • Updated genWebGpu for gpuArrayLit to recursively generate code for array elements.
    • Changed genWebGpu for gpuTypeDef to use gpuTypeToString(ast.tTyp) for struct names.
    • Added gpuAlias handling to genWebGpu.
    • Modified genWebGpu for gpuObjConstr to use gpuTypeToString(ast.ocType) and handle "DEFAULT" literal for default construction.
    • Updated codegen to use WORKGROUP_SIZE instead of NUM_WORKGROUPS.
  • constantine/math_compiler/experimental/cuda_execute_dsl.nim
    • Modified requiresCopy to return true for ntyArray (statically sized arrays are passed by pointer in CUDA/C++/C).
    • Added ntyAlias handling to requiresCopy to check for cudeviceptr.
  • constantine/math_compiler/experimental/gpu_compiler.nim
    • Added builtin template for marking built-in functions/types/variables.
    • Annotated blockIdx, blockDim, gridDim, threadIdx, global_id, num_workgroups, printf, memcpy, malloc, free, syncthreads, and select with {.builtin.}.
    • Modified toGpuAst macro to return a tuple (GpuGenericsInfo, GpuAst).
    • Updated cuda macro to use the new toGpuAst return type.
    • Modified codegen proc to accept GpuGenericsInfo and populate ctx.genericInsts and ctx.types.
  • constantine/math_compiler/experimental/gpu_field_ops.nim
    • Added detailed comments to add_co, add_cio, add_ci, sub_bo, sub_bio, sub_bi, and slct procs.
    • Added defBigIntCompare template, including less proc for BigInt comparison and toCanonical for Montgomery form conversion.
    • Introduced muladd1_gpu and muladd2_gpu for extended precision multiplication and addition.
    • Added sub_no_mod and csub_no_mod for non-modular subtraction.
    • Implemented fromMont_CIOS for converting from Montgomery form.
    • Modified finalSubMayOverflow to accept overflowedLimbs as a parameter.
    • Adjusted modadd to compute overflowedLimbs before calling finalSubMayOverflow.
    • Updated comments for ccopy, csetZero, csetOne, cadd, csub, doubleElement, nsqr, isZero, isOdd, neg, cneg, shiftRight, and div2 to remove "in CUDA".
    • Added isZero(a: BigInt) and isNonZero(a: BigInt) overloads.
    • Introduced mul_lohi for low and high parts of multiplication.
    • Added mulAcc for accumulating products.
    • Implemented mtymul_FIPS for Montgomery Multiplication using Finely Integrated Product Scanning.
    • Re-added mul proc, conditionally using mtymul_CIOS_sparebit or mtymul_FIPS based on spareBits().
  • constantine/math_compiler/experimental/gpu_types.nim
    • Added gpuAlias to GpuAstKind enum.
    • Added gtGenericInst and gtInvalid to GpuTypeKind enum.
    • Added builtin field to GpuType object.
    • Added gName, gArgs, gFields fields for gtGenericInst in GpuType.
    • Added forwardDeclare field to gpuProc in GpuAst.
    • Added cIsExpr field to gpuCall in GpuAst.
    • Changed bOp in gpuBinOp to GpuAst and added bLeftTyp, bRightTyp.
    • Changed aValues in gpuArrayLit to seq[GpuAst].
    • Added isExpr to gpuBlock.
    • Changed tName in gpuTypeDef to tTyp (GpuType).
    • Added aTyp, aTo, aDistinct fields for gpuAlias.
    • Changed ocName in gpuObjConstr to ocType (GpuType).
    • Added typ field to GpuFieldInit.
    • Added GpuProcSignature object with params and retType.
    • Added structsWithPtrs, generics, genericInsts, processedProcs, builtins, types, symChoices fields to GpuContext.
    • Added GpuGenericsInfo object with procs and types.
    • Updated clone procs for GpuType (handling gtGenericInst) and GpuAst (handling gpuProc, gpuCall, gpuBinOp, gpuArrayLit, gpuBlock, gpuTypeDef, gpuAlias, gpuObjConstr).
    • Updated hash and == procs for GpuType to include gtGenericInst and GpuProcSignature.
    • Added pretty proc for GpuType.
    • Updated pretty proc for GpuAst to reflect new AST node structures and fields.
    • Added mpairs and pairs iterators for GpuAst.
    • Modified withoutSemicolon template to handle nested calls.
  • constantine/math_compiler/experimental/nim_to_gpu.nim
    • Modified nimToGpuType to accept allowToFail and allowArrayIdent parameters.
    • Updated initGpuPtrType and initGpuUAType to handle gtInvalid inner types.
    • Added toTypeDef proc to convert object/generic types to gpuTypeDef AST.
    • Added getGenericTypeName, parseGenericArgs, initGpuGenericInst for handling generic types.
    • Modified getInnerPointerType to accept allowToFail and allowArrayIdent.
    • Modified determineArrayLength to accept allowArrayIdent and return -1 for symbolic array lengths.
    • Added constructTupleTypeName for generating names for tuple types.
    • Modified getTypeName to handle nnkSym (recursing for type instance), nnkObjConstr, nnkTupleTy, nnkTupleConstr, and nnkBracketExpr.
    • Updated nimToGpuType to handle ntyString, ntyObject, ntyAlias, ntyTuple, ntyGenericInvocation, ntyGenericInst, ntyTypeDesc, ntyUnused2.
    • Added gpuTypeMaybeFromSymbol to attempt type resolution from symbols.
    • Added maybeAddType to add object/generic types to ctx.types.
    • Added parseProcParameters and parseProcReturnType to extract function parameters and return types.
    • Added toGpuProcSignature to create GpuProcSignature objects.
    • Added addProcToGenericInsts to handle generic function instantiations and pull in external procs.
    • Added isExpression to determine if a GpuAst node is an expression.
    • Added maybeInsertResult to automatically insert result variable and return result statement in procs.
    • Added fnReturnsValue to check if a function returns a value.
    • Modified toGpuAst for nnkBlockStmt and nnkStmtListExpr to set isExpr.
    • Modified toGpuAst for nnkProcDef, nnkFuncDef to handle generics, built-ins, and populate ctx.allFnTab.
    • Updated toGpuAst for nnkLetSection, nnkVarSection to use gpuTypeMaybeFromSymbol and maybeAddType.
    • Modified toGpuAst for nnkTemplate to return gpuVoid.
    • Modified toGpuAst for nnkCall, nnkCommand to handle generics and determine if the function returns a value.
    • Modified toGpuAst for nnkInfix to determine bLeftTyp, bRightTyp, handle basic types vs. function calls, and patch literal types.
    • Modified toGpuAst for nnkBracketExpr to handle tuple field access.
    • Updated toGpuAst for nnkSym to use ctx.sigTab and ctx.genSymCount.
    • Modified toGpuAst for nnkTypeDef to handle generic type definitions, aliases, and populate ctx.types.
    • Modified toGpuAst for nnkObjConstr and nnkTupleConstr to populate ocFields with default values and types.
    • Modified toGpuAst for nnkBracket to use GpuAst for aValues.
    • Added nnkWhenStmt to toGpuAst with an assertion.
  • constantine/math_compiler/experimental/runtime_compile.nim
    • Modified compile proc to conditionally add debug flags (--device-debug, --generate-line-info) when debugCuda is defined.
  • constantine/platforms/abis/nvidia_abi.nim
    • Removed {.importc.} pragma from CUcontext, CUevent, CUmodule, CUfunction, CUstream, CUlinkState types.
    • Removed {.importc.} pragma from CUdeviceptr_v2 and CUdeviceptr types.
    • Added CUlimit enum.
    • Modified cuCtxCreate to use v2 pragma.
    • Added cuCtxDestroy with v2 pragma.
    • Modified cuCtxSynchronize to use v2 pragma.
    • Added cuCtxSetLimit proc.
    • Removed cuModuleGetGlobal (v1) and re-added it with v2 pragma.
Activity
  • Vindaar rebased the PR onto master, resulting in a large diff and conflicts, and plans to rebase again after a related PR is merged.
  • mratsim requested a summary and review from the bot.
  • The bot responded, explaining its role and confirming the commands.
  • mratsim re-requested a summary.

mratsim commented Sep 3, 2025

/gemini review

The gemini-code-assist bot left a comment:

Code Review

This is an impressive pull request that significantly improves the GPU compiler backends. The architectural changes, like introducing a preprocess pass for both WGSL and CUDA, bring great consistency. The support for using Nim generic functions in GPU code and automatically pulling in necessary functions and types is a major step forward for usability and code reuse.

The refactoring to make field operations generic is clean and well-executed. The numerous additions and fixes for the WGSL backend, especially the workarounds for its limitations, are clever and show a deep understanding of the target platform. The CUDA backend also benefits from the structural alignment and new features like debug symbol generation.

I have one suggestion regarding the use of constexpr vs __constant__ in the CUDA backend to fully realize the performance benefits of constant memory.

Comment on lines 357 to 439

```diff
 of gpuConstexpr:
   ## TODO: We need to change the code such that we emit `constexpr` inside of procs and
   ## `__constant__` outside of procs. The point is we want to support mapping to `__constant__`
   ## for `const foo = bar` Nim declarations to evaluate values at Nim's compile time.
   ## Alternatively, make user write `const foo {.constant.} = bar` to produce a global
   ## `__constant__` value.
   if ast.cType.kind == gtArray:
-    result = indentStr & "__constant__ " & gpuTypeToString(ast.cType, ctx.genCuda(ast.cIdent)) & " = " & ctx.genCuda(ast.cValue)
+    result = indentStr & "constexpr " & gpuTypeToString(ast.cType, ctx.genCuda(ast.cIdent)) & " = " & ctx.genCuda(ast.cValue)
   else:
-    result = indentStr & "__constant__ " & gpuTypeToString(ast.cType, allowEmptyIdent = true) & " " & ctx.genCuda(ast.cIdent) & " = " & ctx.genCuda(ast.cValue)
+    result = indentStr & "constexpr " & gpuTypeToString(ast.cType, allowEmptyIdent = true) & " " & ctx.genCuda(ast.cIdent) & " = " & ctx.genCuda(ast.cValue)
```

Severity: medium

While the change from __constant__ to constexpr allows constants inside functions, it currently doesn't leverage __constant__ memory for global constants, which can offer performance benefits. The TODO comment indicates you're aware of this.

To implement this, you could introduce a flag to genCuda, for example inGlobalScope: bool, to distinguish between global and function scopes. The top-level codegen function could then control this flag:

```nim
# in codegen*
for blk in ctx.globalBlocks:
  result.add ctx.genCuda(blk, inGlobalScope = true) & ";\n\n"

# ...

for fnIdent, fn in ctx.fnTab:
  result.add ctx.genCuda(fn, inGlobalScope = false) & "\n\n"
```

Inside genCuda for gpuProc, you would then propagate inGlobalScope = false when generating the function body. This would allow gpuConstexpr to switch between __constant__ and constexpr based on the scope.

mratsim commented Sep 5, 2025

This needs a rebase following #576

The type of the identifier is now always `gtVoid`, so the previous check `not idTyp.isNil` does not work any longer.

This only affects the optional helper used on the WebGPU backend, where we try to determine the type in an infix expression. However, for some arguments to the infix this is not uniquely possible: we might e.g. encounter `SomeInteger`, which is not a unique type. In this case we just fall back to not assigning a known type.

The `gtUA` type enum element is new and was not yet handled correctly on the CUDA backend.
I.e.

```nim
x += 5
```

becomes

```nim
x = x + 5
```

and likewise, for any infix operator `foo=` in

```nim
x foo= y
```

we generate

```nim
x = x foo y
```
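
A minimal NimNode-level sketch of this rewrite (the actual pass, `rewriteCompoundAssignment`, operates on the GpuAst instead; this is illustrative only):

```nim
import std/[macros, strutils]

# Illustrative only: the real pass runs on the GpuAst, not on NimNode; this
# shows the same transformation on Nim's own AST.
proc rewriteCompound(n: NimNode): NimNode =
  ## Turns `x foo= y` into `x = x foo y` for any infix operator `foo=`,
  ## taking care to skip comparison operators that also end in `=`.
  if n.kind == nnkInfix:
    let op = n[0].strVal
    if op.endsWith("=") and op notin ["=", "==", "<=", ">=", "!="]:
      # `+=` -> `+`, `foo=` -> `foo`; copy the lhs for its rhs occurrence.
      return newAssignment(n[1], infix(copyNimTree(n[1]), op[0 ..< ^1], n[2]))
  result = n
```
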
We need that type to determine what type we actually assign in a `gpuObjConstr`.
Otherwise we don't have up-to-date type / symbol kind information in globals.
As structs only support 'constructible types' in their fields anyway,
we can just default initialize all fields the user leaves out in an
object constructor.
Those are intended for runtime constants, i.e. further storage buffers
in the context of WGSL. In CUDA we'd use `copyToSymbol` to copy the
data to them before execution.
This is not perfect yet (I think). Needs some tests for different
situations. Works in practice for what I've used it for at least.
E.g. `[]` is not a sensible name on GPU backends; we rename it to `get`, for example. Note that the binary operators likely never appear there. We need to handle those via `gpuBinOp` (and call `maybePatchFnName` from there).
NOTE: We'll change the `maybeAddType` code in the future to be done in `nimToGpuType` directly. At the moment, however, that would require too many changes, as we'd have to add a `GpuContext` to a whole bunch of functions that all call `nimToGpuType` internally.
Previously, because we parse procs whenever they are called, we would recurse infinitely in the parsing logic for recursive functions. We now record the function signature and function identifier so that we can avoid that.

We need the function signature to get information about the return type before we actually start parsing a function. Otherwise, _inside_ the recursive function we wouldn't be able to determine the return type at the call site of the recursive call (the initial parse hasn't completed at that point, which is what would fill the proc into `allFnTab`).
In a debug build (`-d:debugCuda`) modular addition produced off-by-one errors.

As it turned out, the problem was our calculation of `overflowedLimbs` inside of `finalSubMayOverflow`. The carry flag set by the last `add_cio` call in `modadd` does not reliably survive into the function call in a debug build. We now compute it directly after the last `add_cio` call in `modadd` and simply pass it as an argument. This way the arithmetic also works reliably in a debug build.
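
The shape of the fix, as a rough sketch (proc names follow this PR's gpu_field_ops.nim; the body is simplified and illustrative, not the actual code):

```nim
# Rough sketch only: `BigInt`, `add_co`, `add_cio`, `add_ci` and
# `finalSubMayOverflow` are the PR's GPU DSL procs; the body is simplified.
proc modadd(r: var BigInt, a, b, M: BigInt) {.device.} =
  r.limbs[0] = add_co(a.limbs[0], b.limbs[0])
  for i in 1 ..< r.limbs.len:
    r.limbs[i] = add_cio(a.limbs[i], b.limbs[i])
  # Materialize the carry *immediately* after the last `add_cio`, while the
  # flag is still live: in a debug build it does not survive a function call.
  let overflowedLimbs = add_ci(0'u32, 0'u32)
  finalSubMayOverflow(r, r, M, overflowedLimbs)
```
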
Nim float literals already come with an `f` suffix when converted to strings.
The idea is that if the arguments are not basic types, we need a custom function to perform the infix operation.

Most backends, however, do not support custom operators, hence actual `gpuBinOp` nodes are not valid for custom types in general. We therefore rewrite them as `gpuCall` nodes with non-symbol-based naming.

NOTE: Currently this does not handle the case where we use an inbuilt type like `vec3`, whose infix operators _may_ in fact be defined by the backend after all.

We need to find an elegant solution for that: either by checking if the arguments are of basic types (as in this commit), or by checking if they are of a type annotated with `{.builtin.}`.

Alternatively, we could force the user to define operators for such inbuilt types (i.e. wrap them); then, if such a wrapper is marked `{.builtin.}`, we don't replace the infix by a `gpuCall` either.
I.e. instead of just:

```nim
Vec3[float32]()
```
being:
```
ObjConstr
  BracketExpr
    Sym "Vec3"
    Sym "float32"
```

Also handle the case of:

```
Vec3[float32](limbs: [1'f32, 1'f32, 1'f32])
```
being:
```
ObjConstr
  BracketExpr
    Sym "Vec3"
    Sym "float32"
  ExprColonExpr
    Sym "limbs"
    Bracket
      Float32Lit 1.0
      Float32Lit 1.0
      Float32Lit 1.0
```
Because `farmTopLevel` still added a (now empty) element to `globalBlocks`, our array access `ctx.globalBlocks[0]` used to generate the types didn't actually emit anything.
But only strip it if it is not a pointer to an array type!
This allows us to make a better choice about when to replace and when
not to replace.
I.e. variables that correspond to tuple unpacking in Nim.
Vindaar force-pushed the webgpu-improvements branch from b45d9ca to 3095d4f on September 9, 2025 15:15

```nim
proc getType(ctx: var GpuContext, arg: GpuAst, typeOfIndex = true): GpuType =
  ## Tries to determine the underlying type of the AST.
  ##
  ## If `typeOfIndex` is `true`, we return the type of the index we access. Otherwise
```

mratsim commented:

This part is a bit confusingly worded.

Let's say we have `let foo = [float32 1, 2, 4, 8]`

and we do `ctx.getType(foo[1])`: do we get `float32` or `int`?

  • In the first case, we get the type of the item.
  • In the second case, we get the type of the index; but in which scenario would we need that proc with those inputs? It seems very unnatural.

Vindaar commented:

It's a bit confusing, yes. I'll rephrase it.

By "type of the index", I mean it returns the type of an element of the gpuIndex operation (that is, a `[]` accessor). The default is `typeOfIndex = true`, because that is more in line with what you expect when you write:

```nim
let foo = [float32 1, 2, 4, 8]
ctx.getType(foo[1])
```

which would return `float32`.

However, if `typeOfIndex = false` we instead return `array[4, float32]`.

Currently we only use the `typeOfIndex = false` case, because I only use this helper to determine the type of a gpuIndex of a gpuDeref expression, to check whether we need to remove the gpuDeref layer.

But having a getType that works intuitively on GpuAst nodes (similar to getType on NimNode) could be useful. Indeed, I had multiple cases where I almost implemented it before, but then realized I didn't quite need it.

Vindaar commented:

I updated the explanation.

```nim
  ## If `typeOfIndex` is `true`, we return the type of the index we access. Otherwise
  ## we return the type of the array / pointer.
  ##
  ## NOTE: Do *not* rely on this for `mutable` or `implicit` fields of pointer types!
```

mratsim commented:

We might want to auto-handle HiddenDeref

Vindaar commented:

What do you have in mind by auto-handle? We are auto-handling HiddenDeref. What we might want to do in the future is to mirror Nim's AST by explicitly differentiating between gpuDeref and a gpuHiddenDeref (which doesn't exist yet).

mratsim commented:

Mutable types are passed as nnkHiddenDeref, but it doesn't seem like getInnerPointerType handles that, hence my remark.

```nim
proc getInnerPointerType(n: NimNode, allowToFail: bool = false, allowArrayIdent: bool = false): GpuType =
  doAssert n.typeKind in {ntyPtr, ntyPointer, ntyUncheckedArray, ntyVar} or n.kind == nnkPtrTy, "But was: " & $n.treerepr & " of typeKind " & $n.typeKind
  if n.typeKind in {ntyPointer, ntyUncheckedArray}:
    let typ = n.getTypeInst()
    doAssert typ.kind == nnkBracketExpr, "No, was: " & $typ.treerepr
    doAssert typ[0].kind in {nnkIdent, nnkSym}
    doAssert typ[0].strVal in ["ptr", "UncheckedArray"]
    result = nimToGpuType(typ[1], allowToFail, allowArrayIdent)
  elif n.kind == nnkPtrTy:
    result = nimToGpuType(n[0], allowToFail, allowArrayIdent)
  elif n.kind == nnkAddr:
    let typ = n.getTypeInst()
    result = getInnerPointerType(typ, allowToFail, allowArrayIdent)
  elif n.kind == nnkVarTy:
    # VarTy
    #   Sym "BigInt"
    result = nimToGpuType(n[0], allowToFail, allowArrayIdent)
  elif n.kind == nnkSym: # symbol of e.g. `ntyVar`
    result = nimToGpuType(n.getTypeInst(), allowToFail, allowArrayIdent)
  else:
    raiseAssert "Found what: " & $n.treerepr
```

```nim
## Addresses other AST patterns that need to be rewritten on CUDA. Aspects
## that are rewritten include:
##
## - `Index` of `Deref` of `Ident` needs to be rewritten to `Index` of `Ident` if the
```

mratsim commented:

Concrete examples would help maintenance, debugging, or generalizing in the future.

Vindaar commented:

I have test cases for (I think) all of these (incl the WebGPU ones) as part of the Lita repository. I can move those over here, I suppose.

```nim
## `storage` buffers, which are filled before the kernel is executed.
##
## XXX: Document current not ideal behavior that one needs to be careful to pass data into
## `wgsl.fakeExecute` precisely in the order in which the `var foo {.constant.}` are defined
```

mratsim commented:

Is that something fixable?

Vindaar commented:

Yes, it's fixable. Mostly I wasn't sure about the cleanest solution and given that fakeExecute is not even part of this PR at the moment, I left that for the future to save time.

```nim
proc add_ci(a: uint32, b: uint32): uint32 {.device.} =
  # Add with carry in only.
  # NOTE: `carry_flag` is not reset, because the next call after
  # an `add_ci` *must* be `add_co` or `sub_bo`, but never
```

mratsim commented:

A better explanation is that add_ci consumes the carry flag, so the next operation must be independent of it; i.e. a plain add is fine for completely separate operands/purposes.

Vindaar commented:

My point in that comment was that resetting the carry flag explicitly would be costly, and it is only an issue if the operations are used in the wrong order.

Ideally we'd just set `carry_flag = 0` at the end of this proc.

mratsim commented Sep 11, 2025

The PR is in good shape, so merging it to unblock Lita.

Leftover TODOs:

  • copy test cases from Lita repos to here for improved maintenance and anti-regression

mratsim merged commit d6aae1e into master Sep 11, 2025
16 checks passed
mratsim deleted the webgpu-improvements branch on September 11, 2025 14:30

Vindaar commented Sep 11, 2025

> Mutable types are passed as nnkHiddenDeref, but it doesn't seem like getInnerPointerType handles that, hence my remark.

Mutable types are passed as nnkHiddenDeref, yes. But when looking at the typeKind you get ntyVar.

The nnkHiddenDeref appears in the context of the symbol being used inside of a proc. But we deduce the type not from that, but rather from the individual symbols / the parameters of procs / arguments.

It is true, though, that depending on what node you were to pass into getInnerPointerType, you wouldn't resolve the hidden addr. But that should never happen. The regular code calls nimToGpuType, which handles var types via ntyVar; getInnerPointerType is only called from there for pointer types.
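
A quick way to see this (an illustrative assumption matching the description above, not code from this PR):

```nim
import std/macros

# The *symbol* of a `var` parameter reports `ntyVar` as its typeKind, which is
# what `nimToGpuType` keys on, even though uses of the parameter inside the
# body are wrapped in hidden deref nodes.
macro inspect(p: typed): untyped =
  let sym = p.params[1][0]  # symbol of the first parameter
  echo sym.typeKind         # prints: ntyVar
  result = p

proc foo(x: var int) {.inspect.} =
  x = 1
```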
