WebGPU GPU compiler backend improvements (and small CUDA) #565
Conversation
NOTE: I just rebased onto current master, which means this PR is now out of date to the […]. Once the first PR is merged, I'll rebase onto master again to cut that down. Don't want to rebase the other PR needlessly now.
/gemini summary |
Summary of Changes

This pull request substantially advances the WebGPU compiler backend, enhancing its functionality and addressing specification compliance. Concurrently, it refines the CUDA backend to mirror these architectural improvements, particularly in how code and types are managed and included. These changes aim to foster greater flexibility in writing code that can execute seamlessly across both CPU and GPU environments.
/gemini review
Code Review
This is an impressive pull request that significantly improves the GPU compiler backends. The architectural changes, like introducing a preprocess pass for both WGSL and CUDA, bring great consistency. The support for using Nim generic functions in GPU code and automatically pulling in necessary functions and types is a major step forward for usability and code reuse.
The refactoring to make field operations generic is clean and well-executed. The numerous additions and fixes for the WGSL backend, especially the workarounds for its limitations, are clever and show a deep understanding of the target platform. The CUDA backend also benefits from the structural alignment and new features like debug symbol generation.
I have one suggestion regarding the use of constexpr vs __constant__ in the CUDA backend to fully realize the performance benefits of constant memory.
```diff
 of gpuConstexpr:
   ## TODO: We need to change the code such that we emit `constexpr` inside of procs and
   ## `__constant__` outside of procs. The point is we want to support mapping to `__constant__`
   ## for `const foo = bar` Nim declarations to evaluate values at Nim's compile time.
   ## Alternatively, make user write `const foo {.constant.} = bar` to produce a global
   ## `__constant__` value.
   if ast.cType.kind == gtArray:
-    result = indentStr & "__constant__ " & gpuTypeToString(ast.cType, ctx.genCuda(ast.cIdent)) & " = " & ctx.genCuda(ast.cValue)
+    result = indentStr & "constexpr " & gpuTypeToString(ast.cType, ctx.genCuda(ast.cIdent)) & " = " & ctx.genCuda(ast.cValue)
   else:
-    result = indentStr & "__constant__ " & gpuTypeToString(ast.cType, allowEmptyIdent = true) & " " & ctx.genCuda(ast.cIdent) & " = " & ctx.genCuda(ast.cValue)
+    result = indentStr & "constexpr " & gpuTypeToString(ast.cType, allowEmptyIdent = true) & " " & ctx.genCuda(ast.cIdent) & " = " & ctx.genCuda(ast.cValue)
```
While the change from __constant__ to constexpr allows constants inside functions, it currently doesn't leverage __constant__ memory for global constants, which can offer performance benefits. The TODO comment indicates you're aware of this.
To implement this, you could introduce a flag to genCuda, for example inGlobalScope: bool, to distinguish between global and function scopes. The top-level codegen function could then control this flag:
```nim
# in codegen*
for blk in ctx.globalBlocks:
  result.add ctx.genCuda(blk, inGlobalScope = true) & ";\n\n"
# ...
for fnIdent, fn in ctx.fnTab:
  result.add ctx.genCuda(fn, inGlobalScope = false) & "\n\n"
```

Inside `genCuda` for `gpuProc`, you would then propagate `inGlobalScope = false` when generating the function body. This would allow `gpuConstexpr` to switch between `__constant__` and `constexpr` based on the scope.
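To make the intended mapping concrete, here is a small sketch (the Nim declarations and the `{.device.}` pragma follow the style used elsewhere in this PR; the CUDA spellings in the comments are only the intended emission under such a scope flag, not what the current code produces):

```nim
# Global scope: would be emitted as `__constant__` (device constant memory).
const roundOffsets = [1'u32, 2'u32, 3'u32]

proc shifted(i: int): uint32 {.device.} =
  # Proc scope: would be emitted as `constexpr`.
  const bias = 10'u32
  roundOffsets[i] + bias
```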
This needs a rebase following #576
The type of the identifier is now always `gtVoid`, so the previous check `not idTyp.isNil` does not work any longer.
This is only for the optional helper used on the WebGPU backend, where we try to determine the type in an infix expression. However, for some arguments to the infix this is not uniquely possible. I.e. we might encounter `SomeInteger`, which is not a unique type. In this case we just fall back to not assigning a known type.
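For illustration, a hypothetical generic proc (not from this PR) where one infix argument only carries a typeclass rather than a concrete type:

```nim
proc scaleBy[T: SomeUnsignedInt](x: T, shift: SomeInteger): T {.device.} =
  # `shift` is only known to be `SomeInteger` here, which is not a unique type,
  # so the WebGPU type helper falls back to "no known type" for that argument.
  x shl shift
```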
The `gtUA` type enum element is new and was not correctly handled yet on the CUDA backend.
I.e. `x += 5` becomes `x = x + 5` etc. For any operator `foo=`, in `x foo= y` we generate `x = x foo y`.
We need that type to determine information about what type we actually assign in a `gpuObjConstr`.
Otherwise we don't have up-to-date type / symbol kind information in globals.
As structs only support 'constructible types' in their fields anyway, we can just default-initialize all fields the user leaves out in an object constructor.
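A small illustration of what this allows (hypothetical type, written against the `{.device.}` style used in this PR):

```nim
type Point3 = object
  x, y, z: float32

proc shiftX(dx: float32): Point3 {.device.} =
  # Only `x` is given; `y` and `z` are default-initialized by the backend,
  # mirroring what Nim does on the CPU.
  Point3(x: dx)
```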
Those are intended for runtime constants, i.e. further storage buffers in the context of WGSL. In CUDA we'd use `copyToSymbol` to copy the data to them before execution.
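Roughly, such a runtime constant looks like the following sketch (the `var foo {.constant.}` form is the one referred to in the review thread below; the array type and names are made up for illustration):

```nim
# On WGSL this becomes an additional storage buffer filled before execution;
# on CUDA one would `copyToSymbol` the data into it before launching.
var roundConstants {.constant.}: array[8, uint32]

proc mixIn(acc: uint32, i: int): uint32 {.device.} =
  acc xor roundConstants[i]
```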
This is not perfect yet (I think). Needs some tests for different situations. Works in practice for what I've used it for at least.
E.g. `[]` is not a sensible name on GPU backends. We rename it to `get` for example. Note that the binary operators likely never actually appear there. We need to handle those via `gpuBinOp` (and call `maybePatchFnName` from there).
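For instance (hypothetical type; the `[]` -> `get` renaming is what the commit describes):

```nim
type Digits = object
  limbs: array[4, uint32]

# `[]` is not a valid function name on the GPU backends, so a proc like this
# is emitted under a patched name such as `get`.
proc `[]`(d: Digits, i: int): uint32 {.device.} =
  d.limbs[i]
```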
NOTE: We'll change the `maybeAddType` code in the future to be done in `nimToGpuType` directly. However, at the moment that produces a bit too much required change with having to add `GpuContext` to a whole bunch of functions that all call `nimToGpuType` internally.
Previously, because we parse procs whenever they are called, we would infinitely recurse in the parsing logic for recursive procs. We now record the function signature and function identifier so that we can avoid that. We need the function signature to get information about the return type before we actually start parsing a function. Otherwise, _inside_ of the recursive function we wouldn't be able to determine the return type at the callsite of the recursive call (the initial parse hasn't been completed at that point yet, which is what fills the proc into `allFnTab`).
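A minimal example of the kind of proc that used to recurse forever in the parser (hypothetical function, using the `{.device.}` pragma from this PR):

```nim
proc sumTo(n: uint32): uint32 {.device.} =
  # The recursive call below requires the return type of `sumTo` to be known
  # already, even though parsing of `sumTo` itself has not finished. Recording
  # the signature up front makes that possible.
  if n == 0: 0'u32
  else: n + sumTo(n - 1)
```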
In a debug build (`-d:debugCuda`) modular addition produced off-by-one errors. As it turned out, the problem was our calculation of `overflowedLimbs` inside of `finalSubMayOverflow`. The carry flag set by the last `add_cio` call in `modadd` does not reliably survive into the function call in a debug build. We now compute it directly after the last `add_cio` call in `modadd` and simply pass it as an argument. This way the arithmetic also works reliably in a debug build.
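A single-limb sketch of the pattern (hypothetical helpers, not the real `modadd` / `finalSubMayOverflow` code): the caller computes the overflow right after the final addition and passes it explicitly, rather than the callee re-reading a carry flag that may not survive the call in a debug build.

```nim
proc finalSub(r: var uint32, m: uint32, overflowed: uint32) {.device.} =
  # Uses the explicitly passed overflow instead of a global carry flag.
  if overflowed != 0 or r >= m:
    r = r - m

proc modaddSketch(a, b, m: uint32): uint32 {.device.} =
  var r = a + b                   # wraps around on overflow
  let overflowed = uint32(r < a)  # computed directly after the addition
  finalSub(r, m, overflowed)      # handed over as an argument
  r
```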
Nim float literals, when converted to strings, already come with an `f` suffix.
The idea is that if the arguments are not basic types, we will need a
custom function to perform the infix operation.
Most backends, however, do not support custom operators, so actual
`gpuBinOp` nodes are not valid for custom types in general. Hence, we
rewrite them as `gpuCall` nodes with non-symbol based naming.
NOTE: Currently this does not handle the case where we might use an
inbuilt type like `vec3` and its implementation of infix operators
that _may_ be defined after all.
We need to find an elegant solution for that: either by checking if
the arguments are of basic types (like in this commit) or if they are of a
type annotated with `{.builtin.}`.
Alternatively, we could force the user to define operators for such
inbuilt types (i.e. wrap them) and then if there is a wrapper that is
marked `{.builtin.}` we don't replace infix by `gpuCall` either.
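As a concrete (hypothetical) illustration of the rewrite described above:

```nim
type Vec3[T] = object
  limbs: array[3, T]

proc `+`[T](a, b: Vec3[T]): Vec3[T] {.device.} =
  for i in 0 ..< 3:
    result.limbs[i] = a.limbs[i] + b.limbs[i]

proc addBoth(a, b: Vec3[float32]): Vec3[float32] {.device.} =
  # `a + b` is an infix on a non-basic type: instead of emitting a `gpuBinOp`,
  # the backend rewrites it into a `gpuCall` to the (renamed) `+` proc.
  a + b
```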
I.e. in addition to just:
```nim
Vec[float32]()
```
being:
```
ObjConstr
BracketExpr
Sym "Vec3"
Sym "float32"
```
Also handle the case of:
```
Vec3[float32](limbs: [1'f32, 1'f32, 1'f32])
```
being:
```
ObjConstr
BracketExpr
Sym "Vec3"
Sym "float32"
ExprColonExpr
Sym "limbs"
Bracket
Float32Lit 1.0
Float32Lit 1.0
Float32Lit 1.0
```
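In Nim source, those two AST shapes correspond roughly to (with a hypothetical `Vec3` definition for illustration):

```nim
type Vec3[T] = object
  limbs: array[3, T]

let a = Vec3[float32]()                              # ObjConstr of a BracketExpr, no fields
let b = Vec3[float32](limbs: [1'f32, 1'f32, 1'f32])  # plus an ExprColonExpr per given field
```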
Because we still had `farmTopLevel` adding a (now empty) element to the `globalBlocks`, our array access to `ctx.globalBlocks[0]` to generate the types didn't actually emit anything.
But we only strip if it is not a pointer to an array type!
This allows us to make a better choice about when to replace and when not to replace.
I.e. variables that correspond to tuple unpacking in Nim.
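For reference, the kind of Nim code that introduces such variables (a hedged sketch):

```nim
proc divmodU(a, b: uint32): (uint32, uint32) {.device.} =
  (a div b, a mod b)

proc useBoth(a, b: uint32): uint32 {.device.} =
  # The unpacking goes through a hidden temporary for the returned tuple;
  # those are the variables referred to above.
  let (q, r) = divmodU(a, b)
  q + r
```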
```nim
proc getType(ctx: var GpuContext, arg: GpuAst, typeOfIndex = true): GpuType =
  ## Tries to determine the underlying type of the AST.
  ##
  ## If `typeOfIndex` is `true`, we return the type of the index we access. Otherwise
```
This part is a bit confusingly worded.
Let's say we have `let foo = [float32 1, 2, 4, 8]` and we do `ctx.getType(foo[1])`, do we get `float32` or `int`?
- In the first case, we get the type of the item.
- In the second case we get the type of the index, but in which scenario would we need that proc with those inputs? It seems very unnatural.
It's a bit confusing, yes. I'll rephrase it.
By "type of the index", I mean it returns the type of an element of the gpuIndex operation (that is a [] accessor). The default is typeOfIndex = true, because that is more in line of what you expect when you write:
let foo = [float32 1, 2, 4, 8]
ctx.getType(foo[1])
which would return float32.
However, if typeOfIndex = false we instead return array[4, float32].
Currently we only use the case of typeOfIndex = false, because I only use this helper to determine the type of a gpuIndex of a gpuDeref expression, to check if we need to remove the gpuDeref layer.
But having a getType that works intuitively on GpuAst nodes (similar to getType on NimNode) could be useful. Indeed I had multiple cases where I almost implemented it before, but then realized I didn't quite need it.
I updated the explanation.
```nim
  ## If `typeOfIndex` is `true`, we return the type of the index we access. Otherwise
  ## we return the type of the array / pointer.
  ##
  ## NOTE: Do *not* rely on this for `mutable` or `implicit` fields of pointer types!
```
We might want to auto-handle HiddenDeref
What do you have in mind by auto-handle? We are auto handling HiddenDeref. What we might want to do in the future is to copy Nim's AST by explicitly differentiating between gpuDeref and gpuHiddenDeref (which doesn't exist yet).
Mutable types are passed as `nnkHiddenDeref`, but it doesn't seem like `getInnerPointerType` handles it, hence my remark.

```nim
proc getInnerPointerType(n: NimNode, allowToFail: bool = false, allowArrayIdent: bool = false): GpuType =
  doAssert n.typeKind in {ntyPtr, ntyPointer, ntyUncheckedArray, ntyVar} or n.kind == nnkPtrTy, "But was: " & $n.treerepr & " of typeKind " & $n.typeKind
  if n.typeKind in {ntyPointer, ntyUncheckedArray}:
    let typ = n.getTypeInst()
    doAssert typ.kind == nnkBracketExpr, "No, was: " & $typ.treerepr
    doAssert typ[0].kind in {nnkIdent, nnkSym}
    doAssert typ[0].strVal in ["ptr", "UncheckedArray"]
    result = nimToGpuType(typ[1], allowToFail, allowArrayIdent)
  elif n.kind == nnkPtrTy:
    result = nimToGpuType(n[0], allowToFail, allowArrayIdent)
  elif n.kind == nnkAddr:
    let typ = n.getTypeInst()
    result = getInnerPointerType(typ, allowToFail, allowArrayIdent)
  elif n.kind == nnkVarTy:
    # VarTy
    #   Sym "BigInt"
    result = nimToGpuType(n[0], allowToFail, allowArrayIdent)
  elif n.kind == nnkSym: # symbol of e.g. `ntyVar`
    result = nimToGpuType(n.getTypeInst(), allowToFail, allowArrayIdent)
  else:
    raiseAssert "Found what: " & $n.treerepr
```
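For context, a sketch of the kind of signature the comment refers to (the `BigInt` definition here is made up for illustration):

```nim
type BigInt = object
  limbs: array[8, uint32]

# A mutable parameter: in the typed AST, uses of `r` in the body are wrapped
# in `nnkHiddenDeref`, which is the case discussed above.
proc setZero(r: var BigInt) {.device.} =
  for i in 0 ..< r.limbs.len:
    r.limbs[i] = 0'u32
```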
```nim
  ## Addresses other AST patterns that need to be rewritten on CUDA. Aspects
  ## that are rewritten include:
  ##
  ## - `Index` of `Deref` of `Ident` needs to be rewritten to `Index` of `Ident` if the
```
Concrete examples would help maintenance, debugging or generalizing in the future
I have test cases for (I think) all of these (incl the WebGPU ones) as part of the Lita repository. I can move those over here, I suppose.
```nim
  ## `storage` buffers, which are filled before the kernel is executed.
  ##
  ## XXX: Document current not ideal behavior that one needs to be careful to pass data into
  ## `wgsl.fakeExecute` precisely in the order in which the `var foo {.constant.}` are defined
```
Is that something fixable?
Yes, it's fixable. Mostly I wasn't sure about the cleanest solution and given that fakeExecute is not even part of this PR at the moment, I left that for the future to save time.
```nim
proc add_ci(a: uint32, b: uint32): uint32 {.device.} =
  # Add with carry in only.
  # NOTE: `carry_flag` is not reset, because the next call after
  # an `add_ci` *must* be `add_co` or `sub_bo`, but never
```
A better explanation is that `add_ci` consumes the carry flag, so the next operation must be independent of it. I.e. a plain add is fine if it is for completely separate operands/purposes.
My point in the comment was that resetting the carry flag explicitly would be costly, and it is only an issue if the operations are used in the wrong order.
Ideally we'd just set `carry_flag = 0` at the end of this proc.
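To make the intended ordering explicit, a small sketch using the helper names from this thread (assuming `add_co` / `add_cio` share the `(uint32, uint32) -> uint32` shape of `add_ci` above; the real multi-limb code differs):

```nim
proc add3Limbs(a0, a1, a2, b0, b1, b2: uint32): (uint32, uint32, uint32) {.device.} =
  # `add_co` starts the carry chain, `add_cio` propagates it, `add_ci` consumes it.
  # After `add_ci` the carry flag is *not* reset, so the next carry-consuming
  # call must be an `add_co` / `sub_bo` that sets it again.
  let l0 = add_co(a0, b0)
  let l1 = add_cio(a1, b1)
  let l2 = add_ci(a2, b2)
  (l0, l1, l2)
```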
The PR is in good shape so merging it to unblock Lita. Leftover TODOs:
Mutable types are passed as `nnkHiddenDeref` […]. It is true though that depending on what node you were to pass into […]
This improves the initial WebGPU backend added in #564 to make it practically useful. Aside from adding a lot of features and bug fixes for the WGSL spec, it also makes changes towards making the entire `cuda` macro more useful:

- Generics instead of `static int` arguments: we now could replace the templates that receive a static parameter by a regular generic, e.g. for the `BigInt` type, and generate the correct functions and types based on what is actually instantiated. For the moment we don't update the code in `gpu_field_ops.nim` though.
- Functions do not have to be defined inside the `cuda` block anymore. We will pull any code used inside of the GPU block in, if it is called (see the sketch at the end of this description). This is a good step towards making code naturally run on CPU and GPU for example. Only the code called from the host still needs to be in the `cuda` macro. Note that keeping GPU code in `templates` can still be a good idea to not pollute the namespace and if one wants a clearer separation between GPU and CPU code.

As a result the CUDA backend was also updated slightly to follow a similar structure as the WGSL backend. That is, have a `preprocess` pass before the actual codegen pass to scan the global functions for functions and types actually used.

Other notable additions / changes:

- `-d:debugCuda` now compiles the CUDA code with debug symbols. Useful if one wants to e.g. run `compute-sanitizer` with `memcheck` or `cuda-gdb`.
- The `cuModuleGetGlobal` binding was changed to use `v2` as well.
- In-place operators are rewritten, e.g. `x += y` -> `x = x + y`.
- Accesses to a pointer field `foo.ptrField` are replaced by whatever we assign in the constructor to that field in the code. Only storage buffers (which are global) are allowed to be assigned as such (local pointers are extremely limited anyway). As they are global, the replacement works everywhere. We error for invalid pointer handling.
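A minimal sketch of the "functions outside the block" point above (the macro invocation, pragmas, and names here are illustrative, not necessarily this repo's exact API):

```nim
# Defined outside any GPU block: also usable from regular CPU code.
proc square(x: float32): float32 = x * x

cuda:
  proc applySquare(data: ptr UncheckedArray[float32], i: int) {.device.} =
    # `square` is not declared inside the `cuda` block; because it is called
    # here, the preprocess pass pulls its code (and any types it needs) in.
    data[i] = square(data[i])
```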
-d:debugCudanow compiles the CUDA code with debug symbols. Useful if one wants to e.g. runcompute-sanitizerwithmemcheckorcuda-gdb.cuModuleGetGlobalbinding to usev2as wellx += y->x = x + yfoo.ptrFieldby whatever we assign in the constructor to that field in the code. Only storage buffers (which are global) are allowed to be assigned as such (local pointers are extremely limited anyway). As they are global, the replacement works everywhere. We error for invalid pointer handling.