
WNF in CUDA #33

@Lorenzobattistela

Description

The idea for the GPU implementation of HVM4 is to use deterministic sharding by index and address region: no work stealing, and each pass consumes a frontier array. Writes follow region ownership, meaning a block only writes inside its own region, and any DUP that requires writing outside the region is deferred.

Regions

We partition the heap into fixed contiguous regions by address:

region_id = loc / REGION_SIZE

Each region is assigned to exactly one block. Blocks can read any region, but only write inside their own. This avoids cross-block atomics on shared DUP slots and allows safe deferral of DUPs.

Sharding

We use a frontier array per pass; each pass processes the current frontier and builds the next one:

frontier = [root_loc]
while frontier not empty:
  clear next_frontier (per region)
  launch kernel(frontier, next_frontier)
  frontier = concat(next_frontier[all regions])

Ordering is fixed, and the kernel is embarrassingly parallel over the frontier array.

Example:
@main = #Pair{#Pair{1, 2}, #Pair{3, 4}}

Conceptual heap:

[0] := Pair (loc1, loc2)
[1] := Pair (loc3, loc4)
[2] := Pair (loc5, loc6)
[3] := 1
[4] := 2
[5] := 3
[6] := 4

Frontier passes:

  • Pass 0:
    frontier = [0]
    kernel processes loc 0, enqueues children 1, 2.
    next frontier = [1, 2]

  • Pass 1:
    frontier = [1, 2]
    kernel processes loc 1, enqueues 3, 4
    kernel processes loc 2, enqueues 5, 6
    next frontier = [3, 4, 5, 6]

  • Pass 2:
    frontier = [3, 4, 5, 6]
    kernel processes all, enqueues nothing, and the loop stops.

A BFS-like expansion.

Each thread reduces one task using a local WNF step and, when it needs subterms, enqueues them into the next frontier.

Handling DUPs (Option 1)

We use a non-blocking try_take on the expression slot.

If this thread owns the slot's region, it may attempt to take and write SUB. Otherwise, it must defer to the owner if SUB is not already there.

This introduces a synchronization point across blocks for deferral, so it is not a zero-communication option.

try_take must be atomic and non-blocking. A non-owner never writes; it defers if SUB is not present. If a non-owner sees SUB, it can read it and continue reducing.

Handling DUPs (Option 2)

If we want zero inter-block messaging, we can drop the deferral and allow any thread to resolve DUP slots using global atomics. This removes region ownership for DUP slots: any thread that reaches DP0/DP1 may attempt try_take on the shared expression slot.

The winner writes SUB; losers read SUB and continue. This introduces higher contention on hot DUP slots and on the global atomic.

It trades determinism and locality for zero messaging.

Example: handling DUP with deferral

Term:

@main = ! x &L= &L{1,2}; #Pair{ x0, #Pair{3, x1} }

Heap:

[0]  := DUP (expr: 1, body: 2) : Block 0
[1]  := SUP (1, 2)             : Block 0
[2]  := Pair (loc 8, loc 16)   : Block 0
[8]  := DP0 (expr: 1)          : Block 1
[16] := Pair (loc 17, loc 18)  : Block 2
[17] := NUM 3                  : Block 2
[18] := DP1 (expr: 1)          : Block 2

Assuming 3 blocks, 4 threads per block, region_size = 8.

Pass 0 (frontier = [0]):

  • B0/t0 handles DUP, WNF returns body loc 2.
  • Enqueue children of loc 2 (8, 16)

Pass 1:
frontier_region_1 = [8]
frontier_region_2 = [16]

  • B1/t0 handles loc 8 (DP0)

    • expr loc 1 is owned by B0 and SUB is not present => defer loc 8 to region 0.
  • B2/t0 handles loc 16 (Pair)

    • enqueue loc 17 (NUM) and loc 18 (DP1)

Pass 2:
frontier_region_0 = [8] // deferred DP0
frontier_region_2 = [18]

  • B0/t0 handles loc 8
    • owner of loc 1, try_take succeeds, writes SUB and DP0 reduces to 1.
  • B2/t0 handles loc 18
    • reads expr slot loc 1, SUB is there, DP1 reduces to 2

If B2 reaches loc 18 before B0 resolves loc 8, it would defer that loc again to B0, which would resolve it in the next pass.

There is no cross-block spinning.

FFI and primitives

Inside %gnf, if a primitive is encountered, we abort GPU execution for simplicity.

GNF and NF semantics

%gnf:
When evaluating a program, the CPU evaluates up to %gnf, then suspends until the GPU finishes.

The GPU runs wnf/snf on the given term with exclusive access to the heap. When it completes, the CPU resumes with the updated heap.

Nested %gnf calls are flattened (if execution is already on the GPU, they are treated as no-ops).

%nf:
Runs on the CPU only. If called while the CPU is already running SNF with a thread pool, I think we ideally want to reuse the existing pool, because spawning a new one would interfere with memory management.

CUDA specifics

CUDA likewise cannot launch grids from device code in a portable way (dynamic parallelism exists but is not assumed here), so we mirror the Metal flow: one pass per kernel launch, with the CPU orchestrating the frontier loop.

Memory model differences:

  • CUDA typically uses separate host/device memory, so %gnf must copy (or pre-stage) the subgraph into device memory before launching, and copy results back afterwards unless Unified Memory is used.
  • Unified Memory can simplify this but may introduce page migration stalls; explicit cudaMemcpy often yields more predictable performance.

Atomics:

  • 64-bit atomics are available on modern NVIDIA GPUs, but the exact capabilities depend on compute capability.
  • If 64-bit atomics are missing or too slow, a split 2x32-bit encoding is required (out of scope here).

Kernel launch:

  • Same frontier model: dispatch frontier_len threads, each runs WNF and appends children to next_frontier via atomic increments.
  • Use one kernel per pass to avoid grid-wide synchronization problems.

Buffer sizing:

  • Ensure frontier and next_frontier are sized for the worst-case breadth of the term (or use a spill policy).
  • For large graphs, streaming multiple frontier segments is possible but adds overhead.
