Tile-based GPU kernel experiments in Zig.
zigton explores a Triton-like programming model in plain Zig: one GPU block
owns one tile, kernel code performs tile operations, and the device API maps
those operations onto GPU threads.
Status: early prototype. NVIDIA/NVPTX only.
The project can compile Zig kernels to PTX, embed that PTX into a Zig host binary, load it through the CUDA Driver API, and launch kernels from Zig.
Current kernel build path:
kernel .zig file -> LLVM IR -> fix alias -> llc -> PTX -> @embedFile -> host
Kernel files are now modular. Each kernel source is compiled into its own PTX module, avoiding NVPTX object-linking complexity for now.
Phase 2 shared-memory and reduction primitives exist. Phase 3 is focused on host ergonomics, modular kernel files, same-file host/device experiments, and broader reduction APIs.
Device-side API:
- GlobalPtr(T) / ConstGlobalPtr(T)
- RegTile(T, ept) backed by @Vector(ept, T)
- load/store with strided, coalesced indexing
- per-element tail masking against n
- multiple elements per thread via EPT
- tile + scalar
- tile + tile
- requireBlock launch-geometry guard
- SharedTile(T, n) backed by static addrspace(.shared) storage
- blockSync() barrier wrapper lowering to bar.sync 0
- device-side blockReduceSum
- device-side blockReduceMax
- device-side blockReduce(.Add/.Max/.Min/.Mul, ...)
- shared target helpers: is_device, kernel_callconv
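
The strided indexing and tail masking listed above can be modeled on the CPU. The sketch below is not the zigton device API: `THREADS`, `EPT`, `elementIndex`, `loadTile`, and `addScalar` are hypothetical names, and the index formula is an assumption about what "strided, coalesced" means here (for each element slot, consecutive threads touch consecutive addresses).

```zig
const std = @import("std");

// Hypothetical geometry; the project's real values live in src/device/config.zig.
const THREADS: usize = 4;
const EPT: usize = 2;
const TILE: usize = THREADS * EPT;
const Tile = @Vector(EPT, f32);

/// Global index touched by thread `tid` of block `bid` for its `e`-th element.
/// For a fixed `e`, neighbouring threads read neighbouring addresses.
fn elementIndex(bid: usize, tid: usize, e: usize) usize {
    return bid * TILE + e * THREADS + tid;
}

/// CPU stand-in for a masked tile load: lanes at or past `src.len` get `fill`.
fn loadTile(src: []const f32, bid: usize, tid: usize, fill: f32) Tile {
    var regs: [EPT]f32 = undefined;
    for (0..EPT) |e| {
        const idx = elementIndex(bid, tid, e);
        regs[e] = if (idx < src.len) src[idx] else fill;
    }
    return regs; // arrays coerce to vectors of the same length
}

/// "tile + scalar": broadcast the scalar, then add lane-wise.
fn addScalar(tile: Tile, s: f32) Tile {
    return tile + @as(Tile, @splat(s));
}

pub fn main() void {
    // n = 10 does not fill the second tile of 8, so block 1 has masked lanes.
    const data = [_]f32{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
    const tile = loadTile(&data, 1, 1, -1.0);
    const bumped = addScalar(tile, 0.5);
    std.debug.print("{d} {d}\n", .{ bumped[0], bumped[1] });
}
```

Masking per element rather than per thread means a thread whose first element is in range and whose second is not still loads the valid one.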
Host-side API:
- Context
- DeviceBuffer(T)
- Module
- Kernel
- LaunchConfig
- kernelArgs
- Reducer owning reduction kernel lookup
- Reducer.sumF32 / Reducer.maxF32 / Reducer.minF32 / Reducer.prodF32
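
One common way for a host to orchestrate a reduction over more elements than one block can cover is to run repeated passes: each pass writes one partial result per block, and the partials are reduced again until a single value remains. Whether Reducer works this way is an assumption. The CPU model below uses hypothetical names (`TILE`, `reducePass`, `sumModel`) and plain slices in place of device buffers and kernel launches.

```zig
const std = @import("std");

const TILE: usize = 8; // hypothetical number of elements reduced per block

/// CPU stand-in for one kernel launch: block `b` sums its slice of `input`
/// into `partials[b]`. Returns the number of blocks (partials written).
fn reducePass(input: []const f32, partials: []f32) usize {
    const blocks = (input.len + TILE - 1) / TILE;
    for (0..blocks) |b| {
        const start = b * TILE;
        const end = @min(start + TILE, input.len);
        var acc: f32 = 0;
        for (input[start..end]) |x| acc += x;
        partials[b] = acc;
    }
    return blocks;
}

/// Repeats passes until one value remains, ping-ponging between two scratch
/// buffers. `a` must hold ceil(input.len / TILE) values; `b` holds the rest.
fn sumModel(input: []const f32, a: []f32, b: []f32) f32 {
    if (input.len == 0) return 0;
    var src: []const f32 = input;
    var dst = a;
    var other = b;
    while (true) {
        const n = reducePass(src, dst);
        if (n == 1) return dst[0];
        src = dst[0..n];
        const tmp = dst;
        dst = other;
        other = tmp;
    }
}

pub fn main() void {
    var input: [20]f32 = undefined;
    for (&input, 0..) |*x, i| x.* = @floatFromInt(i + 1);
    var a: [3]f32 = undefined;
    var b: [1]f32 = undefined;
    std.debug.print("{d}\n", .{sumModel(&input, &a, &b)});
}
```

The same loop shape applies to max, min, and product by swapping the accumulator's starting value and combining operation.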
Validated kernels/examples:
- vector_add
- fill
- add_scalar
- add_const_tile
- add_tile
- shared_copy
- block_sum
- block_max
- block_min
- block_mul
- examples/single_file.zig same-file host/device launch
src/root.zig Public host API entrypoint
src/target.zig Shared host/device target helpers
src/main.zig Minimal executable entrypoint
src/host/root.zig Host CUDA API aggregate
src/host/context.zig CUDA context wrapper
src/host/buffer.zig DeviceBuffer(T)
src/host/module.zig PTX module loading
src/host/kernel.zig Kernel wrapper
src/host/launch.zig CUDA launch config
src/host/args.zig Kernel argument packing
src/host/reduce.zig Host reduction orchestration
src/host/utils.zig Host math/CUDA helpers
src/device/root.zig Device/kernel API aggregate
src/device/regtile.zig Register tile API
src/device/sharedtile.zig Shared tile API
src/device/reduce.zig Device block reductions
src/device/config.zig Shared THREADS / EPT / TILE constants
kernels/base.zig Base prototype kernels
kernels/reduce.zig Reduction kernels
examples/single_file.zig Same-file host/device experiment
tests/base.zig Base kernel integration tests
tests/reduce.zig Reduction integration tests
tools/fix_ptx_ir.sh LLVM IR alias rewrite for NVPTX
Host code imports:
const zt = @import("zigton");
Device kernel code imports:
const zt = @import("zigton_device");
Requirements:
- Zig with NVPTX support
- CUDA driver/runtime installed
- NVIDIA GPU
- llc from an LLVM new enough to parse Zig's emitted LLVM IR
Build only PTX:
zig build ptx \
  -Dllc-path=/path/to/llc \
  -Dgpu-arch=sm_89
Run the app:
zig build run \
  -Dllc-path=/path/to/llc \
  -Dgpu-arch=sm_89 \
  -Dcuda-prefix=/path/to/cuda
Run tests:
zig build test \
  -Dllc-path=/path/to/llc \
  -Dgpu-arch=sm_89 \
  -Dcuda-prefix=/path/to/cuda
Run the same-file host/device experiment:
zig build single-file-example \
  -Dllc-path=/path/to/llc \
  -Dgpu-arch=sm_89 \
  -Dcuda-prefix=/path/to/cuda
Kernel sources are registered in build.zig with addKernelFile:
const reduce_kernel = addKernelFile(b, .{
    .name = "reduce",
    .source = b.path("kernels/reduce.zig"),
    .gpu_arch = gpu_arch,
    .llc_path = llc_path,
    .optimize = optimize,
});
The helper compiles that source to PTX and returns a KernelFile:
.{
    .name = "reduce",
    .import_name = "reduce_ptx",
    .ptx = ...,
}
Embed it into a host module with:
addPtxImport(gpu_tests_mod, reduce_kernel.import_name, reduce_kernel.ptx);
Then host code can load it normally:
const reduce_ptx: [:0]const u8 = @embedFile("reduce_ptx");
var module = try zt.Module.loadData(reduce_ptx);
defer module.deinit();
const block_sum = try module.kernel("block_sum");
Current model: one PTX module per kernel file. This keeps kernel files modular without requiring NVPTX object linking.
examples/single_file.zig is compiled twice:
examples/single_file.zig -> nvptx64-cuda -> PTX
examples/single_file.zig -> native host test -> embeds PTX
This lets one Zig file contain both a kernel and its host-side launch test.
The pattern uses shared target helpers:
const builtin = @import("builtin");
const zt = if (builtin.target.cpu.arch == .nvptx64)
    @import("zigton_device")
else
    @import("zigton");

pub export fn single_file_fill(...) callconv(zt.kernel_callconv) void {
    if (comptime !zt.is_device) return;
    // device code
}
This is not compiler-level offload. It is build-level double compilation plus explicit PTX embedding.
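
The effect of the `is_device` guard on a host build can be seen without a GPU. In the sketch below, the definition of `is_device` is an assumption modeled on the arch check shown above (the real helper lives in src/target.zig), and `fillBody` is a hypothetical stand-in for a kernel body.

```zig
const std = @import("std");
const builtin = @import("builtin");

// Assumed definition; the real helper is exported by the zigton target module.
const is_device = builtin.target.cpu.arch == .nvptx64;

/// Hypothetical stand-in for a kernel body.
fn fillBody(out: []f32, value: f32) void {
    // On a host build this condition is comptime-known to be true, so the
    // function returns here and the device-only code below does not run.
    if (comptime !is_device) return;
    for (out) |*x| x.* = value;
}

pub fn main() void {
    var buf = [_]f32{0} ** 4;
    fillBody(&buf, 1.0);
    // Compiled for the host, the buffer is left untouched.
    std.debug.print("{d}\n", .{buf[0]});
}
```

Because the check is evaluated at compile time, the host compilation only needs the kernel symbol to exist; the actual work is done by the PTX produced from the device compilation of the same file.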
Zig currently emits NVPTX kernels through an alias pattern that llc rejects:
@vector_add = alias void (...), ptr @gpu.vector_add
define private ptx_kernel void @gpu.vector_add(...) { ... }
tools/fix_ptx_ir.sh rewrites the IR so kernel definitions have public bare names before llc lowers them to PTX.
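
As a rough model of that rewrite, the per-line transformation might look like the following. This is not the logic of tools/fix_ptx_ir.sh: `rewriteLine` is a hypothetical function, and the exact text patterns it matches are assumptions taken from the two IR lines shown above.

```zig
const std = @import("std");

/// Returns the rewritten line, or null when the line should be dropped.
/// Hypothetical model of the rewrite; the real script may differ.
fn rewriteLine(buf: []u8, line: []const u8) !?[]const u8 {
    // Drop the alias that llc rejects.
    if (std.mem.indexOf(u8, line, "= alias ") != null) return null;

    // Make the kernel definition public and strip the "gpu." prefix.
    const prefix = "define private ptx_kernel void @gpu.";
    if (std.mem.startsWith(u8, line, prefix)) {
        const rest = line[prefix.len..];
        return try std.fmt.bufPrint(buf, "define ptx_kernel void @{s}", .{rest});
    }

    // Everything else passes through unchanged.
    return line;
}

pub fn main() !void {
    const lines = [_][]const u8{
        "@vector_add = alias void (...), ptr @gpu.vector_add",
        "define private ptx_kernel void @gpu.vector_add(...) { ... }",
    };
    var buf: [256]u8 = undefined;
    for (lines) |line| {
        if (try rewriteLine(&buf, line)) |out| std.debug.print("{s}\n", .{out});
    }
}
```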
Shared-memory kernels must keep barriers converged. A block barrier must be reached by every thread in a warp at the same program point; guarding work is safe, guarding the barrier is not.
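
The difference between guarding work and guarding the barrier can be shown with a CPU model of a shared-memory tree reduction. This is a sketch, not zigton's blockReduceSum: `THREADS`, `Result`, and `blockSumModel` are hypothetical, threads are simulated sequentially, and incrementing a per-thread counter stands in for reaching blockSync().

```zig
const std = @import("std");

const THREADS: usize = 8; // hypothetical block size, must be a power of two

const Result = struct { sum: f32, converged: bool };

/// Each outer iteration is one barrier phase. Every simulated thread runs the
/// loop body, but only threads with `tid < stride` do any work.
fn blockSumModel(shared: *[THREADS]f32) Result {
    var barriers = [_]usize{0} ** THREADS;
    var phases: usize = 0;
    var stride: usize = THREADS / 2;
    while (stride > 0) : (stride /= 2) {
        for (0..THREADS) |tid| {
            // Safe: the guard covers only the work.
            if (tid < stride) shared[tid] += shared[tid + stride];
            // The barrier stand-in sits outside the guard, so every thread reaches it.
            barriers[tid] += 1;
        }
        phases += 1;
    }
    var converged = true;
    for (barriers) |count| {
        if (count != phases) converged = false;
    }
    return .{ .sum = shared[0], .converged = converged };
}

pub fn main() void {
    var shared = [_]f32{ 1, 2, 3, 4, 5, 6, 7, 8 };
    const r = blockSumModel(&shared);
    std.debug.print("sum={d} converged={}\n", .{ r.sum, r.converged });
}
```

Moving the counter increment inside the `if (tid < stride)` guard would leave the idle threads with fewer barrier hits than the active ones, which is the divergence the note above warns about.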