
Add setup.py to build the _C extension (CUDA and ROCm)#12

Open
jeffdaily wants to merge 2 commits into Luo-Yihao:main from jeffdaily:moat-port


jeffdaily commented Jun 17, 2026


The tree ships the _C sources (bindings.cpp, kernels.cu) and ops.py hard-imports them (from . import _C), but the only build config is a pyproject.toml that declares a pure-Python package with no extension module. A clean source install (pip install -e ., or pixi) therefore never compiles _C, so from . import _C fails. This adds the missing build wiring.

setup.py builds _C with PyTorch's CUDAExtension and BuildExtension. On a CUDA PyTorch it compiles the original CUDA sources unchanged; on a ROCm PyTorch BuildExtension hipifies the same sources automatically, so one source tree builds for both backends. This complements the ROCm kernel/runtime support already in main: that made the kernels HIP-clean, and this makes the extension actually build (on CUDA and ROCm alike).
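
As a rough illustration of the wiring described above (a minimal sketch, not the setup.py from this PR): the source directory `src/faithcontour/_C` is inferred from the test plan, and the helper names `extension_sources` and `build_setup_kwargs` are hypothetical. The torch import is deferred so the path helper can be exercised without PyTorch installed.

```python
from pathlib import Path


def extension_sources(pkg_dir="src/faithcontour/_C"):
    """Return the two source files that make up the _C extension.

    The default directory is an assumption based on the paths in the test plan.
    """
    root = Path(pkg_dir)
    return [(root / name).as_posix() for name in ("bindings.cpp", "kernels.cu")]


def build_setup_kwargs(pkg_dir="src/faithcontour/_C"):
    """Build keyword arguments that a setup.py could pass to setuptools.setup()."""
    # Imported lazily: PyTorch must be importable in the build environment.
    from torch.utils.cpp_extension import BuildExtension, CUDAExtension

    ext = CUDAExtension(
        name="faithcontour._C",
        sources=extension_sources(pkg_dir),
    )
    # Per the PR description, a ROCm build of PyTorch hipifies these same
    # sources, so no backend-specific branch is needed here.
    return {"ext_modules": [ext], "cmdclass": {"build_ext": BuildExtension}}
```

A setup.py following this sketch would call `setup(..., **build_setup_kwargs())`; the real file in this PR also carries the Windows-specific handling described in the rest of this description.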

The kernels use only atomicAdd, __syncthreads, dynamic shared memory and float math, with no warp-level intrinsics, so they are wavefront-size agnostic and need no per-architecture changes.

Two further changes make the build correct on Windows (both are no-ops on Linux and CUDA):

  • The int64 index/candidate buffers were typed long, which is 32-bit on Windows (LLP64) while torch::kInt64 tensors are 64-bit; the kernel signatures, data_ptr<>() calls and casts now use int64_t, which is correct on every platform and identical to long on Linux LP64.
  • setup.py adds a Windows-only /ALTERNATENAME link directive. c10.dll, built with clang-cl, does not export the c10::ValueError(SourceLocation, std::string) constructor inherited via using Error::Error;, so headers pulled in through <torch/extension.h> that expand TORCH_CHECK_VALUE fail to link (LNK2001); the directive aliases the missing import thunk to the exported c10::Error(SourceLocation, std::string) constructor. The same root cause was fixed upstream in pytorch/pytorch#175340 ("[c10] Fix missing symbol exports for ValueError/NotImplementedError on Windows"), which adds explicit exported constructors for the affected c10::Error subclasses; this alias keeps the extension building on PyTorch releases from before that fix.

Building

# CUDA (unchanged)
pip install -e . --no-build-isolation

# ROCm (set the arch(es) for your GPU)
PYTORCH_ROCM_ARCH=gfx90a pip install -e . --no-build-isolation

The README's Manual Setup section documents the ROCm path alongside the existing CUDA instructions. .gitignore is extended to cover the hipify build artifacts (*.hip, *.prehip, *.so.*).
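
Before choosing one of the two commands above, it can help to confirm which backend the installed PyTorch targets and which architectures a ROCm build would compile for. The following is a small sketch under stated assumptions: `describe_backend`, `installed_backend` and `rocm_arch_list` are hypothetical helper names, not part of this PR. It relies only on `torch.version.cuda` and `torch.version.hip`, and on the semicolon-separated `PYTORCH_ROCM_ARCH` format used elsewhere in this description (`gfx90a;gfx1100`).

```python
import os


def describe_backend(cuda_version, hip_version):
    """Classify a PyTorch build from its torch.version.cuda / torch.version.hip values."""
    if hip_version is not None:
        return "rocm"
    if cuda_version is not None:
        return "cuda"
    return "cpu"


def installed_backend():
    """Inspect the installed PyTorch (imported lazily so the helpers stay testable)."""
    import torch

    return describe_backend(
        getattr(torch.version, "cuda", None),
        getattr(torch.version, "hip", None),
    )


def rocm_arch_list(environ=os.environ):
    """Split PYTORCH_ROCM_ARCH into individual gfx targets, dropping empty entries."""
    raw = environ.get("PYTORCH_ROCM_ARCH", "")
    return [arch for arch in raw.split(";") if arch]
```

For example, `rocm_arch_list({"PYTORCH_ROCM_ARCH": "gfx90a;gfx1100"})` yields the two targets of the multi-architecture build mentioned under Validation.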

Validation

Built on an AMD Instinct MI250X (gfx90a, ROCm 7.2): a clean setup.py build_ext hipifies, compiles and links _C (the .so carries a native gfx90a code object). A synthetic-tensor harness drives all four _C bindings on the GPU and compares them against a pure-torch CPU reference: the atomicAdd output-slotting kernels are compared as order-independent (a, t) pair sets, and the deterministic kernels are checked for exact equality and for stability across reruns. All checks pass; the Möller-Trumbore dot-product drift of 3.5e-7 is within the kernels' eps thresholds. A gfx90a;gfx1100 multi-architecture binary also builds, with both code objects present.
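
The order-independent comparison matters because kernels that claim output slots with atomicAdd can write the same results in a different order on each run. The sketch below shows one way such a check could look; it is not the harness used for this PR, and `same_pairs` is a hypothetical name. It uses a multiset (`Counter`) rather than a plain set, which is slightly stricter because it also catches duplicated or dropped pairs.

```python
from collections import Counter


def same_pairs(gpu_pairs, ref_pairs):
    """Return True when both sequences hold the same (a, t) pairs, ignoring order.

    Each pair may be a tuple or a two-element list; it is normalised to a
    tuple so that it can be counted.
    """
    return Counter(map(tuple, gpu_pairs)) == Counter(map(tuple, ref_pairs))


# The GPU output arrives in a different order than the CPU reference.
gpu = [(2, 3), (0, 1), (5, 8)]
ref = [(0, 1), (2, 3), (5, 8)]
print(same_pairs(gpu, ref))
```

With tensors, the same idea applies after moving the results to the CPU and converting them to a list of pairs, for example via `.tolist()`.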

The end-to-end demo additionally depends on atom3d and torch_scatter on the GPU; bringing those up on ROCm is left as a follow-up, so this change covers the _C kernel layer those higher-level paths call into.

The faithcontour._C extension has no build wiring on main: there is no
setup.py and pyproject declares only a pure-Python package, so a source
install never compiles _C and "from . import _C" fails at import. This
adds a setup.py with a torch CUDAExtension/BuildExtension over
_C/{bindings.cpp,kernels.cu}. On a CUDA PyTorch it builds the original
CUDA sources unchanged; on a ROCm PyTorch BuildExtension hipifies the
same sources automatically, so the extension builds for AMD GPUs with no
source changes. This complements the existing ROCm kernel/runtime
support already on main by providing the missing compiled artifact.

setup.py also carries a Windows-only /ALTERNATENAME linker directive:
c10.dll does not export the inherited c10::ValueError(SourceLocation,
string) constructor that <torch/extension.h> references, so the import
thunk is aliased to the exported c10::Error(SourceLocation, string)
(ValueError IS-A Error with no extra data members).

kernels.cu converts the int64 index and candidate buffer types and casts
from long to int64_t. This is a no-op on LP64 Linux (long == int64_t) but
is required on Windows LLP64 where long is 32-bit while the torch int64
tensors backing these buffers are 64-bit.
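
The width mismatch can be observed without compiling anything. This short sketch uses Python's ctypes, whose `c_long` mirrors the C `long` of the platform the interpreter was built for; the helper name `long_matches_int64` is hypothetical.

```python
import ctypes


def long_matches_int64():
    """Return True when C `long` has the same width as int64_t on this platform."""
    return ctypes.sizeof(ctypes.c_long) == ctypes.sizeof(ctypes.c_int64)


# int64_t is 8 bytes everywhere; `long` is 8 bytes on LP64 Linux but 4 bytes
# on LLP64 Windows, which is why the kernels cannot index int64 tensors via `long`.
print("sizeof(long)    =", ctypes.sizeof(ctypes.c_long))
print("sizeof(int64_t) =", ctypes.sizeof(ctypes.c_int64))
print("long is a safe stand-in for int64_t here:", long_matches_int64())
```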

.gitignore adds the hipify byproducts (*.hip, *.prehip) and versioned
shared objects (*.so.*) so a ROCm build leaves the tree clean.

This work was authored with the assistance of Claude, an AI assistant.

Test Plan:
```
rm -f src/faithcontour/_C/kernels.hip src/faithcontour/_C/*.prehip
rm -rf build
cd src && HIP_VISIBLE_DEVICES=0 PYTORCH_ROCM_ARCH=gfx90a \
    python setup.py build_ext --inplace
HIP_VISIBLE_DEVICES=0 python3 agent_space/faithc_harness.py
```
Built cleanly on gfx90a (AMD Instinct MI250X, ROCm 7.2); the harness
drives all four _C bindings on GPU against a torch CPU reference and
reports all checks PASS.
jeffdaily changed the title from "Add AMD GPU (ROCm) support for the faithcontour._C extension" to "Add setup.py to build the _C extension (CUDA and ROCm)" on Jun 17, 2026
PyTorch's BuildExtension on Windows adds .cu/.cuh to the MSVC compiler
driver's _cpp_extensions list so the spawn wrapper can intercept those
files and route them to hipcc instead of cl.exe. However it does not add
.hip. After a PyTorch update (torch 2.9.1+rocm7.14, Jun 2026), the hipify
step renames kernels.cu to kernels.hip before MSVC's compile loop runs,
and the MSVC driver raises "Don't know how to compile *.hip" because .hip
is absent from _cpp_extensions. Fix by subclassing BuildExtension and
appending .hip to _cpp_extensions on Windows before delegating to the
parent, which installs the spawn wrapper that routes .hip -> hipcc.

This fix is Windows-only: it is guarded by sys.platform == "win32" and by a
hasattr check, so it does nothing on Linux, where clang is the host compiler.
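
A minimal sketch of the subclassing approach described above, not the code from this commit. It assumes the list in question is reachable as `self.compiler._cpp_extensions` inside `build_extensions`; the names `add_hip_suffix`, `make_build_ext` and `HipAwareBuildExtension` are hypothetical. The torch import is deferred so the guard logic can be tested on its own.

```python
import sys


def add_hip_suffix(compiler, platform=sys.platform):
    """Teach a Windows compiler driver about .hip files; do nothing elsewhere."""
    if platform == "win32" and hasattr(compiler, "_cpp_extensions"):
        if ".hip" not in compiler._cpp_extensions:
            # Rebind instead of mutating in place, so a list shared at class
            # level is not changed for unrelated compiler instances.
            compiler._cpp_extensions = list(compiler._cpp_extensions) + [".hip"]
    return compiler


def make_build_ext():
    """Create a BuildExtension subclass that registers .hip before building."""
    from torch.utils.cpp_extension import BuildExtension

    class HipAwareBuildExtension(BuildExtension):
        def build_extensions(self):
            # Register the suffix first, then let the parent install its
            # compile wrappers and run the actual build.
            add_hip_suffix(self.compiler)
            super().build_extensions()

    return HipAwareBuildExtension
```

In a setup.py following this sketch, the returned class would be passed as `cmdclass={"build_ext": make_build_ext()}` in place of the stock BuildExtension.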

Authored with the assistance of Claude, an AI assistant.

Test Plan:
```
export PATH="/c/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.44.35207/bin/HostX64/x64:$PATH"
VENV=/b/develop/TheRock/external-builds/pytorch/.venv
rm -f src/faithcontour/_C/kernels.hip && rm -rf build/
HIP_VISIBLE_DEVICES=0 PYTORCH_ROCM_ARCH=gfx1201 \
  ROCM_HOME=$VENV/Lib/site-packages/_rocm_sdk_devel \
  DISTUTILS_USE_SDK=1 \
  $VENV/Scripts/python.exe setup.py build_ext --inplace
HIP_VISIBLE_DEVICES=0 $VENV/Scripts/python.exe agent_space/faithc_harness_win.py
```
Built for gfx1201 (AMD Radeon RX 9070 XT, RDNA4, Windows 11); the harness
reports 17/17 PASS on all four _C kernel bindings.