Add setup.py to build the _C extension (CUDA and ROCm)#12
Open
jeffdaily wants to merge 2 commits into
Open
Conversation
The faithcontour._C extension has no build wiring on main: there is no
setup.py and pyproject declares only a pure-Python package, so a source
install never compiles _C and "from . import _C" fails at import. This
adds a setup.py with a torch CUDAExtension/BuildExtension over
_C/{bindings.cpp,kernels.cu}. On a CUDA PyTorch it builds the original
CUDA sources unchanged; on a ROCm PyTorch BuildExtension hipifies the
same sources automatically, so the extension builds for AMD GPUs with no
source changes. This complements the existing ROCm kernel/runtime
support already on main by providing the missing compiled artifact.
setup.py also carries a Windows-only /ALTERNATENAME linker directive:
c10.dll does not export the inherited c10::ValueError(SourceLocation,
string) constructor that <torch/extension.h> references, so the import
thunk is aliased to the exported c10::Error(SourceLocation, string)
(ValueError IS-A Error with no extra data members).
kernels.cu converts the int64 index and candidate buffer types and casts
from long to int64_t. This is a no-op on LP64 Linux (long == int64_t) but
is required on Windows LLP64 where long is 32-bit while the torch int64
tensors backing these buffers are 64-bit.
.gitignore adds the hipify byproducts (*.hip, *.prehip) and versioned
shared objects (*.so.*) so a ROCm build leaves the tree clean.
This work was authored with the assistance of Claude, an AI assistant.
Test Plan:
```
rm -f src/faithcontour/_C/kernels.hip src/faithcontour/_C/*.prehip
rm -rf build
cd src && HIP_VISIBLE_DEVICES=0 PYTORCH_ROCM_ARCH=gfx90a \
python setup.py build_ext --inplace
HIP_VISIBLE_DEVICES=0 python3 agent_space/faithc_harness.py
```
Built cleanly on gfx90a (AMD Instinct MI250X, ROCm 7.2); the harness
drives all four _C bindings on GPU against a torch CPU reference and
reports all checks PASS.
PyTorch's BuildExtension on Windows adds .cu/.cuh to the MSVC compiler driver's _cpp_extensions list so the spawn wrapper can intercept those files and route them to hipcc instead of cl.exe. However it does not add .hip. After a PyTorch update (torch 2.9.1+rocm7.14, Jun 2026), the hipify step renames kernels.cu to kernels.hip before MSVC's compile loop runs, and the MSVC driver raises "Don't know how to compile *.hip" because .hip is absent from _cpp_extensions. Fix by subclassing BuildExtension and appending .hip to _cpp_extensions on Windows before delegating to the parent, which installs the spawn wrapper that routes .hip -> hipcc. This fix is Windows-only (the guard checks sys.platform == "win32" and the hasattr guard is a no-op on Linux where clang is the host compiler). Authored with the assistance of Claude, an AI assistant. Test Plan: ``` export PATH="/c/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.44.35207/bin/HostX64/x64:$PATH" VENV=/b/develop/TheRock/external-builds/pytorch/.venv rm -f src/faithcontour/_C/kernels.hip && rm -rf build/ HIP_VISIBLE_DEVICES=0 PYTORCH_ROCM_ARCH=gfx1201 \ ROCM_HOME=$VENV/Lib/site-packages/_rocm_sdk_devel \ DISTUTILS_USE_SDK=1 \ $VENV/Scripts/python.exe setup.py build_ext --inplace HIP_VISIBLE_DEVICES=0 $VENV/Scripts/python.exe agent_space/faithc_harness_win.py ``` Built for gfx1201 (AMD Radeon RX 9070 XT, RDNA4, Windows 11); the harness reports 17/17 PASS on all four _C kernel bindings.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The tree ships the
_Csources (bindings.cpp,kernels.cu) andops.pyhard-imports them (from . import _C), but the only build config is apyproject.tomlthat declares a pure-Python package with no extension module. A clean source install (pip install -e ., or pixi) therefore never compiles_C, sofrom . import _Cfails. This adds the missing build wiring.setup.pybuilds_Cwith PyTorch'sCUDAExtensionandBuildExtension. On a CUDA PyTorch it compiles the original CUDA sources unchanged; on a ROCm PyTorchBuildExtensionhipifies the same sources automatically, so one source tree builds for both backends. This complements the ROCm kernel/runtime support already inmain: that made the kernels HIP-clean, and this makes the extension actually build (on CUDA and ROCm alike).The kernels use only
atomicAdd,__syncthreads, dynamic shared memory and float math, with no warp-level intrinsics, so they are wavefront-size agnostic and need no per-architecture changes.Two further changes make the build correct on Windows (both are no-ops on Linux and CUDA):
long, which is 32-bit on Windows (LLP64) whiletorch::kInt64tensors are 64-bit; the kernel signatures,data_ptr<>()calls and casts now useint64_t, which is correct on every platform and identical tolongon Linux LP64.setup.pyadds a Windows-only/ALTERNATENAMElink directive.c10.dll, built with clang-cl, does not export thec10::ValueError(SourceLocation, std::string)constructor inherited viausing Error::Error;, so headers pulled in through<torch/extension.h>that expandTORCH_CHECK_VALUEfail to link (LNK2001); the directive aliases the missing import thunk to the exportedc10::Error(SourceLocation, std::string)constructor. The same root cause was fixed upstream in [c10] Fix missing symbol exports for ValueError/NotImplementedError on Windows pytorch/pytorch#175340 (explicit exported constructors for the affectedc10::Errorsubclasses); this alias keeps the extension building on PyTorch releases from before that fix.Building
The README's Manual Setup section documents the ROCm path alongside the existing CUDA instructions.
.gitignoreis extended to cover the hipify build artifacts (*.hip,*.prehip,*.so.*).Validation
Built on an AMD Instinct MI250X (gfx90a, ROCm 7.2): a clean
setup.py build_exthipifies, compiles and links_C(the.socarries a native gfx90a code object). A synthetic-tensor harness drives all four_Cbindings on the GPU and compares against a pure-torch CPU reference (theatomicAddoutput-slotting kernels as order-independent(a, t)pair sets, the deterministic kernels exactly and for rerun stability); all checks pass, with Moller-Trumbore dot-product drift of 3.5e-7 within the kernels' eps thresholds. Agfx90a;gfx1100multi-architecture binary also builds with both code objects present.The end-to-end demo additionally depends on
atom3dandtorch_scatteron the GPU; bringing those up on ROCm is left as a follow-up, so this change covers the_Ckernel layer those higher-level paths call into.