Fast kernel library for Diffusion inference with multiple compute backends.
| Function | eager | cuda | triton |
|---|---|---|---|
| quantize_per_tensor_fp8 | ✓ | ✓ | ✓ |
| dequantize_per_tensor_fp8 | ✓ | ✓ | ✓ |
| quantize_nvfp4 | ✓ | ✓ | ✓ |
| dequantize_nvfp4 | ✓ | ✓ | |
| scaled_mm_nvfp4 | ✓ | ✓ | |
| apply_rope | ✓ | ✓ | ✓ |
| apply_rope1 | ✓ | ✓ | ✓ |
The library provides QuantizedTensor, a torch.Tensor subclass that transparently intercepts PyTorch operations and dispatches them to optimized quantized kernels when available.
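The interception relies on PyTorch's standard tensor-subclass hook. The sketch below is a hypothetical illustration of that mechanism only, not comfy-kitchen's implementation; `QUANTIZED_OPS` and `SketchTensor` are made-up names:

```python
import torch

# Hypothetical illustration of __torch_function__ dispatch; the real
# QuantizedTensor differs. QUANTIZED_OPS would map torch ops to
# optimized quantized kernels.
QUANTIZED_OPS = {}

class SketchTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        impl = QUANTIZED_OPS.get(func)
        if impl is not None:
            return impl(*args, **kwargs)  # route to an optimized kernel
        # Otherwise fall back to PyTorch's default behavior.
        return super().__torch_function__(func, types, args, kwargs)
```

Routing at the `__torch_function__` layer is what lets existing `torch.nn.functional` calls work unchanged on quantized inputs.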
| Layout | Format | HW Requirement | Description |
|---|---|---|---|
| TensorCoreFP8Layout | FP8 E4M3 | SM ≥ 8.9 (Ada) | Per-tensor scaling, 1:1 element mapping |
| TensorCoreNVFP4Layout | NVFP4 E2M1 | SM ≥ 10.0 (Blackwell) | Block quantization with 16-element blocks |
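For intuition, per-tensor FP8 quantization reduces to a single scale for the whole tensor. Here is a minimal sketch of that math in plain PyTorch, independent of comfy-kitchen's API (448.0 is the largest representable float8_e4m3fn value):

```python
import torch

# Sketch of per-tensor FP8 E4M3 math, not comfy-kitchen API.
x = torch.randn(128, 256)
scale = x.abs().amax().clamp(min=1e-12) / 448.0  # one scale per tensor
q = (x / scale).to(torch.float8_e4m3fn)          # 1:1 element mapping
dq = q.to(torch.float32) * scale                 # dequantize
print((x - dq).abs().max())                      # small round-trip error
```

The library wraps this machinery behind QuantizedTensor: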
```python
import torch
from comfy_kitchen.tensor import QuantizedTensor, TensorCoreFP8Layout, TensorCoreNVFP4Layout

# Quantize a tensor
x = torch.randn(128, 256, device="cuda", dtype=torch.bfloat16)
qt = QuantizedTensor.from_float(x, TensorCoreFP8Layout)

# Quantize a weight the same way so both operands are QuantizedTensors
weight = torch.randn(512, 256, device="cuda", dtype=torch.bfloat16)
weight_qt = QuantizedTensor.from_float(weight, TensorCoreFP8Layout)

# Operations dispatch to optimized kernels automatically
output = torch.nn.functional.linear(qt, weight_qt)

# Dequantize back to float
dq = qt.dequantize()
```

```bash
# Install default (Linux/Windows/macOS)
pip install comfy-kitchen

# Install with CUBLAS for NVFP4 (Blackwell+)
pip install comfy-kitchen[cublas]
```

- CUDA wheels: Linux x86_64 and Windows x64
- Pure Python wheel: Any platform, eager and triton backends only
Wheels are built for Python 3.10, 3.11, and 3.12+ (using Stable ABI for 3.12+).
```bash
# Standard installation with CUDA support
pip install .

# Development installation
pip install -e ".[dev]"

# For faster rebuilds during development (skip build isolation)
pip install -e . --no-build-isolation -v
```

These options require using setup.py directly (not pip install):
| Option | Command | Description | Default |
|---|---|---|---|
| `--no-cuda` | `python setup.py bdist_wheel --no-cuda` | Build CPU-only wheel (py3-none-any) | Enabled (build with CUDA) |
| `--cuda-archs=...` | `python setup.py build_ext --cuda-archs="80;89"` | CUDA architectures to build for | `75-virtual;80;89;90a;100f;120f` (Linux), `75-virtual;80;89;120f` (Windows) |
| `--debug-build` | `python setup.py build_ext --debug-build` | Build in debug mode with symbols | Disabled (Release) |
| `--lineinfo` | `python setup.py build_ext --lineinfo` | Enable NVCC line info for profiling | Disabled |
```bash
# Build CPU-only wheel (pure Python, no CUDA required)
python setup.py bdist_wheel --no-cuda

# Build with custom CUDA architectures
python setup.py build_ext --cuda-archs="80;89" bdist_wheel

# Debug build with line info for profiling
python setup.py build_ext --debug-build --lineinfo bdist_wheel
```

- Python: ≥3.10
- PyTorch: ≥2.5.0
- CUDA Runtime (for CUDA wheels): ≥13.0
- Pre-built wheels require NVIDIA Driver r580+
- Building from source requires CUDA Toolkit ≥12.8 and the `CUDA_HOME` environment variable
- nanobind: ≥2.0.0 (for building from source)
- CMake: ≥3.18 (for building from source)
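To sanity-check an environment against these requirements, plain PyTorch calls suffice (nothing below is comfy-kitchen specific):

```python
import torch

# Environment sanity check using standard PyTorch APIs.
print(torch.__version__)       # needs >= 2.5.0
print(torch.version.cuda)      # CUDA runtime this PyTorch build targets
if torch.cuda.is_available():
    # (8, 9) = SM 8.9 (Ada, FP8); (10, 0) = SM 10.0 (Blackwell, NVFP4)
    print(torch.cuda.get_device_capability(0))
```

With the environment in place, the quick start below exercises the library itself.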
```python
import comfy_kitchen as ck
import torch

# Automatic backend selection (triton -> cuda -> eager)
x = torch.randn(100, 100, device="cuda")
scale = torch.tensor([1.0], device="cuda")
result = ck.quantize_per_tensor_fp8(x, scale)

# Check which backends are available
print(ck.list_backends())

# Force a specific backend
result = ck.quantize_per_tensor_fp8(x, scale, backend="eager")

# Temporarily use a different backend
with ck.use_backend("triton"):
    result = ck.quantize_per_tensor_fp8(x, scale)
```

The library supports multiple backends:
- eager: Pure PyTorch implementation
- cuda: Custom CUDA C kernels (CUDA only)
- triton: Triton JIT-compiled kernels
When you call a function, the registry selects the best backend by checking constraints in priority order (cuda → triton → eager):
```python
# Backend is selected automatically based on input constraints
result = ck.quantize_per_tensor_fp8(x, scale)
# On CPU tensors → falls back to eager (only backend supporting CPU)
# On CUDA tensors → uses cuda or triton (higher priority)
```

Each backend declares constraints for its functions:
| Constraint | Description |
|---|---|
| Device | Which device types are supported |
| Dtype | Allowed input/output dtypes per parameter |
| Shape | Shape requirements (e.g., 2D tensors, dimensions divisible by 16) |
| Compute Capability | Minimum GPU architecture (e.g., SM 8.0 for FP8, SM 10.0 for NVFP4) |
The registry validates inputs against these constraints before calling a backend, with no try/except fallback patterns. If no backend can handle the inputs, a NoCapableBackendError is raised with details.
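For a concrete picture of this flow, here is a hedged sketch of priority-ordered, constraint-checked dispatch. The registry layout, helper names, and the locally defined error class are illustrative assumptions, not comfy-kitchen internals:

```python
import torch

# Illustrative dispatch sketch; comfy-kitchen's real registry differs.
class NoCapableBackendError(RuntimeError):
    """Stands in for the library's error of the same name."""

PRIORITY = ("cuda", "triton", "eager")
REGISTRY = {}  # (func_name, backend) -> (constraint_fn, impl_fn)

def register(name, backend, constraint_fn, impl_fn):
    REGISTRY[(name, backend)] = (constraint_fn, impl_fn)

def dispatch(name, *args, **kwargs):
    # Walk backends in priority order; validate constraints up front
    # instead of catching failures after the fact.
    for backend in PRIORITY:
        entry = REGISTRY.get((name, backend))
        if entry and entry[0](*args, **kwargs):
            return entry[1](*args, **kwargs)
    raise NoCapableBackendError(f"no backend can handle inputs for {name!r}")

# Toy constraints: cuda requires a CUDA tensor, eager accepts anything.
register("quantize", "cuda", lambda x: x.is_cuda, lambda x: x)
register("quantize", "eager", lambda x: True, lambda x: x)

dispatch("quantize", torch.randn(4))  # CPU input -> falls through to eager
```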
```python
# Debug logging to see backend selection
import logging
logging.getLogger("comfy_kitchen.dispatch").setLevel(logging.DEBUG)
```

Run the test suite with pytest:
```bash
# Run all tests
pytest

# Run specific test file
pytest tests/test_backends.py

# Run with verbose output
pytest -v

# Run specific test
pytest tests/test_backends.py::TestBackendSystem::test_list_backends
```