Summary
Loading any GGUF model (--quantization gguf) fails because the ggml_* custom ops are never registered with the PyTorch dispatcher:
AttributeError: '_OpNamespace' '_C' object has no attribute 'ggml_dequantize'.
Did you mean: 'awq_dequantize'?
...
RuntimeError: Engine core initialization failed.
Root cause
The GGUF kernels are compiled but not registered:
- csrc/quantization/gguf/gguf_kernel.cu defines ggml_dequantize, ggml_mul_mat_a8, ggml_mul_mat_vec_a8, ggml_moe_a8, ggml_moe_a8_vec, ggml_moe_get_block_size.
- CMakeLists.txt compiles gguf_kernel.cu into the _C extension, so the symbols are present in the .so.
- But csrc/torch_bindings.cpp has no ggml_* ops.def/ops.impl lines. AWQ/GPTQ register both symbol and schema; ggml only got the symbol.
Result: torch.ops._C.ggml_dequantize does not exist, and vLLM's GGUF weight loader (vllm/model_executor/layers/quantization/gguf.py) crashes.
Reproduction (model-free, ~5s) on MetaX C500 / MACA 3.5.3.20
import torch, mcoplib._C
print(hasattr(torch.ops._C, "awq_dequantize")) # True
print(hasattr(torch.ops._C, "ggml_dequantize")) # False <-- bug
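The two-line check above only probes ggml_dequantize. A minimal sketch that extends it to all six ops listed under "Root cause" might look like the following. The helper name missing_ops and the GGML_OPS tuple are hypothetical, introduced only for illustration; the live check assumes torch and mcoplib are importable, as in the reproduction.

```python
from types import SimpleNamespace  # only used for the offline example below

# The six op names listed under "Root cause".
GGML_OPS = (
    "ggml_dequantize",
    "ggml_mul_mat_a8",
    "ggml_mul_mat_vec_a8",
    "ggml_moe_a8",
    "ggml_moe_a8_vec",
    "ggml_moe_get_block_size",
)


def missing_ops(namespace, names=GGML_OPS):
    """Return the names that `namespace` does not expose as attributes."""
    return [name for name in names if not hasattr(namespace, name)]


if __name__ == "__main__":
    try:
        import torch
        import mcoplib._C  # noqa: F401  (importing loads the extension)
    except ImportError:
        # Offline fallback: a stand-in namespace with only one op present.
        fake = SimpleNamespace(ggml_dequantize=object())
        print("no torch/mcoplib here; stand-in result:", missing_ops(fake))
    else:
        print("missing ggml ops:", missing_ops(torch.ops._C))
```

An empty list means every ggml_* op is visible to the dispatcher. The bug described here should show up as a non-empty list.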
Binary evidence:
nm -D mcoplib/_C.abi3.so | grep ggml_dequantize # PRESENT (kernel compiled in)
strings mcoplib/_C.abi3.so | grep 'awq_dequantize(' # PRESENT (registered)
strings mcoplib/_C.abi3.so | grep 'ggml_dequantize(' # ABSENT (NOT registered)
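The strings/grep comparison can be approximated in Python when binutils is not available. This is a rough sketch, not an equivalent of strings: it searches the raw bytes of the file for the op name followed by an opening parenthesis, which is how a schema string passed to ops.def would begin. The function name schema_strings_present is hypothetical.

```python
from pathlib import Path


def schema_strings_present(so_path, op_names):
    """Map each op name to True if b'<name>(' occurs anywhere in the file.

    A bare symbol name (no trailing parenthesis) does not count, which is
    what separates "kernel compiled in" from "schema registered".
    """
    data = Path(so_path).read_bytes()
    return {name: (name.encode() + b"(") in data for name in op_names}
```

Called as schema_strings_present("mcoplib/_C.abi3.so", ["awq_dequantize", "ggml_dequantize"]), the result should mirror the two strings lines above, with a True entry for each op whose schema text is embedded in the binary.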
Affected versions
Present in releases/v0.17.0, releases/v0.18.0, releases/v0.19.0, and master: each has zero ggml_* registrations while still compiling gguf_kernel.cu.
Fix
Add the ggml_* ops.def/ops.impl block to csrc/torch_bindings.cpp (matching upstream vLLM). PR incoming.
Verified on MetaX C500 (MACA 3.5.3, torch 2.8): after registering, a from-source build (USE_PRECOMPILED_KERNEL=0) registers ggml_dequantize and Qwen2.5-0.5B-Instruct-GGUF (q4_k_m) loads and generates correctly.
Note: precompiled mcoplib
The default runtime (USE_PRECOMPILED_KERNEL=1) loads the precompiled mcoplib package, which has the same missing registration. The source fix only covers the USE_PRECOMPILED_KERNEL=0 build; mcoplib needs a matching rebuild for the default path.