[Bugfix] Register GGUF (ggml) ops so --quantization gguf works #263
Open
linkeLi0421 wants to merge 1 commit into
The GGUF kernels in csrc/quantization/gguf/gguf_kernel.cu are compiled into
the _C extension (CMakeLists.txt lists gguf_kernel.cu), but their
TORCH_LIBRARY registrations were missing from csrc/torch_bindings.cpp. As a
result torch.ops._C.ggml_dequantize (and the other ggml_* ops) are never
registered with the dispatcher, so loading any GGUF model fails:
AttributeError: '_OpNamespace' '_C' object has no attribute
'ggml_dequantize'. Did you mean: 'awq_dequantize'?
awq/gptq kernels register both the symbol and the schema; ggml only had the
symbol. This adds the ggml ops.def/ops.impl block (matching upstream vLLM)
right after the GPTQ registrations, binding the schemas to the already-built
kernels. No kernel code changes.
Verified on a MetaX C500 (MACA 3.5.3): with the ggml ops registered,
Qwen2.5-0.5B-Instruct-GGUF (q4_k_m) loads and generates correctly via
--quantization gguf.
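
The failure mode described above (attribute lookup on the op namespace raising AttributeError) can be probed with a small helper. This is a minimal sketch, not part of the PR: `missing_ops` and `GGML_OPS` are hypothetical names, the six op names are the ones listed in this PR's review summary, and the `SimpleNamespace` objects are stand-ins for `torch.ops._C` before and after the registrations are added.

```python
from types import SimpleNamespace

# Op names taken from this PR's review summary (assumed to be the full set).
GGML_OPS = (
    "ggml_dequantize",
    "ggml_mul_mat_vec_a8",
    "ggml_mul_mat_a8",
    "ggml_moe_a8",
    "ggml_moe_a8_vec",
    "ggml_moe_get_block_size",
)


def missing_ops(namespace, names=GGML_OPS):
    """Return the names that `namespace` does not expose as attributes."""
    # hasattr() is False whenever the lookup raises AttributeError, which is
    # the error reported when a ggml op was never registered.
    return [name for name in names if not hasattr(namespace, name)]


# Stand-ins for the op namespace (hypothetical, no torch needed to run this):
# before the fix only non-ggml ops such as awq_dequantize are present.
before_fix = SimpleNamespace(awq_dequantize=object())
after_fix = SimpleNamespace(**{name: object() for name in GGML_OPS})

print(missing_ops(before_fix))  # every ggml op is reported missing
print(missing_ops(after_fix))   # nothing is reported missing
```

On a real from-source build, the same helper would presumably be called as `missing_ops(torch.ops._C)` once the `_C` extension has been loaded; how that extension gets imported on vllm-metax is not shown in this PR, so treat that call as an assumption.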
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds Torch library bindings for GGML quantization kernels (dequantization, matrix-vector/matrix multiplication, and MoE variants) to the CUDA extension.
Changes:
- Register ggml_dequantize, ggml_mul_mat_vec_a8, and ggml_mul_mat_a8 CUDA ops.
- Register ggml_moe_a8 and ggml_moe_a8_vec MoE CUDA ops.
- Register ggml_moe_get_block_size op (no backend dispatch key).
Code Review
This pull request registers several GGML-related operations to the Torch library bindings, including dequantization, matrix-vector multiplication, matrix-matrix multiplication, and Mixture of Experts (MoE) kernels. No review comments were provided, so there is no feedback to address.
Fixes #262.
Problem
--quantization gguf fails on vllm-metax: csrc/quantization/gguf/gguf_kernel.cu is compiled into _C (it's in CMakeLists.txt) and csrc/ops.h declares all six ggml_* functions, but csrc/torch_bindings.cpp never registered them. AWQ/GPTQ register both symbol and schema; ggml only had the symbol, so torch.ops._C.ggml_dequantize doesn't exist.

Fix
Add the ggml_* ops.def/ops.impl block (matching upstream vLLM) after the GPTQ registrations. No kernel changes: the kernels were already built.

Validation
On MetaX C500 (MACA 3.5.3.20, torch 2.8), releases/v0.17.0, from-source build (USE_PRECOMPILED_KERNEL=0). Checks:
- torch.ops._C.ggml_dequantize
- --quantization gguf load
Qwen2.5-0.5B-Instruct-GGUF (q4_k_m) loads and generates coherent output (EN + ZH) after the fix.

Note for maintainers
The default runtime (USE_PRECOMPILED_KERNEL=1) loads the precompiled mcoplib package, which has the same missing registration. This source fix only reaches the USE_PRECOMPILED_KERNEL=0 build; mcoplib needs the same ggml_* registration restored for the default path. (The kernels are already in mcoplib; only the TORCH_LIBRARY registration is missing.)