[Bugfix] Register GGUF (ggml) ops so --quantization gguf works #263

Open
linkeLi0421 wants to merge 1 commit into MetaX-MACA:master from linkeLi0421:fix-gguf-ggml-op-registration

Conversation

@linkeLi0421

Fixes #262.

Problem

--quantization gguf fails on vllm-metax:

AttributeError: '_OpNamespace' '_C' object has no attribute 'ggml_dequantize'.
Did you mean: 'awq_dequantize'?

csrc/quantization/gguf/gguf_kernel.cu is compiled into _C (it is listed in CMakeLists.txt) and csrc/ops.h declares all six ggml_* functions, but csrc/torch_bindings.cpp never registered them. The AWQ and GPTQ ops have both a compiled symbol and a registered schema; the ggml ops only had the compiled symbol, so torch.ops._C.ggml_dequantize does not exist.

Fix

Add the ggml_* ops.def/ops.impl block (matching upstream vLLM) to csrc/torch_bindings.cpp, after the GPTQ registrations. No kernel changes are needed because the kernels were already built.

Validation

On MetaX C500 (MACA 3.5.3.20, torch 2.8), releases/v0.17.0, from-source build (USE_PRECOMPILED_KERNEL=0):

                                before    after
torch.ops._C.ggml_dequantize    missing   registered
--quantization gguf load        crashes   works

Qwen2.5-0.5B-Instruct-GGUF (q4_k_m) loads and generates coherent output (EN + ZH) after the fix.
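
The before/after check in the table can be scripted. The following is a minimal sketch, not part of this PR: the helper name `missing_ops` is hypothetical, and the list of six op names is taken from the review summary further down this page on the assumption that they are the six ggml_* functions declared in csrc/ops.h. It relies only on the behaviour shown in the traceback above, namely that looking up an unregistered op on the namespace raises AttributeError.

```python
from typing import Iterable, List

# Assumption: these are the six ggml_* ops this PR registers, as listed in
# the review summary on this page.
GGML_OPS = (
    "ggml_dequantize",
    "ggml_mul_mat_vec_a8",
    "ggml_mul_mat_a8",
    "ggml_moe_a8",
    "ggml_moe_a8_vec",
    "ggml_moe_get_block_size",
)


def missing_ops(namespace: object, names: Iterable[str] = GGML_OPS) -> List[str]:
    """Return the op names that cannot be looked up on `namespace`.

    hasattr() returns False when attribute lookup raises AttributeError,
    which is what an unregistered op does in the traceback above.
    """
    return [name for name in names if not hasattr(namespace, name)]
```

With PyTorch installed and the _C extension already loaded by whichever module normally imports it, `missing_ops(torch.ops._C)` would be expected to return an empty list after the fix and a list containing `ggml_dequantize` before it.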

Note for maintainers

The default runtime (USE_PRECOMPILED_KERNEL=1) loads the precompiled mcoplib package, which has the same missing registration. This source fix only reaches the USE_PRECOMPILED_KERNEL=0 build — mcoplib needs the same ggml_* registration restored for the default path. (The kernels are already in mcoplib; only the TORCH_LIBRARY registration is missing.)

The GGUF kernels in csrc/quantization/gguf/gguf_kernel.cu are compiled into
the _C extension (CMakeLists.txt lists gguf_kernel.cu), but their
TORCH_LIBRARY registrations were missing from csrc/torch_bindings.cpp. As a
result torch.ops._C.ggml_dequantize (and the other ggml_* ops) are never
registered with the dispatcher, so loading any GGUF model fails:

    AttributeError: '_OpNamespace' '_C' object has no attribute
    'ggml_dequantize'. Did you mean: 'awq_dequantize'?

awq/gptq kernels register both the symbol and the schema; ggml only had the
symbol. This adds the ggml ops.def/ops.impl block (matching upstream vLLM)
right after the GPTQ registrations, binding the schemas to the already-built
kernels. No kernel code changes.

Verified on a MetaX C500 (MACA 3.5.3): with the ggml ops registered,
Qwen2.5-0.5B-Instruct-GGUF (q4_k_m) loads and generates correctly via
--quantization gguf.
Copilot AI review requested due to automatic review settings May 30, 2026 15:08

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds Torch library bindings for GGML quantization kernels (dequantization, matrix-vector/matrix multiplication, and MoE variants) to the CUDA extension.

Changes:

  • Register ggml_dequantize, ggml_mul_mat_vec_a8, and ggml_mul_mat_a8 CUDA ops.
  • Register ggml_moe_a8 and ggml_moe_a8_vec MoE CUDA ops.
  • Register ggml_moe_get_block_size op (no backend dispatch key).

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request registers several GGML-related operations to the Torch library bindings, including dequantization, matrix-vector multiplication, matrix-matrix multiplication, and Mixture of Experts (MoE) kernels. No review comments were provided, so there is no feedback to address.

Development

Successfully merging this pull request may close these issues.

[Bug] --quantization gguf fails: ggml ops compiled but not registered (torch.ops._C.ggml_dequantize missing)

2 participants