Conversation
📝 Walkthrough

This PR introduces comprehensive support for the Qwen3.5 language model, including model configuration, architecture with hybrid attention mechanisms (full attention and GDN layers), tokenization, and a CLI example. It also extends the QNN AOT backend with new lowering patterns for SiLU and RoPE operations.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks (✅ 1 | ❌ 2)

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 12
🧹 Nitpick comments (3)
examples/qwen3_5/main.cpp (2)
55-64: Consider resetting KV cache before inference if multi-turn support is added.

Based on the `ARGeneration` implementation (see `mllm/models/ARGeneration.cpp:445-447`), `chat()` does not reset the KV cache. For single-prompt usage this works, but if you add a loop, you'll need to call `model.kvCache().reset()` between independent prompts to avoid stale cached states affecting new responses.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/qwen3_5/main.cpp` around lines 55 - 64, The example currently calls model.chat(inputs) directly and ARGeneration::chat does not reset the KV cache; to avoid stale state when adding multi-turn or repeated independent prompts, call model.kvCache().reset() before each independent inference (e.g., right before invoking model.chat(inputs) in the try block). Reference the ARGeneration behavior and use model.kvCache().reset() to clear the cache between prompts so each new prompt starts with a fresh KV state.
47-64: UI text is misleading: "Interactive CLI" but only single prompt.

The banner says "Qwen3.5 Interactive CLI" and mentions "Enter 'exit' or 'quit' to end the session", but the code only processes one prompt and exits. This may confuse users expecting a multi-turn session.
Either add a loop for actual interactive behavior or update the messaging.
♻️ Proposed fix for multi-turn interaction
```diff
  fmt::print("\n{:*^60}\n", " Qwen3.5 Interactive CLI ");
  fmt::print("Enter 'exit' or 'quit' to end the session\n\n");
  std::string prompt_text;
+ while (true) {
+   fmt::print("Prompt text (or 'exit/quit'): ");
+   std::getline(std::cin, prompt_text);
+
+   if (prompt_text == "exit" || prompt_text == "quit" || std::cin.eof()) {
+     break;
+   }
- fmt::print("Prompt text (or 'exit/quit'): ");
- std::getline(std::cin, prompt_text);
-
- try {
-   fmt::print("Processing...\n");
-   auto inputs = tokenizer.convertMessage({.prompt = prompt_text});
+   try {
+     fmt::print("Processing...\n");
+     model.kvCache().reset();  // Reset KV cache between turns
+     auto inputs = tokenizer.convertMessage({.prompt = prompt_text});
-   fmt::print("\nResponse: ");
+     fmt::print("\nResponse: ");
-   for (auto& step : model.chat(inputs)) { std::wcout << tokenizer.detokenize(step.cur_token_id) << std::flush; }
+     for (auto& step : model.chat(inputs)) { std::wcout << tokenizer.detokenize(step.cur_token_id) << std::flush; }
-   fmt::print("\n{}\n", std::string(60, '-'));
- } catch (const std::exception& e) { fmt::print("\nError: {}\n{}\n", e.what(), std::string(60, '-')); }
+     fmt::print("\n{}\n", std::string(60, '-'));
+   } catch (const std::exception& e) { fmt::print("\nError: {}\n{}\n", e.what(), std::string(60, '-')); }
+ }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/qwen3_5/main.cpp` around lines 47 - 64, The current UI displays "Qwen3.5 Interactive CLI" and instructs to use 'exit'/'quit' but only reads a single prompt; change to a real interactive loop: wrap the prompt/read/process/response block in a while loop that repeatedly prints "Prompt text (or 'exit/quit'): ", reads prompt_text, breaks when prompt_text is "exit" or "quit", and otherwise calls tokenizer.convertMessage(...) and iterates model.chat(...) to stream responses (use the existing tokenizer.detokenize and step.cur_token_id usage). Ensure exceptions are caught per-iteration and the prompt/response separator (std::string(60,'-')) is printed after each exchange so multi-turn behavior matches the banner.

mllm/models/qwen3_5/modeling_qwen3_5.hpp (1)
537-546: KV cache allocates slots for all layers, but only full attention layers use it.

The `StaticCache` is initialized with `cfg.num_hidden_layers` (all 24 layers), but only the 6 full attention layers use KV caching. GDN layers (line 438) explicitly don't use the cache. This wastes memory allocating 18 unused cache slots.

♻️ Suggested optimization
Consider only allocating cache for full attention layers:
```diff
- kv_cache_ = nn::StaticCache(cfg.max_cache_length, cfg.num_hidden_layers,
+ kv_cache_ = nn::StaticCache(cfg.max_cache_length, cfg.numFullAttentionLayers(),
                              cfg.num_attention_heads,
```

This would also require updating `layer_idx_` in full attention layers to be an index into the reduced cache rather than the global layer index.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/qwen3_5/modeling_qwen3_5.hpp` around lines 537 - 546, The constructor Qwen3_5ForCausalLM currently initializes kv_cache_ with cfg.num_hidden_layers causing slots for GDN layers that never use the cache; change the allocation to use only the number of full-attention layers (compute count of layers where attention is enabled) when constructing nn::StaticCache and adjust any code that sets layer_idx_ in those full-attention layer objects so layer_idx_ becomes an index into this reduced cache rather than the original global layer index (update places that reference layer_idx_ to use the new mapping). Ensure the cache creation uses the same cfg.max_cache_length, cfg.num_attention_heads, cfg.num_key_value_heads, head_dim and types as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/qwen3_5/main.cpp`:
- Around line 30-34: The help-check currently runs after Argparse::parse(argc,
argv) which will fail if required(true) arguments are missing; modify the flow
so help is detected before required-arg validation by either (a) checking argv
for "-h" or "--help" before calling Argparse::parse, or (b) using a parser
option/alternate parse method that allows short-circuiting validation when help
is present; update the code around Argparse::parse(argc, argv), help.isSet(),
and the subsequent Argparse::printHelp()/mllm::shutdownContext() so that
printHelp() is invoked and the program exits without triggering
required-argument errors.
In `@mllm/backends/qnn/aot/visitor/RoPE.cpp`:
- Around line 45-52: The code in RoPE.cpp dereferences op->inputs() and
op->outputs() without checking arity (uses inputs_it, i_0, i_sin, i_cos and
o_0); add explicit validation that op->inputs().size() >= 3 and
op->outputs().size() >= 1 before creating inputs_it or dereferencing std::next,
and handle the error path (return, throw, or log and abort) consistent with
surrounding error handling conventions so you never dereference invalid
iterators for the RoPE operator.
- Around line 57-60: The lowering currently hardcodes pose_type=0 in RoPE.cpp
when calling qnn_op_node->emplaceParamScalar, but the custom kernel expects
pose_type 2 or 4; add a pose_type field to the IR struct/class RoPEOpOptions (or
derive it deterministically from an existing field like rope_theta) and populate
it from the op creation site, then replace the hardcoded
static_cast<uint32_t>(0) in RoPE.cpp with the value from RoPEOpOptions (passed
through into QNNParamScalarWrapper::create("pose_type",
static_cast<uint32_t>(pose_type))). Ensure the new field is
serialized/constructed where RoPEOpOptions is instantiated so the lowering can
read it.
In `@mllm/backends/qnn/aot/visitor/SiLU.cpp`:
- Around line 44-47: The SiLU visitor currently copies quant_recipe only if
present which is inconsistent with other QNN AOT patterns; update the SiLU
handling to require the quantization recipe the same way RMSNorm/Add/Mul/Neg do
by checking for the attribute with MLLM_RETURN_FALSE_IF_NOT (or equivalent) on
op->getAttr("quant_recipe") and then unconditionally set it on sigmoid_out via
sigmoid_out->setAttr("quant_recipe", op->getAttr("quant_recipe")); this enforces
presence of quant metadata and matches existing QNN patterns.
In `@mllm/models/qwen3_5/configuration_qwen3_5.hpp`:
- Around line 115-117: The method isFullAttentionLayer(int layer_idx) performs
unchecked indexing into layer_types; guard against out-of-range and negative
indexes by validating layer_idx before accessing layer_types: check that
layer_idx >= 0 and layer_idx < static_cast<int>(layer_types.size()) and return a
safe default (e.g., false) or handle the error (assert/log/throw) if the index
is invalid; update isFullAttentionLayer to perform this bounds check against
layer_types.size() and then compare layer_types[layer_idx] == "full_attention".
- Around line 34-61: The code computes layer_types using (i + 1) %
full_attention_interval which is unsafe when full_attention_interval <= 0 and it
accepts provided layer_types without validating length or allowed values; update
initialization to first validate full_attention_interval is > 0 (otherwise set
to 1 or return an error) before using it in the loop, and when
tc.contains("layer_types") verify that layer_types.size() == num_hidden_layers
and each entry is one of the allowed strings (e.g., "full_attention" or
"linear_attention")—if validation fails, either normalize/replace invalid
entries or emit an error so downstream helpers do not break. Ensure these checks
reference the variables full_attention_interval, layer_types, and
num_hidden_layers and are performed before deriving the computed layer layout.
In `@mllm/models/qwen3_5/modeling_qwen3_5.hpp`:
- Around line 553-555: The logic for tie_word_embeddings is inverted: instead of
creating and using lm_head_ when cfg.tie_word_embeddings is true, create
lm_head_ only when cfg.tie_word_embeddings is false (or alternatively always
create but copy/alias embedding weights into lm_head_ when
cfg.tie_word_embeddings is true). Update the construction in the constructor
around lm_head_ (currently guarded by cfg.tie_word_embeddings) to be guarded by
!cfg.tie_word_embeddings (or add weight sharing code to copy embedding->lm_head_
when tying), and update the forward usage (the branch keyed by
tie_word_embeddings_) to use lm_head_ only when embeddings are not tied (or use
the shared embedding weights when tying). Ensure references to lm_head_,
cfg.tie_word_embeddings and tie_word_embeddings_ are consistent after the
change.
- Around line 360-363: The attention code omits the softmax step: after
computing attn_weights = matmul(q, k, false, true) and scaling by
1/sqrt(head_k_dim_), apply a softmax over the source sequence dimension (the S
dimension of attn_weights) before multiplying by v; update the sequence that
builds attn_weights (using nn::functional::softmax or equivalent) so
attn_weights becomes a probability distribution prior to the subsequent matmul
with v, keeping existing variables q, k, v, head_k_dim_, and the final output =
matmul(attn_weights, v).
- Around line 339-363: The simplified GDN attention path is missing the causal
mask and softmax, breaking autoregression; in the GDN fallback (where q, k, v
are transposed and attn_weights is computed) apply the same causal masking used
by Qwen3_5FullAttention (e.g., reuse the class CausalMask member or call
mask_(attn_weights)) after scaling and then run softmax before multiplying by v,
ensuring attn_weights becomes a masked softmax(QK^T / sqrt(d)) and preserving
the autoregressive property.
In `@mllm/models/qwen3_5/tokenization_qwen3_5.hpp`:
- Around line 226-229: In convertMessage, guard against std::string::npos
returned by Qwen3_5Message::message_template.find("{{{prompt}}}") before calling
replace: check the found position (pos) and if it is npos handle the missing
placeholder (for example, log/throw/return an error or append the prompt into
the template) instead of calling applied_string.replace(pos, 12,
message.prompt); this prevents undefined behavior when the "{{{prompt}}}" token
is absent.
- Around line 206-211: In detokenize, avoid using
bytes_2_unicode_dict_inverse_[c] (operator[]) which inserts defaults for unknown
wchar_t; instead, validate lookups from bytes_2_unicode_dict_inverse_ for each
wchar_t returned by _detokenize and handle misses explicitly (e.g., use find()
or at() with try/catch) — on missing mapping append a safe UTF-8 replacement
sequence (such as the UTF-8 bytes for U+FFFD or a question mark) to utf_8_str or
return an error; update the function around _detokenize, utf_8_str, and
bytes_2_unicode_dict_inverse_ to perform this check before converting with
mllm::preprocessor::utf8string2WideString.
- Around line 103-113: The lookahead logic for pattern 6 is wrong: after
consuming all whitespace in the while loop inside the tokenization code (the
block that sets start and advances pos), the subsequent check should only accept
the match when we reached end-of-string (trailing whitespace), because next char
after the loop will never be whitespace; update the condition in that block to
test only for pos >= str.size() (remove the std::iswspace(str[pos]) check) so
matched = str.substr(start, pos-start) is returned only for trailing whitespace.
---
Nitpick comments:
In `@examples/qwen3_5/main.cpp`:
- Around line 55-64: The example currently calls model.chat(inputs) directly and
ARGeneration::chat does not reset the KV cache; to avoid stale state when adding
multi-turn or repeated independent prompts, call model.kvCache().reset() before
each independent inference (e.g., right before invoking model.chat(inputs) in
the try block). Reference the ARGeneration behavior and use
model.kvCache().reset() to clear the cache between prompts so each new prompt
starts with a fresh KV state.
- Around line 47-64: The current UI displays "Qwen3.5 Interactive CLI" and
instructs to use 'exit'/'quit' but only reads a single prompt; change to a real
interactive loop: wrap the prompt/read/process/response block in a while loop
that repeatedly prints "Prompt text (or 'exit/quit'): ", reads prompt_text,
breaks when prompt_text is "exit" or "quit", and otherwise calls
tokenizer.convertMessage(...) and iterates model.chat(...) to stream responses
(use the existing tokenizer.detokenize and step.cur_token_id usage). Ensure
exceptions are caught per-iteration and the prompt/response separator
(std::string(60,'-')) is printed after each exchange so multi-turn behavior
matches the banner.
In `@mllm/models/qwen3_5/modeling_qwen3_5.hpp`:
- Around line 537-546: The constructor Qwen3_5ForCausalLM currently initializes
kv_cache_ with cfg.num_hidden_layers causing slots for GDN layers that never use
the cache; change the allocation to use only the number of full-attention layers
(compute count of layers where attention is enabled) when constructing
nn::StaticCache and adjust any code that sets layer_idx_ in those full-attention
layer objects so layer_idx_ becomes an index into this reduced cache rather than
the original global layer index (update places that reference layer_idx_ to use
the new mapping). Ensure the cache creation uses the same cfg.max_cache_length,
cfg.num_attention_heads, cfg.num_key_value_heads, head_dim and types as before.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3d4151fc-e18a-478c-96cc-f69c90a24e1f
📒 Files selected for processing (12)
- examples/CMakeLists.txt
- examples/qwen3_5/CMakeLists.txt
- examples/qwen3_5/main.cpp
- mllm/backends/qnn/aot/passes/LLM2QnnLoweringPass.cpp
- mllm/backends/qnn/aot/visitor/RoPE.cpp
- mllm/backends/qnn/aot/visitor/RoPE.hpp
- mllm/backends/qnn/aot/visitor/SiLU.cpp
- mllm/backends/qnn/aot/visitor/SiLU.hpp
- mllm/models/qwen3_5/configuration_qwen3_5.hpp
- mllm/models/qwen3_5/modeling_qwen3_5.hpp
- mllm/models/qwen3_5/modeling_qwen3_5_qnn_aot.hpp
- mllm/models/qwen3_5/tokenization_qwen3_5.hpp
```cpp
if (help.isSet()) {
  Argparse::printHelp();
  mllm::shutdownContext();
  return 0;
}
```
Help flag check may fail due to required argument validation.
The help flag is checked after Argparse::parse(argc, argv) (line 17), but arguments are marked as required(true). If the user runs with just -h, the parser may fail before reaching the help check.
🛠️ Proposed fix
Consider checking for help before validating required arguments, or ensuring Argparse::parse doesn't error on missing required args when -h is present. A common pattern is:
```diff
+ // Check for help before full parse
+ for (int i = 1; i < argc; ++i) {
+   if (std::string(argv[i]) == "-h" || std::string(argv[i]) == "--help") {
+     Argparse::printHelp();
+     mllm::shutdownContext();
+     return 0;
+   }
+ }
+
  Argparse::parse(argc, argv);
-
- if (help.isSet()) {
-   Argparse::printHelp();
-   mllm::shutdownContext();
-   return 0;
- }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen3_5/main.cpp` around lines 30 - 34, The help-check currently
runs after Argparse::parse(argc, argv) which will fail if required(true)
arguments are missing; modify the flow so help is detected before required-arg
validation by either (a) checking argv for "-h" or "--help" before calling
Argparse::parse, or (b) using a parser option/alternate parse method that allows
short-circuiting validation when help is present; update the code around
Argparse::parse(argc, argv), help.isSet(), and the subsequent
Argparse::printHelp()/mllm::shutdownContext() so that printHelp() is invoked and
the program exits without triggering required-argument errors.
```cpp
auto inputs_it = op->inputs().begin();
auto i_0 = (*inputs_it)->cast_<ir::tensor::TensorValue>();                  // input tensor
auto i_sin = (*std::next(inputs_it))->cast_<ir::tensor::TensorValue>();     // sin embeddings
auto i_cos = (*std::next(inputs_it, 2))->cast_<ir::tensor::TensorValue>();  // cos embeddings

// RoPE output
auto o_0 = op->outputs().front()->cast_<ir::tensor::TensorValue>();
```
Validate RoPE input/output arity before dereferencing iterators.
Line 47, Line 48, and Line 51 assume 3 inputs and 1 output exist. Missing checks here can dereference invalid iterators on malformed IR.
Proposed fix
```diff
  // RoPE inputs: x, sin, cos
+ if (op->inputs().size() < 3 || op->outputs().empty()) {
+   MLLM_ERROR("RoPE lowering expects at least 3 inputs and 1 output");
+   return false;
+ }
  auto inputs_it = op->inputs().begin();
  auto i_0 = (*inputs_it)->cast_<ir::tensor::TensorValue>();                  // input tensor
  auto i_sin = (*std::next(inputs_it))->cast_<ir::tensor::TensorValue>();     // sin embeddings
  auto i_cos = (*std::next(inputs_it, 2))->cast_<ir::tensor::TensorValue>();  // cos embeddings
```

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/backends/qnn/aot/visitor/RoPE.cpp` around lines 45 - 52, The code in
RoPE.cpp dereferences op->inputs() and op->outputs() without checking arity
(uses inputs_it, i_0, i_sin, i_cos and o_0); add explicit validation that
op->inputs().size() >= 3 and op->outputs().size() >= 1 before creating inputs_it
or dereferencing std::next, and handle the error path (return, throw, or log and
abort) consistent with surrounding error handling conventions so you never
dereference invalid iterators for the RoPE operator.
```cpp
// pose_type parameter: 0 for standard RoPE
// The custom HTP op uses this to select between different RoPE variants
qnn_op_node->emplaceParamScalar(mllm::qnn::QNNParamScalarWrapper::create("pose_type", static_cast<uint32_t>(0)));
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
set -euo pipefail
echo "1) Verify RoPE kernel pose_type handling in custom package:"
rg -n -C3 'pose_type|only support pose_type' mllm/backends/qnn/custom-op-package/LLaMAPackage/src/ops/RoPE.cpp
echo
echo "2) Verify current lowering hardcodes pose_type:"
rg -n -C3 'emplaceParamScalar|pose_type' mllm/backends/qnn/aot/visitor/RoPE.cpp
echo
echo "3) Find where RoPE variant is represented in IR/AOp to thread through lowering:"
rg -n -C3 'class .*RoPEOp|pose_type' mllm/core mllm/compile mllm/models
```

Repository: UbiquitousLearning/mllm
Length of output: 6375
🏁 Script executed:
```shell
#!/bin/bash
set -euo pipefail
echo "=== Check RoPEOp struct/class definition ==="
cat -n mllm/core/aops/RoPEOp.hpp
echo
echo "=== Check visitor code to see how RoPEOp variant is extracted ==="
cat -n mllm/backends/qnn/aot/visitor/RoPE.cpp | head -80
```

Repository: UbiquitousLearning/mllm
Length of output: 5148
Add pose_type field to RoPEOpOptions and pass through the correct variant to the custom op kernel.
The hardcoded pose_type = 0 at line 59 will cause a fatal error in the custom HTP op. The kernel (mllm/backends/qnn/custom-op-package/LLaMAPackage/src/ops/RoPE.cpp lines 664–665) only supports pose_type == 2 (LLaMA) or pose_type == 4, and rejects all other values with ErrorFatal. The IR representation (RoPEOpOptions) currently lacks a field to capture the variant, so the lowering cannot pass the correct value. Either add a pose_type field to RoPEOpOptions, or derive it from existing parameters (e.g., rope_theta).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/backends/qnn/aot/visitor/RoPE.cpp` around lines 57 - 60, The lowering
currently hardcodes pose_type=0 in RoPE.cpp when calling
qnn_op_node->emplaceParamScalar, but the custom kernel expects pose_type 2 or 4;
add a pose_type field to the IR struct/class RoPEOpOptions (or derive it
deterministically from an existing field like rope_theta) and populate it from
the op creation site, then replace the hardcoded static_cast<uint32_t>(0) in
RoPE.cpp with the value from RoPEOpOptions (passed through into
QNNParamScalarWrapper::create("pose_type", static_cast<uint32_t>(pose_type))).
Ensure the new field is serialized/constructed where RoPEOpOptions is
instantiated so the lowering can read it.
```cpp
// Copy quantization recipe from output to intermediate if available
if (op->getAttr("quant_recipe")) {
  sigmoid_out->setAttr("quant_recipe", op->getAttr("quant_recipe"));
}
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Check how other QNN AOT patterns handle quant_recipe
rg -n "quant_recipe" mllm/backends/qnn/aot/visitor/ -A2 -B2
```

Repository: UbiquitousLearning/mllm
Length of output: 3254

🏁 Script executed:

```shell
# Get the full SiLU.cpp file to understand the complete context
cat -n mllm/backends/qnn/aot/visitor/SiLU.cpp
```

Repository: UbiquitousLearning/mllm
Length of output: 3503

🏁 Script executed:

```shell
# Check if there are other activation/unary patterns with optional quant_recipe handling
rg "if.*getAttr.*quant_recipe" mllm/backends/qnn/aot/visitor/ -B5 -A5
```

Repository: UbiquitousLearning/mllm
Length of output: 1043

🏁 Script executed:

```shell
# Count pattern frequency - how many patterns use MLLM_RETURN_FALSE_IF_NOT vs optional check
echo "=== Patterns requiring quant_recipe (MLLM_RETURN_FALSE_IF_NOT) ===" && \
rg "MLLM_RETURN_FALSE_IF_NOT.*quant_recipe" mllm/backends/qnn/aot/visitor/ | wc -l && \
echo "=== Patterns with optional quant_recipe (if check) ===" && \
rg "if.*getAttr.*quant_recipe" mllm/backends/qnn/aot/visitor/ | wc -l
```

Repository: UbiquitousLearning/mllm
Length of output: 192
Ensure quantization metadata is handled consistently with other QNN patterns.
SiLU treats quant_recipe as optional (lines 45-47), but all other QNN AOT patterns—RMSNorm, Add, Mul, and Neg—require it via MLLM_RETURN_FALSE_IF_NOT. This inconsistency means intermediate sigmoid_out tensors may lack quantization metadata if the attribute is missing, potentially breaking quantization expectations in QNN deployment.
Either make quant_recipe required here for consistency with other patterns, or document why optional handling is intentional for SiLU decomposition.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/backends/qnn/aot/visitor/SiLU.cpp` around lines 44 - 47, The SiLU
visitor currently copies quant_recipe only if present which is inconsistent with
other QNN AOT patterns; update the SiLU handling to require the quantization
recipe the same way RMSNorm/Add/Mul/Neg do by checking for the attribute with
MLLM_RETURN_FALSE_IF_NOT (or equivalent) on op->getAttr("quant_recipe") and then
unconditionally set it on sigmoid_out via sigmoid_out->setAttr("quant_recipe",
op->getAttr("quant_recipe")); this enforces presence of quant metadata and
matches existing QNN patterns.
```cpp
full_attention_interval = tc.value("full_attention_interval", 4);

// GDN (Gated Delta Network) parameters
linear_num_key_heads = tc.value("linear_num_key_heads", 16);
linear_num_value_heads = tc.value("linear_num_value_heads", 16);
linear_key_head_dim = tc.value("linear_key_head_dim", 128);
linear_value_head_dim = tc.value("linear_value_head_dim", 128);
linear_conv_kernel_dim = tc.value("linear_conv_kernel_dim", 4);

// RoPE parameters (nested under rope_parameters)
if (tc.contains("rope_parameters")) {
  auto& rp = tc["rope_parameters"];
  rope_theta = rp.value("rope_theta", 10000000.0f);
  partial_rotary_factor = rp.value("partial_rotary_factor", 0.25f);
}

// Layer types: explicit list or computed from full_attention_interval
if (tc.contains("layer_types")) {
  for (auto& lt : tc["layer_types"]) { layer_types.push_back(lt.get<std::string>()); }
} else {
  for (int i = 0; i < num_hidden_layers; ++i) {
    if ((i + 1) % full_attention_interval == 0) {
      layer_types.push_back("full_attention");
    } else {
      layer_types.push_back("linear_attention");
    }
  }
}
```
Validate full_attention_interval and layer_types before deriving layer layout.
At Line 55, (i + 1) % full_attention_interval is unsafe when full_attention_interval <= 0. Also, when layer_types is provided (Line 51-53), there is no validation for allowed values or expected length vs num_hidden_layers, which can break downstream helpers.
Proposed fix
```diff
  // Qwen3.5 hybrid attention
  attn_output_gate = tc.value("attn_output_gate", true);
  full_attention_interval = tc.value("full_attention_interval", 4);
+ if (full_attention_interval <= 0) {
+   throw std::invalid_argument("full_attention_interval must be > 0");
+ }
@@
  if (tc.contains("layer_types")) {
    for (auto& lt : tc["layer_types"]) { layer_types.push_back(lt.get<std::string>()); }
+   if (static_cast<int32_t>(layer_types.size()) != num_hidden_layers) {
+     throw std::invalid_argument("layer_types length must equal num_hidden_layers");
+   }
+   for (const auto& lt : layer_types) {
+     if (lt != "full_attention" && lt != "linear_attention") {
+       throw std::invalid_argument("layer_types must be 'full_attention' or 'linear_attention'");
+     }
+   }
  } else {
```

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/qwen3_5/configuration_qwen3_5.hpp` around lines 34 - 61, The code
computes layer_types using (i + 1) % full_attention_interval which is unsafe
when full_attention_interval <= 0 and it accepts provided layer_types without
validating length or allowed values; update initialization to first validate
full_attention_interval is > 0 (otherwise set to 1 or return an error) before
using it in the loop, and when tc.contains("layer_types") verify that
layer_types.size() == num_hidden_layers and each entry is one of the allowed
strings (e.g., "full_attention" or "linear_attention")—if validation fails,
either normalize/replace invalid entries or emit an error so downstream helpers
do not break. Ensure these checks reference the variables
full_attention_interval, layer_types, and num_hidden_layers and are performed
before deriving the computed layer layout.
```cpp
auto attn_weights = nn::functional::matmul(q, k, false, true);  // [B, H, S, S]
auto scale = 1.f / sqrtf(static_cast<float>(head_k_dim_));
attn_weights = attn_weights * scale;
auto output = nn::functional::matmul(attn_weights, v);  // [B, H, S, Dv]
```
Missing softmax in GDN attention computation.
The comment at line 353 describes softmax(Q K^T / sqrt(d)) V, but the code only computes (Q K^T / sqrt(d)) V without applying softmax. This produces unbounded attention weights instead of a probability distribution.
🐛 Proposed fix
```diff
  auto attn_weights = nn::functional::matmul(q, k, false, true);  // [B, H, S, S]
  auto scale = 1.f / sqrtf(static_cast<float>(head_k_dim_));
  attn_weights = attn_weights * scale;
+ // TODO: Add causal mask here for autoregressive behavior
+ attn_weights = nn::functional::softmax(attn_weights, -1);
  auto output = nn::functional::matmul(attn_weights, v);  // [B, H, S, Dv]
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
auto attn_weights = nn::functional::matmul(q, k, false, true);  // [B, H, S, S]
auto scale = 1.f / sqrtf(static_cast<float>(head_k_dim_));
attn_weights = attn_weights * scale;
// TODO: Add causal mask here for autoregressive behavior
attn_weights = nn::functional::softmax(attn_weights, -1);
auto output = nn::functional::matmul(attn_weights, v);  // [B, H, S, Dv]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/qwen3_5/modeling_qwen3_5.hpp` around lines 360 - 363, The
attention code omits the softmax step: after computing attn_weights = matmul(q,
k, false, true) and scaling by 1/sqrt(head_k_dim_), apply a softmax over the
source sequence dimension (the S dimension of attn_weights) before multiplying
by v; update the sequence that builds attn_weights (using
nn::functional::softmax or equivalent) so attn_weights becomes a probability
distribution prior to the subsequent matmul with v, keeping existing variables
q, k, v, head_k_dim_, and the final output = matmul(attn_weights, v).
```cpp
if (cfg.tie_word_embeddings) {
  lm_head_ = reg<nn::Linear>("lm_head_out", cfg.hidden_size, cfg.vocab_size, false, cfg.linear_impl_type);
}
```
Inverted logic for tie_word_embeddings.
The standard meaning of tie_word_embeddings=true is that the embedding weights are shared with the output projection, so no separate lm_head is needed. The code appears to have inverted logic:
- Line 553-555: Creates `lm_head_` only when `tie_word_embeddings` is true
- Line 591: Uses `lm_head_` only when `tie_word_embeddings_` is true
This means when embeddings are tied, a separate projection is created and used, which defeats the purpose of tying.
🐛 Proposed fix
```diff
- if (cfg.tie_word_embeddings) {
+ if (!cfg.tie_word_embeddings) {
    lm_head_ = reg<nn::Linear>("lm_head_out", cfg.hidden_size, cfg.vocab_size, false, cfg.linear_impl_type);
  }
```

And for the forward pass:

```diff
- if (tie_word_embeddings_) { sequence = lm_head_(sequence); }
+ if (tie_word_embeddings_) {
+   // Reuse embedding weights transposed as output projection
+   sequence = nn::functional::linear(sequence, llm.language_model_.embedding_.weight(), std::nullopt);
+ } else {
+   sequence = lm_head_(sequence);
+ }
```

Or alternatively, always create `lm_head_` but load embedding weights into it when tying.
Also applies to: 591-591
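For reference, this is what weight tying means in a toy setting: the logits come from the embedding matrix transposed, so no separate lm_head weight exists at all. The helper and dimensions (hidden=2, vocab=3) are invented for illustration, not mllm API.

```cpp
#include <array>

// With tied embeddings, logits[v] is the dot product of the hidden state
// with embedding row v — i.e. hidden · E^T, reusing E as the unembedding.
template <int kVocab, int kHidden>
std::array<float, kVocab> tiedLogits(
    const std::array<std::array<float, kHidden>, kVocab>& embedding,
    const std::array<float, kHidden>& hidden) {
  std::array<float, kVocab> logits{};
  for (int v = 0; v < kVocab; ++v)
    for (int h = 0; h < kHidden; ++h)
      logits[v] += hidden[h] * embedding[v][h];
  return logits;
}
```

This is why creating a separate projection when `tie_word_embeddings` is true defeats the purpose: the projection weights should literally be the embedding rows.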
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/qwen3_5/modeling_qwen3_5.hpp` around lines 553 - 555, The logic
for tie_word_embeddings is inverted: instead of creating and using lm_head_ when
cfg.tie_word_embeddings is true, create lm_head_ only when
cfg.tie_word_embeddings is false (or alternatively always create but copy/alias
embedding weights into lm_head_ when cfg.tie_word_embeddings is true). Update
the construction in the constructor around lm_head_ (currently guarded by
cfg.tie_word_embeddings) to be guarded by !cfg.tie_word_embeddings (or add
weight sharing code to copy embedding->lm_head_ when tying), and update the
forward usage (the branch keyed by tie_word_embeddings_) to use lm_head_ only
when embeddings are not tied (or use the shared embedding weights when tying).
Ensure references to lm_head_, cfg.tie_word_embeddings and tie_word_embeddings_
are consistent after the change.
```cpp
// 6. Match \s+(?!\S)
if (std::iswspace(str[pos])) {
  size_t start = pos;
  while (pos < str.size() && std::iswspace(str[pos])) ++pos;
  if (pos >= str.size() || std::iswspace(str[pos])) {
    matched = str.substr(start, pos - start);
    return true;
  } else {
    pos = start;
  }
}
```
Logic issue in pattern 6: lookahead (?!\S) is incorrectly implemented.
The comment says "Match \s+(?!\S)" (whitespace not followed by non-whitespace), but the condition pos >= str.size() || std::iswspace(str[pos]) is checked after consuming the whitespace. This means if the next char is non-whitespace, the match is abandoned and pos is reset, which is correct. However, the check std::iswspace(str[pos]) at line 107 will always be false because the while loop at line 106 already consumed all whitespace.
🔧 Proposed fix
The condition should check that we're at end-of-string OR the next character after whitespace is also whitespace (which can't happen since we consumed all whitespace). The intent seems to be matching trailing whitespace. Consider:
```diff
 // 6. Match \s+(?!\S) — trailing whitespace not followed by non-whitespace
 if (std::iswspace(str[pos])) {
   size_t start = pos;
   while (pos < str.size() && std::iswspace(str[pos])) ++pos;
-  if (pos >= str.size() || std::iswspace(str[pos])) {
+  if (pos >= str.size()) {
+    // Only match if this whitespace is at end of string (true trailing whitespace)
     matched = str.substr(start, pos - start);
     return true;
-  } else {
+  }
+  // Otherwise, fall through to pattern 7 for general whitespace
   pos = start;
-  }
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 // 6. Match \s+(?!\S)
 if (std::iswspace(str[pos])) {
   size_t start = pos;
   while (pos < str.size() && std::iswspace(str[pos])) ++pos;
-  if (pos >= str.size() || std::iswspace(str[pos])) {
+  if (pos >= str.size()) {
+    // Only match if this whitespace is at end of string (true trailing whitespace)
     matched = str.substr(start, pos - start);
     return true;
-  } else {
-    pos = start;
-  }
+  }
+  // Otherwise, fall through to pattern 7 for general whitespace
+  pos = start;
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/qwen3_5/tokenization_qwen3_5.hpp` around lines 103 - 113, The
lookahead logic for pattern 6 is wrong: after consuming all whitespace in the
while loop inside the tokenization code (the block that sets start and advances
pos), the subsequent check should only accept the match when we reached
end-of-string (trailing whitespace), because next char after the loop will never
be whitespace; update the condition in that block to test only for pos >=
str.size() (remove the std::iswspace(str[pos]) check) so matched =
str.substr(start, pos-start) is returned only for trailing whitespace.
```cpp
std::wstring detokenize(int64_t pos_idx) override {
  auto str = _detokenize(pos_idx);
  std::string utf_8_str;
  for (wchar_t c : str) { utf_8_str.push_back((unsigned char)(bytes_2_unicode_dict_inverse_[c])); }
  return {mllm::preprocessor::utf8string2WideString(utf_8_str)};
}
```
Missing error handling for unknown characters in detokenize.
If a wchar_t from the BPE output is not found in bytes_2_unicode_dict_inverse_, std::unordered_map::operator[] returns a default-initialized value (0), which will insert a null byte into the UTF-8 string. This could cause silent truncation or corruption.
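The difference is easy to demonstrate in isolation: on a missing key, `operator[]` default-constructs *and inserts* a value (0 here), silently growing the map, whereas `find()` leaves the map untouched and makes the miss explicit. The helper below is a sketch of the safe pattern, not mllm API.

```cpp
#include <unordered_map>

// Looks up a byte mapping without mutating the map; returns false on a miss.
inline bool lookupByte(const std::unordered_map<wchar_t, unsigned char>& m,
                       wchar_t c, unsigned char& out) {
  auto it = m.find(c);
  if (it == m.end()) return false;  // explicit miss, no insertion
  out = it->second;
  return true;
}
```

Note also that `operator[]` requires a non-const map precisely because it may insert, which is another signal that it is the wrong tool for a pure lookup.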
🛡️ Proposed fix with validation
```diff
 std::wstring detokenize(int64_t pos_idx) override {
   auto str = _detokenize(pos_idx);
   std::string utf_8_str;
-  for (wchar_t c : str) { utf_8_str.push_back((unsigned char)(bytes_2_unicode_dict_inverse_[c])); }
+  for (wchar_t c : str) {
+    auto it = bytes_2_unicode_dict_inverse_.find(c);
+    if (it != bytes_2_unicode_dict_inverse_.end()) {
+      utf_8_str.push_back(static_cast<unsigned char>(it->second));
+    }
+    // Silently skip unmapped characters, or optionally log a warning
+  }
   return {mllm::preprocessor::utf8string2WideString(utf_8_str)};
 }
```

🤖 Prompt for AI Agents
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/qwen3_5/tokenization_qwen3_5.hpp` around lines 206 - 211, In
detokenize, avoid using bytes_2_unicode_dict_inverse_[c] (operator[]) which
inserts defaults for unknown wchar_t; instead, validate lookups from
bytes_2_unicode_dict_inverse_ for each wchar_t returned by _detokenize and
handle misses explicitly (e.g., use find() or at() with try/catch) — on missing
mapping append a safe UTF-8 replacement sequence (such as the UTF-8 bytes for
U+FFFD or a question mark) to utf_8_str or return an error; update the function
around _detokenize, utf_8_str, and bytes_2_unicode_dict_inverse_ to perform this
check before converting with mllm::preprocessor::utf8string2WideString.
```cpp
ARGenerationOutputPast convertMessage(const Qwen3_5Message& message) {
  auto applied_string = Qwen3_5Message::message_template;
  size_t pos = applied_string.find("{{{prompt}}}");
  applied_string.replace(pos, 12, message.prompt);
```
Potential undefined behavior if template placeholder is missing.
If message_template does not contain "{{{prompt}}}", find() returns std::string::npos, and calling replace(npos, 12, ...) throws std::out_of_range (basic_string::replace throws whenever pos > size()), crashing the process if the exception is unhandled.
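A guarded version of the substitution can be written as a small helper; the name `applyPrompt` and the error message are hypothetical, not part of the mllm codebase. Using `placeholder.size()` instead of the magic number 12 also keeps the lengths in sync.

```cpp
#include <stdexcept>
#include <string>

// Substitutes the prompt into the template, failing loudly if the
// placeholder is absent rather than calling replace() with npos.
inline std::string applyPrompt(std::string tmpl, const std::string& prompt) {
  const std::string placeholder = "{{{prompt}}}";
  size_t pos = tmpl.find(placeholder);
  if (pos == std::string::npos)
    throw std::runtime_error("template missing {{{prompt}}} placeholder");
  tmpl.replace(pos, placeholder.size(), prompt);
  return tmpl;
}
```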
🛡️ Proposed fix with validation
```diff
 ARGenerationOutputPast convertMessage(const Qwen3_5Message& message) {
   auto applied_string = Qwen3_5Message::message_template;
   size_t pos = applied_string.find("{{{prompt}}}");
+  if (pos == std::string::npos) {
+    throw std::runtime_error("Message template missing {{{prompt}}} placeholder");
+  }
   applied_string.replace(pos, 12, message.prompt);
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 ARGenerationOutputPast convertMessage(const Qwen3_5Message& message) {
   auto applied_string = Qwen3_5Message::message_template;
   size_t pos = applied_string.find("{{{prompt}}}");
+  if (pos == std::string::npos) {
+    throw std::runtime_error("Message template missing {{{prompt}}} placeholder");
+  }
   applied_string.replace(pos, 12, message.prompt);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/qwen3_5/tokenization_qwen3_5.hpp` around lines 226 - 229, In
convertMessage, guard against std::string::npos returned by
Qwen3_5Message::message_template.find("{{{prompt}}}") before calling replace:
check the found position (pos) and if it is npos handle the missing placeholder
(for example, log/throw/return an error or append the prompt into the template)
instead of calling applied_string.replace(pos, 12, message.prompt); this
prevents undefined behavior when the "{{{prompt}}}" token is absent.
…ect inference

Replace simplified linear attention placeholder with complete Gated Delta Network:
- Manual causal depthwise Conv1D for prefill and decode (framework Conv1D had stride issues with transposed input)
- Sequential scan with delta rule: decay/beta gating from A_log/dt_bias, L2 normalization on q/k, query scaling by 1/sqrt(d_k)
- Conv state and recurrent state management across decode steps
- Standard RMSNorm (no add_unit_offset) matching Qwen3_5RMSNormGated
- EOS set to im_end_token_id for proper chat termination

Verified output matches HuggingFace reference for "Hello world" and math prompts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d model

Implements the full QNN ahead-of-time compilation and quantization pipeline for Qwen3.5's hybrid architecture (18 GDN + 6 full attention layers). Only the 6 full attention layers are compiled to QNN; GDN layers stay on CPU.

Python quantization pipeline:
- modeling_qwen3_5.py: PyTorch model with QDQ nodes for full attention layers
- runner.py: Qwen3_5Quantizer with calibrate/convert/save workflow
- train.py: CLI entry point for end-to-end quantization

C++ AOT compile & runtime:
- compile.cpp: traces 6 attention layers x 2 seq lengths into 12 QNN graphs
- modeling_qwen3_5_qnn_aot.hpp: per-layer QNN trace model with partial RoPE, output gating, and Conv2D LPBQ w4a16 quantization
- aot_run.cpp: hybrid CPU+QNN runtime with per-layer dispatch
- convert_weights.py: pre-bakes +1.0 RMSNorm offset and RoPE tables

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Please check Guidelines for Contributing.
Summary by CodeRabbit