
Added Qwen3.5 0.8B #657

Draft
FarmersWrap wants to merge 3 commits into UbiquitousLearning:main from FarmersWrap:feat/qwen3.5-qnn

Conversation


@FarmersWrap FarmersWrap commented Mar 22, 2026

Please check Guidelines for Contributing.

Summary by CodeRabbit

  • New Features
    • Added support for Qwen3.5 model with hybrid attention architecture combining linear-attention GDN layers and full-attention blocks.
    • Introduced interactive CLI example for loading and chatting with Qwen3.5.
    • Added tokenizer implementation for Qwen3.5 with SentencePiece BPE integration.
    • Extended QNN AOT optimization with RoPE and SiLU operation handlers.


coderabbitai bot commented Mar 22, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d7893796-2325-4e49-bb38-22d788da2387

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough


This PR introduces comprehensive support for the Qwen3.5 language model, including model configuration, architecture with hybrid attention mechanisms (full attention and GDN layers), tokenization, and a CLI example. It also extends the QNN AOT backend with new lowering patterns for SiLU and RoPE operations.

Changes

Cohort / File(s) Summary
QNN AOT Lowering Visitors
mllm/backends/qnn/aot/passes/LLM2QnnLoweringPass.cpp, mllm/backends/qnn/aot/visitor/SiLU.*, mllm/backends/qnn/aot/visitor/RoPE.*
Added QNN AOT pattern handlers for SiLU and RoPE operations. SiLU decomposes into Sigmoid and ElementWiseMultiply nodes; RoPE creates a custom HTP node with pose_type parameter. Patterns are registered in the lowering pass registry.
Qwen3.5 Model Architecture
mllm/models/qwen3_5/configuration_qwen3_5.hpp, mllm/models/qwen3_5/modeling_qwen3_5.hpp, mllm/models/qwen3_5/modeling_qwen3_5_qnn_aot.hpp
Introduced Qwen3.5 configuration parsing (hybrid attention settings, partial RoPE, GDN parameters) and model implementation featuring 18 GDN layers with linear-attention-style computation and 6 full-attention layers with output gating, plus helper routines for partial RoPE rotation.
Qwen3.5 Tokenization
mllm/models/qwen3_5/tokenization_qwen3_5.hpp
Added Qwen3.5-compatible tokenizer with regex-based splitting, BPE encoding/decoding, special token handling, and message template conversion for chat inputs.
Qwen3.5 Example Application
examples/qwen3_5/main.cpp, examples/qwen3_5/CMakeLists.txt, examples/CMakeLists.txt
Added interactive CLI example (mllm-qwen3-5-runner) for loading Qwen3.5 models, supporting streaming chat generation, Perfetto profiling, and memory reporting.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

  • liang1232018
  • chenghuaWang
  • oreomaker

Poem

🐰 A Qwen hops forth with hybrid grace,
Full attention and GDN intertwine,
RoPE spins while SiLU graces the place,
QNN patterns align so fine,
Chat flows stream, memories trace! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check ⚠️ Warning — The PR description only contains the contribution template and a link to guidelines, without any implementation details, objectives, or actual description content. Resolution: provide a detailed description explaining the Qwen3.5 addition, including key features (hybrid attention, GDN layers, partial RoPE), QNN AOT support, and any breaking changes or migration notes.
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 10.20%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
  • Title check ✅ Passed — The title 'Added Qwen3.5 0.8B' directly matches the main change: adding comprehensive support for the Qwen3.5 model including configuration, tokenizer, modeling, and example code.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Tip

You can make CodeRabbit's review stricter and more nitpicky using the `assertive` profile, if that's what you prefer.

Change the reviews.profile setting to assertive to make CodeRabbit nitpick more issues in your PRs.

@FarmersWrap FarmersWrap marked this pull request as draft March 22, 2026 00:50

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

🧹 Nitpick comments (3)
examples/qwen3_5/main.cpp (2)

55-64: Consider resetting KV cache before inference if multi-turn support is added.

Based on the ARGeneration implementation (see mllm/models/ARGeneration.cpp:445-447), chat() does not reset the KV cache. For single-prompt usage this works, but if you add a loop, you'll need to call model.kvCache().reset() between independent prompts to avoid stale cached states affecting new responses.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/qwen3_5/main.cpp` around lines 55 - 64, The example currently calls
model.chat(inputs) directly and ARGeneration::chat does not reset the KV cache;
to avoid stale state when adding multi-turn or repeated independent prompts,
call model.kvCache().reset() before each independent inference (e.g., right
before invoking model.chat(inputs) in the try block). Reference the ARGeneration
behavior and use model.kvCache().reset() to clear the cache between prompts so
each new prompt starts with a fresh KV state.

47-64: UI text is misleading: "Interactive CLI" but only single prompt.

The banner says "Qwen3.5 Interactive CLI" and mentions "Enter 'exit' or 'quit' to end the session", but the code only processes one prompt and exits. This may confuse users expecting a multi-turn session.

Either add a loop for actual interactive behavior or update the messaging.

♻️ Proposed fix for multi-turn interaction
     fmt::print("\n{:*^60}\n", " Qwen3.5 Interactive CLI ");
     fmt::print("Enter 'exit' or 'quit' to end the session\n\n");

     std::string prompt_text;
+    while (true) {
+      fmt::print("Prompt text (or 'exit/quit'): ");
+      std::getline(std::cin, prompt_text);
+
+      if (prompt_text == "exit" || prompt_text == "quit" || std::cin.eof()) {
+        break;
+      }

-    fmt::print("Prompt text (or 'exit/quit'): ");
-    std::getline(std::cin, prompt_text);
-
-    try {
-      fmt::print("Processing...\n");
-      auto inputs = tokenizer.convertMessage({.prompt = prompt_text});
+      try {
+        fmt::print("Processing...\n");
+        model.kvCache().reset();  // Reset KV cache between turns
+        auto inputs = tokenizer.convertMessage({.prompt = prompt_text});

-      fmt::print("\nResponse: ");
+        fmt::print("\nResponse: ");

-      for (auto& step : model.chat(inputs)) { std::wcout << tokenizer.detokenize(step.cur_token_id) << std::flush; }
+        for (auto& step : model.chat(inputs)) { std::wcout << tokenizer.detokenize(step.cur_token_id) << std::flush; }

-      fmt::print("\n{}\n", std::string(60, '-'));
-    } catch (const std::exception& e) { fmt::print("\nError: {}\n{}\n", e.what(), std::string(60, '-')); }
+        fmt::print("\n{}\n", std::string(60, '-'));
+      } catch (const std::exception& e) { fmt::print("\nError: {}\n{}\n", e.what(), std::string(60, '-')); }
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/qwen3_5/main.cpp` around lines 47 - 64, The current UI displays
"Qwen3.5 Interactive CLI" and instructs to use 'exit'/'quit' but only reads a
single prompt; change to a real interactive loop: wrap the
prompt/read/process/response block in a while loop that repeatedly prints
"Prompt text (or 'exit/quit'): ", reads prompt_text, breaks when prompt_text is
"exit" or "quit", and otherwise calls tokenizer.convertMessage(...) and iterates
model.chat(...) to stream responses (use the existing tokenizer.detokenize and
step.cur_token_id usage). Ensure exceptions are caught per-iteration and the
prompt/response separator (std::string(60,'-')) is printed after each exchange
so multi-turn behavior matches the banner.
mllm/models/qwen3_5/modeling_qwen3_5.hpp (1)

537-546: KV cache allocates slots for all layers, but only full attention layers use it.

The StaticCache is initialized with cfg.num_hidden_layers (all 24 layers), but only the 6 full attention layers use KV caching. GDN layers (line 438) explicitly don't use the cache. This wastes memory allocating 18 unused cache slots.

♻️ Suggested optimization

Consider only allocating cache for full attention layers:

-    kv_cache_ = nn::StaticCache(cfg.max_cache_length, cfg.num_hidden_layers,
+    kv_cache_ = nn::StaticCache(cfg.max_cache_length, cfg.numFullAttentionLayers(),
                                 cfg.num_attention_heads,

This would also require updating layer_idx_ in full attention layers to be an index into the reduced cache rather than the global layer index.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/models/qwen3_5/modeling_qwen3_5.hpp` around lines 537 - 546, The
constructor Qwen3_5ForCausalLM currently initializes kv_cache_ with
cfg.num_hidden_layers causing slots for GDN layers that never use the cache;
change the allocation to use only the number of full-attention layers (compute
count of layers where attention is enabled) when constructing nn::StaticCache
and adjust any code that sets layer_idx_ in those full-attention layer objects
so layer_idx_ becomes an index into this reduced cache rather than the original
global layer index (update places that reference layer_idx_ to use the new
mapping). Ensure the cache creation uses the same cfg.max_cache_length,
cfg.num_attention_heads, cfg.num_key_value_heads, head_dim and types as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/qwen3_5/main.cpp`:
- Around line 30-34: The help-check currently runs after Argparse::parse(argc,
argv) which will fail if required(true) arguments are missing; modify the flow
so help is detected before required-arg validation by either (a) checking argv
for "-h" or "--help" before calling Argparse::parse, or (b) using a parser
option/alternate parse method that allows short-circuiting validation when help
is present; update the code around Argparse::parse(argc, argv), help.isSet(),
and the subsequent Argparse::printHelp()/mllm::shutdownContext() so that
printHelp() is invoked and the program exits without triggering
required-argument errors.

In `@mllm/backends/qnn/aot/visitor/RoPE.cpp`:
- Around line 45-52: The code in RoPE.cpp dereferences op->inputs() and
op->outputs() without checking arity (uses inputs_it, i_0, i_sin, i_cos and
o_0); add explicit validation that op->inputs().size() >= 3 and
op->outputs().size() >= 1 before creating inputs_it or dereferencing std::next,
and handle the error path (return, throw, or log and abort) consistent with
surrounding error handling conventions so you never dereference invalid
iterators for the RoPE operator.
- Around line 57-60: The lowering currently hardcodes pose_type=0 in RoPE.cpp
when calling qnn_op_node->emplaceParamScalar, but the custom kernel expects
pose_type 2 or 4; add a pose_type field to the IR struct/class RoPEOpOptions (or
derive it deterministically from an existing field like rope_theta) and populate
it from the op creation site, then replace the hardcoded
static_cast<uint32_t>(0) in RoPE.cpp with the value from RoPEOpOptions (passed
through into QNNParamScalarWrapper::create("pose_type",
static_cast<uint32_t>(pose_type))). Ensure the new field is
serialized/constructed where RoPEOpOptions is instantiated so the lowering can
read it.

In `@mllm/backends/qnn/aot/visitor/SiLU.cpp`:
- Around line 44-47: The SiLU visitor currently copies quant_recipe only if
present which is inconsistent with other QNN AOT patterns; update the SiLU
handling to require the quantization recipe the same way RMSNorm/Add/Mul/Neg do
by checking for the attribute with MLLM_RETURN_FALSE_IF_NOT (or equivalent) on
op->getAttr("quant_recipe") and then unconditionally set it on sigmoid_out via
sigmoid_out->setAttr("quant_recipe", op->getAttr("quant_recipe")); this enforces
presence of quant metadata and matches existing QNN patterns.

In `@mllm/models/qwen3_5/configuration_qwen3_5.hpp`:
- Around line 115-117: The method isFullAttentionLayer(int layer_idx) performs
unchecked indexing into layer_types; guard against out-of-range and negative
indexes by validating layer_idx before accessing layer_types: check that
layer_idx >= 0 and layer_idx < static_cast<int>(layer_types.size()) and return a
safe default (e.g., false) or handle the error (assert/log/throw) if the index
is invalid; update isFullAttentionLayer to perform this bounds check against
layer_types.size() and then compare layer_types[layer_idx] == "full_attention".
- Around line 34-61: The code computes layer_types using (i + 1) %
full_attention_interval which is unsafe when full_attention_interval <= 0 and it
accepts provided layer_types without validating length or allowed values; update
initialization to first validate full_attention_interval is > 0 (otherwise set
to 1 or return an error) before using it in the loop, and when
tc.contains("layer_types") verify that layer_types.size() == num_hidden_layers
and each entry is one of the allowed strings (e.g., "full_attention" or
"linear_attention")—if validation fails, either normalize/replace invalid
entries or emit an error so downstream helpers do not break. Ensure these checks
reference the variables full_attention_interval, layer_types, and
num_hidden_layers and are performed before deriving the computed layer layout.

In `@mllm/models/qwen3_5/modeling_qwen3_5.hpp`:
- Around line 553-555: The logic for tie_word_embeddings is inverted: instead of
creating and using lm_head_ when cfg.tie_word_embeddings is true, create
lm_head_ only when cfg.tie_word_embeddings is false (or alternatively always
create but copy/alias embedding weights into lm_head_ when
cfg.tie_word_embeddings is true). Update the construction in the constructor
around lm_head_ (currently guarded by cfg.tie_word_embeddings) to be guarded by
!cfg.tie_word_embeddings (or add weight sharing code to copy embedding->lm_head_
when tying), and update the forward usage (the branch keyed by
tie_word_embeddings_) to use lm_head_ only when embeddings are not tied (or use
the shared embedding weights when tying). Ensure references to lm_head_,
cfg.tie_word_embeddings and tie_word_embeddings_ are consistent after the
change.
- Around line 360-363: The attention code omits the softmax step: after
computing attn_weights = matmul(q, k, false, true) and scaling by
1/sqrt(head_k_dim_), apply a softmax over the source sequence dimension (the S
dimension of attn_weights) before multiplying by v; update the sequence that
builds attn_weights (using nn::functional::softmax or equivalent) so
attn_weights becomes a probability distribution prior to the subsequent matmul
with v, keeping existing variables q, k, v, head_k_dim_, and the final output =
matmul(attn_weights, v).
- Around line 339-363: The simplified GDN attention path is missing the causal
mask and softmax, breaking autoregression; in the GDN fallback (where q, k, v
are transposed and attn_weights is computed) apply the same causal masking used
by Qwen3_5FullAttention (e.g., reuse the class CausalMask member or call
mask_(attn_weights)) after scaling and then run softmax before multiplying by v,
ensuring attn_weights becomes a masked softmax(QK^T / sqrt(d)) and preserving
the autoregressive property.

In `@mllm/models/qwen3_5/tokenization_qwen3_5.hpp`:
- Around line 226-229: In convertMessage, guard against std::string::npos
returned by Qwen3_5Message::message_template.find("{{{prompt}}}") before calling
replace: check the found position (pos) and if it is npos handle the missing
placeholder (for example, log/throw/return an error or append the prompt into
the template) instead of calling applied_string.replace(pos, 12,
message.prompt); this prevents undefined behavior when the "{{{prompt}}}" token
is absent.
- Around line 206-211: In detokenize, avoid using
bytes_2_unicode_dict_inverse_[c] (operator[]) which inserts defaults for unknown
wchar_t; instead, validate lookups from bytes_2_unicode_dict_inverse_ for each
wchar_t returned by _detokenize and handle misses explicitly (e.g., use find()
or at() with try/catch) — on missing mapping append a safe UTF-8 replacement
sequence (such as the UTF-8 bytes for U+FFFD or a question mark) to utf_8_str or
return an error; update the function around _detokenize, utf_8_str, and
bytes_2_unicode_dict_inverse_ to perform this check before converting with
mllm::preprocessor::utf8string2WideString.
- Around line 103-113: The lookahead logic for pattern 6 is wrong: after
consuming all whitespace in the while loop inside the tokenization code (the
block that sets start and advances pos), the subsequent check should only accept
the match when we reached end-of-string (trailing whitespace), because next char
after the loop will never be whitespace; update the condition in that block to
test only for pos >= str.size() (remove the std::iswspace(str[pos]) check) so
matched = str.substr(start, pos-start) is returned only for trailing whitespace.

---

Nitpick comments:
In `@examples/qwen3_5/main.cpp`:
- Around line 55-64: The example currently calls model.chat(inputs) directly and
ARGeneration::chat does not reset the KV cache; to avoid stale state when adding
multi-turn or repeated independent prompts, call model.kvCache().reset() before
each independent inference (e.g., right before invoking model.chat(inputs) in
the try block). Reference the ARGeneration behavior and use
model.kvCache().reset() to clear the cache between prompts so each new prompt
starts with a fresh KV state.
- Around line 47-64: The current UI displays "Qwen3.5 Interactive CLI" and
instructs to use 'exit'/'quit' but only reads a single prompt; change to a real
interactive loop: wrap the prompt/read/process/response block in a while loop
that repeatedly prints "Prompt text (or 'exit/quit'): ", reads prompt_text,
breaks when prompt_text is "exit" or "quit", and otherwise calls
tokenizer.convertMessage(...) and iterates model.chat(...) to stream responses
(use the existing tokenizer.detokenize and step.cur_token_id usage). Ensure
exceptions are caught per-iteration and the prompt/response separator
(std::string(60,'-')) is printed after each exchange so multi-turn behavior
matches the banner.

In `@mllm/models/qwen3_5/modeling_qwen3_5.hpp`:
- Around line 537-546: The constructor Qwen3_5ForCausalLM currently initializes
kv_cache_ with cfg.num_hidden_layers causing slots for GDN layers that never use
the cache; change the allocation to use only the number of full-attention layers
(compute count of layers where attention is enabled) when constructing
nn::StaticCache and adjust any code that sets layer_idx_ in those full-attention
layer objects so layer_idx_ becomes an index into this reduced cache rather than
the original global layer index (update places that reference layer_idx_ to use
the new mapping). Ensure the cache creation uses the same cfg.max_cache_length,
cfg.num_attention_heads, cfg.num_key_value_heads, head_dim and types as before.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3d4151fc-e18a-478c-96cc-f69c90a24e1f

📥 Commits

Reviewing files that changed from the base of the PR and between 05c75f8 and 1e672a9.

📒 Files selected for processing (12)
  • examples/CMakeLists.txt
  • examples/qwen3_5/CMakeLists.txt
  • examples/qwen3_5/main.cpp
  • mllm/backends/qnn/aot/passes/LLM2QnnLoweringPass.cpp
  • mllm/backends/qnn/aot/visitor/RoPE.cpp
  • mllm/backends/qnn/aot/visitor/RoPE.hpp
  • mllm/backends/qnn/aot/visitor/SiLU.cpp
  • mllm/backends/qnn/aot/visitor/SiLU.hpp
  • mllm/models/qwen3_5/configuration_qwen3_5.hpp
  • mllm/models/qwen3_5/modeling_qwen3_5.hpp
  • mllm/models/qwen3_5/modeling_qwen3_5_qnn_aot.hpp
  • mllm/models/qwen3_5/tokenization_qwen3_5.hpp

Comment on lines +30 to +34
if (help.isSet()) {
Argparse::printHelp();
mllm::shutdownContext();
return 0;
}

⚠️ Potential issue | 🟡 Minor

Help flag check may fail due to required argument validation.

The help flag is checked after Argparse::parse(argc, argv) (line 17), but arguments are marked as required(true). If the user runs with just -h, the parser may fail before reaching the help check.

🛠️ Proposed fix

Consider checking for help before validating required arguments, or ensuring Argparse::parse doesn't error on missing required args when -h is present. A common pattern is:

+  // Check for help before full parse
+  for (int i = 1; i < argc; ++i) {
+    if (std::string(argv[i]) == "-h" || std::string(argv[i]) == "--help") {
+      Argparse::printHelp();
+      mllm::shutdownContext();
+      return 0;
+    }
+  }
+
   Argparse::parse(argc, argv);
-
-  if (help.isSet()) {
-    Argparse::printHelp();
-    mllm::shutdownContext();
-    return 0;
-  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/qwen3_5/main.cpp` around lines 30 - 34, The help-check currently
runs after Argparse::parse(argc, argv) which will fail if required(true)
arguments are missing; modify the flow so help is detected before required-arg
validation by either (a) checking argv for "-h" or "--help" before calling
Argparse::parse, or (b) using a parser option/alternate parse method that allows
short-circuiting validation when help is present; update the code around
Argparse::parse(argc, argv), help.isSet(), and the subsequent
Argparse::printHelp()/mllm::shutdownContext() so that printHelp() is invoked and
the program exits without triggering required-argument errors.

Comment on lines +45 to +52
auto inputs_it = op->inputs().begin();
auto i_0 = (*inputs_it)->cast_<ir::tensor::TensorValue>(); // input tensor
auto i_sin = (*std::next(inputs_it))->cast_<ir::tensor::TensorValue>(); // sin embeddings
auto i_cos = (*std::next(inputs_it, 2))->cast_<ir::tensor::TensorValue>(); // cos embeddings

// RoPE output
auto o_0 = op->outputs().front()->cast_<ir::tensor::TensorValue>();


⚠️ Potential issue | 🟠 Major

Validate RoPE input/output arity before dereferencing iterators.

Line 47, Line 48, and Line 51 assume 3 inputs and 1 output exist. Missing checks here can dereference invalid iterators on malformed IR.

Proposed fix
   // RoPE inputs: x, sin, cos
+  if (op->inputs().size() < 3 || op->outputs().empty()) {
+    MLLM_ERROR("RoPE lowering expects at least 3 inputs and 1 output");
+    return false;
+  }
   auto inputs_it = op->inputs().begin();
   auto i_0 = (*inputs_it)->cast_<ir::tensor::TensorValue>();          // input tensor
   auto i_sin = (*std::next(inputs_it))->cast_<ir::tensor::TensorValue>();  // sin embeddings
   auto i_cos = (*std::next(inputs_it, 2))->cast_<ir::tensor::TensorValue>();  // cos embeddings

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/backends/qnn/aot/visitor/RoPE.cpp` around lines 45 - 52, The code in
RoPE.cpp dereferences op->inputs() and op->outputs() without checking arity
(uses inputs_it, i_0, i_sin, i_cos and o_0); add explicit validation that
op->inputs().size() >= 3 and op->outputs().size() >= 1 before creating inputs_it
or dereferencing std::next, and handle the error path (return, throw, or log and
abort) consistent with surrounding error handling conventions so you never
dereference invalid iterators for the RoPE operator.

Comment on lines +57 to +60
// pose_type parameter: 0 for standard RoPE
// The custom HTP op uses this to select between different RoPE variants
qnn_op_node->emplaceParamScalar(mllm::qnn::QNNParamScalarWrapper::create("pose_type", static_cast<uint32_t>(0)));


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "1) Verify RoPE kernel pose_type handling in custom package:"
rg -n -C3 'pose_type|only support pose_type' mllm/backends/qnn/custom-op-package/LLaMAPackage/src/ops/RoPE.cpp

echo
echo "2) Verify current lowering hardcodes pose_type:"
rg -n -C3 'emplaceParamScalar|pose_type' mllm/backends/qnn/aot/visitor/RoPE.cpp

echo
echo "3) Find where RoPE variant is represented in IR/AOp to thread through lowering:"
rg -n -C3 'class .*RoPEOp|pose_type' mllm/core mllm/compile mllm/models

Repository: UbiquitousLearning/mllm

Length of output: 6375


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check RoPEOp struct/class definition ===" 
cat -n mllm/core/aops/RoPEOp.hpp

echo
echo "=== Check visitor code to see how RoPEOp variant is extracted ==="
cat -n mllm/backends/qnn/aot/visitor/RoPE.cpp | head -80

Repository: UbiquitousLearning/mllm

Length of output: 5148


Add pose_type field to RoPEOpOptions and pass through the correct variant to the custom op kernel.

The hardcoded pose_type = 0 at line 59 will cause a fatal error in the custom HTP op. The kernel (mllm/backends/qnn/custom-op-package/LLaMAPackage/src/ops/RoPE.cpp lines 664–665) only supports pose_type == 2 (LLaMA) or pose_type == 4, and rejects all other values with ErrorFatal. The IR representation (RoPEOpOptions) currently lacks a field to capture the variant, so the lowering cannot pass the correct value. Either add a pose_type field to RoPEOpOptions, or derive it from existing parameters (e.g., rope_theta).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/backends/qnn/aot/visitor/RoPE.cpp` around lines 57 - 60, The lowering
currently hardcodes pose_type=0 in RoPE.cpp when calling
qnn_op_node->emplaceParamScalar, but the custom kernel expects pose_type 2 or 4;
add a pose_type field to the IR struct/class RoPEOpOptions (or derive it
deterministically from an existing field like rope_theta) and populate it from
the op creation site, then replace the hardcoded static_cast<uint32_t>(0) in
RoPE.cpp with the value from RoPEOpOptions (passed through into
QNNParamScalarWrapper::create("pose_type", static_cast<uint32_t>(pose_type))).
Ensure the new field is serialized/constructed where RoPEOpOptions is
instantiated so the lowering can read it.

Comment on lines +44 to +47
// Copy quantization recipe from output to intermediate if available
if (op->getAttr("quant_recipe")) {
sigmoid_out->setAttr("quant_recipe", op->getAttr("quant_recipe"));
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check how other QNN AOT patterns handle quant_recipe
rg -n "quant_recipe" mllm/backends/qnn/aot/visitor/ -A2 -B2

Repository: UbiquitousLearning/mllm

Length of output: 3254


🏁 Script executed:

# Get the full SiLU.cpp file to understand the complete context
cat -n mllm/backends/qnn/aot/visitor/SiLU.cpp

Repository: UbiquitousLearning/mllm

Length of output: 3503


🏁 Script executed:

# Check if there are other activation/unary patterns with optional quant_recipe handling
rg "if.*getAttr.*quant_recipe" mllm/backends/qnn/aot/visitor/ -B5 -A5

Repository: UbiquitousLearning/mllm

Length of output: 1043


🏁 Script executed:

# Count pattern frequency - how many patterns use MLLM_RETURN_FALSE_IF_NOT vs optional check
echo "=== Patterns requiring quant_recipe (MLLM_RETURN_FALSE_IF_NOT) ===" && \
rg "MLLM_RETURN_FALSE_IF_NOT.*quant_recipe" mllm/backends/qnn/aot/visitor/ | wc -l && \
echo "=== Patterns with optional quant_recipe (if check) ===" && \
rg "if.*getAttr.*quant_recipe" mllm/backends/qnn/aot/visitor/ | wc -l

Repository: UbiquitousLearning/mllm

Length of output: 192


Ensure quantization metadata is handled consistently with other QNN patterns.

SiLU treats quant_recipe as optional (lines 45-47), but all other QNN AOT patterns—RMSNorm, Add, Mul, and Neg—require it via MLLM_RETURN_FALSE_IF_NOT. This inconsistency means intermediate sigmoid_out tensors may lack quantization metadata if the attribute is missing, potentially breaking quantization expectations in QNN deployment.

Either make quant_recipe required here for consistency with other patterns, or document why optional handling is intentional for SiLU decomposition.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/backends/qnn/aot/visitor/SiLU.cpp` around lines 44 - 47, The SiLU
visitor currently copies quant_recipe only if present which is inconsistent with
other QNN AOT patterns; update the SiLU handling to require the quantization
recipe the same way RMSNorm/Add/Mul/Neg do by checking for the attribute with
MLLM_RETURN_FALSE_IF_NOT (or equivalent) on op->getAttr("quant_recipe") and then
unconditionally set it on sigmoid_out via sigmoid_out->setAttr("quant_recipe",
op->getAttr("quant_recipe")); this enforces presence of quant metadata and
matches existing QNN patterns.

Comment on lines +34 to +61
full_attention_interval = tc.value("full_attention_interval", 4);

// GDN (Gated Delta Network) parameters
linear_num_key_heads = tc.value("linear_num_key_heads", 16);
linear_num_value_heads = tc.value("linear_num_value_heads", 16);
linear_key_head_dim = tc.value("linear_key_head_dim", 128);
linear_value_head_dim = tc.value("linear_value_head_dim", 128);
linear_conv_kernel_dim = tc.value("linear_conv_kernel_dim", 4);

// RoPE parameters (nested under rope_parameters)
if (tc.contains("rope_parameters")) {
auto& rp = tc["rope_parameters"];
rope_theta = rp.value("rope_theta", 10000000.0f);
partial_rotary_factor = rp.value("partial_rotary_factor", 0.25f);
}

// Layer types: explicit list or computed from full_attention_interval
if (tc.contains("layer_types")) {
for (auto& lt : tc["layer_types"]) { layer_types.push_back(lt.get<std::string>()); }
} else {
for (int i = 0; i < num_hidden_layers; ++i) {
if ((i + 1) % full_attention_interval == 0) {
layer_types.push_back("full_attention");
} else {
layer_types.push_back("linear_attention");
}
}
}

⚠️ Potential issue | 🔴 Critical

Validate full_attention_interval and layer_types before deriving layer layout.

At Line 55, (i + 1) % full_attention_interval is unsafe when full_attention_interval <= 0. Also, when layer_types is provided (Line 51-53), there is no validation for allowed values or expected length vs num_hidden_layers, which can break downstream helpers.

Proposed fix
     // Qwen3.5 hybrid attention
     attn_output_gate = tc.value("attn_output_gate", true);
     full_attention_interval = tc.value("full_attention_interval", 4);
+    if (full_attention_interval <= 0) {
+      throw std::invalid_argument("full_attention_interval must be > 0");
+    }
@@
     if (tc.contains("layer_types")) {
       for (auto& lt : tc["layer_types"]) { layer_types.push_back(lt.get<std::string>()); }
+      if (static_cast<int32_t>(layer_types.size()) != num_hidden_layers) {
+        throw std::invalid_argument("layer_types length must equal num_hidden_layers");
+      }
+      for (const auto& lt : layer_types) {
+        if (lt != "full_attention" && lt != "linear_attention") {
+          throw std::invalid_argument("layer_types must be 'full_attention' or 'linear_attention'");
+        }
+      }
     } else {

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/models/qwen3_5/configuration_qwen3_5.hpp` around lines 34 - 61, The code
computes layer_types using (i + 1) % full_attention_interval which is unsafe
when full_attention_interval <= 0 and it accepts provided layer_types without
validating length or allowed values; update initialization to first validate
full_attention_interval is > 0 (otherwise set to 1 or return an error) before
using it in the loop, and when tc.contains("layer_types") verify that
layer_types.size() == num_hidden_layers and each entry is one of the allowed
strings (e.g., "full_attention" or "linear_attention")—if validation fails,
either normalize/replace invalid entries or emit an error so downstream helpers
do not break. Ensure these checks reference the variables
full_attention_interval, layer_types, and num_hidden_layers and are performed
before deriving the computed layer layout.
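The validation this comment asks for can be sketched in isolation. The `Config` struct and `finalizeLayerTypes` helper below are hypothetical stand-ins for the real `Qwen3_5Config`; the 24-layer / interval-4 defaults mirror the 18 GDN + 6 full-attention layout mentioned in this PR's commits:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for the real config; only the fields discussed above.
struct Config {
  int num_hidden_layers = 24;
  int full_attention_interval = 4;
  std::vector<std::string> layer_types;
};

// Validate an explicit layer_types list, or derive one from the interval,
// following the checks suggested in the review comment.
void finalizeLayerTypes(Config& cfg) {
  if (cfg.full_attention_interval <= 0) {
    throw std::invalid_argument("full_attention_interval must be > 0");
  }
  if (!cfg.layer_types.empty()) {
    if (static_cast<int>(cfg.layer_types.size()) != cfg.num_hidden_layers) {
      throw std::invalid_argument("layer_types length must equal num_hidden_layers");
    }
    for (const auto& lt : cfg.layer_types) {
      if (lt != "full_attention" && lt != "linear_attention") {
        throw std::invalid_argument("unknown layer type: " + lt);
      }
    }
    return;
  }
  for (int i = 0; i < cfg.num_hidden_layers; ++i) {
    cfg.layer_types.push_back((i + 1) % cfg.full_attention_interval == 0
                                  ? "full_attention"
                                  : "linear_attention");
  }
}
```

With the defaults this yields exactly 6 full-attention layers out of 24, and a non-positive interval fails fast instead of causing a division by zero.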

Comment on lines +360 to +363
auto attn_weights = nn::functional::matmul(q, k, false, true); // [B, H, S, S]
auto scale = 1.f / sqrtf(static_cast<float>(head_k_dim_));
attn_weights = attn_weights * scale;
auto output = nn::functional::matmul(attn_weights, v); // [B, H, S, Dv]

⚠️ Potential issue | 🟠 Major

Missing softmax in GDN attention computation.

The comment at line 353 describes softmax(Q K^T / sqrt(d)) V, but the code only computes (Q K^T / sqrt(d)) V without applying softmax. This produces unbounded attention weights instead of a probability distribution.

🐛 Proposed fix
     auto attn_weights = nn::functional::matmul(q, k, false, true);  // [B, H, S, S]
     auto scale = 1.f / sqrtf(static_cast<float>(head_k_dim_));
     attn_weights = attn_weights * scale;
+    // TODO: Add causal mask here for autoregressive behavior
+    attn_weights = nn::functional::softmax(attn_weights, -1);
     auto output = nn::functional::matmul(attn_weights, v);  // [B, H, S, Dv]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     auto attn_weights = nn::functional::matmul(q, k, false, true);  // [B, H, S, S]
     auto scale = 1.f / sqrtf(static_cast<float>(head_k_dim_));
     attn_weights = attn_weights * scale;
+    // TODO: Add causal mask here for autoregressive behavior
+    attn_weights = nn::functional::softmax(attn_weights, -1);
     auto output = nn::functional::matmul(attn_weights, v);  // [B, H, S, Dv]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/models/qwen3_5/modeling_qwen3_5.hpp` around lines 360 - 363, The
attention code omits the softmax step: after computing attn_weights = matmul(q,
k, false, true) and scaling by 1/sqrt(head_k_dim_), apply a softmax over the
source sequence dimension (the S dimension of attn_weights) before multiplying
by v; update the sequence that builds attn_weights (using
nn::functional::softmax or equivalent) so attn_weights becomes a probability
distribution prior to the subsequent matmul with v, keeping existing variables
q, k, v, head_k_dim_, and the final output = matmul(attn_weights, v).
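The missing step is a row-wise softmax along the last dimension of the scores. Independent of the mllm tensor API, a minimal scalar sketch of that normalization (with the usual max-subtraction for numerical stability) looks like:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Row-wise softmax over attention scores, applied along the last dimension.
// Subtracting the row max keeps the exponentials numerically stable.
std::vector<float> softmaxRow(const std::vector<float>& scores) {
  float m = scores[0];
  for (float s : scores) m = std::max(m, s);
  float sum = 0.f;
  std::vector<float> out(scores.size());
  for (size_t i = 0; i < scores.size(); ++i) {
    out[i] = std::exp(scores[i] - m);
    sum += out[i];
  }
  for (float& o : out) o /= sum;  // each row now sums to 1
  return out;
}
```

Without this step the weights are unbounded raw scores, which is exactly the issue flagged above.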

Comment on lines +553 to +555
    if (cfg.tie_word_embeddings) {
      lm_head_ = reg<nn::Linear>("lm_head_out", cfg.hidden_size, cfg.vocab_size, false, cfg.linear_impl_type);
    }

⚠️ Potential issue | 🟠 Major

Inverted logic for tie_word_embeddings.

The standard meaning of tie_word_embeddings=true is that the embedding weights are shared with the output projection, so no separate lm_head is needed. The code appears to have inverted logic:

  • Line 553-555: Creates lm_head_ only when tie_word_embeddings is true
  • Line 591: Uses lm_head_ only when tie_word_embeddings_ is true

This means when embeddings are tied, a separate projection is created and used, which defeats the purpose of tying.

🐛 Proposed fix
-    if (cfg.tie_word_embeddings) {
+    if (!cfg.tie_word_embeddings) {
       lm_head_ = reg<nn::Linear>("lm_head_out", cfg.hidden_size, cfg.vocab_size, false, cfg.linear_impl_type);
     }

And for the forward pass:

-    if (tie_word_embeddings_) { sequence = lm_head_(sequence); }
+    if (tie_word_embeddings_) {
+      // Reuse embedding weights transposed as output projection
+      sequence = nn::functional::linear(sequence, llm.language_model_.embedding_.weight(), std::nullopt);
+    } else {
+      sequence = lm_head_(sequence);
+    }

Or alternatively, always create lm_head_ but load embedding weights into it when tying.

Also applies to: 591-591

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/models/qwen3_5/modeling_qwen3_5.hpp` around lines 553 - 555, The logic
for tie_word_embeddings is inverted: instead of creating and using lm_head_ when
cfg.tie_word_embeddings is true, create lm_head_ only when
cfg.tie_word_embeddings is false (or alternatively always create but copy/alias
embedding weights into lm_head_ when cfg.tie_word_embeddings is true). Update
the construction in the constructor around lm_head_ (currently guarded by
cfg.tie_word_embeddings) to be guarded by !cfg.tie_word_embeddings (or add
weight sharing code to copy embedding->lm_head_ when tying), and update the
forward usage (the branch keyed by tie_word_embeddings_) to use lm_head_ only
when embeddings are not tied (or use the shared embedding weights when tying).
Ensure references to lm_head_, cfg.tie_word_embeddings and tie_word_embeddings_
are consistent after the change.

Comment on lines +103 to +113
    // 6. Match \s+(?!\S)
    if (std::iswspace(str[pos])) {
      size_t start = pos;
      while (pos < str.size() && std::iswspace(str[pos])) ++pos;
      if (pos >= str.size() || std::iswspace(str[pos])) {
        matched = str.substr(start, pos - start);
        return true;
      } else {
        pos = start;
      }
    }

⚠️ Potential issue | 🟡 Minor

Logic issue in pattern 6: lookahead (?!\S) is incorrectly implemented.

The comment says "Match \s+(?!\S)" (whitespace not followed by non-whitespace), but the check std::iswspace(str[pos]) at line 107 is dead code: the while loop at line 106 has already consumed every whitespace character, so str[pos] can never be whitespace at that point. The condition therefore reduces to pos >= str.size(), meaning the pattern only ever matches trailing whitespace; the redundant test obscures that intent and should be removed.

🔧 Proposed fix

The condition should check that we're at end-of-string OR the next character after whitespace is also whitespace (which can't happen since we consumed all whitespace). The intent seems to be matching trailing whitespace. Consider:

   // 6. Match \s+(?!\S) — trailing whitespace not followed by non-whitespace
   if (std::iswspace(str[pos])) {
     size_t start = pos;
     while (pos < str.size() && std::iswspace(str[pos])) ++pos;
-    if (pos >= str.size() || std::iswspace(str[pos])) {
+    if (pos >= str.size()) {
+      // Only match if this whitespace is at end of string (true trailing whitespace)
       matched = str.substr(start, pos - start);
       return true;
-    } else {
+    }
+    // Otherwise, fall through to pattern 7 for general whitespace
       pos = start;
-    }
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    // 6. Match \s+(?!\S)
    if (std::iswspace(str[pos])) {
      size_t start = pos;
      while (pos < str.size() && std::iswspace(str[pos])) ++pos;
-     if (pos >= str.size() || std::iswspace(str[pos])) {
+     if (pos >= str.size()) {
+       // Only match if this whitespace is at end of string (true trailing whitespace)
        matched = str.substr(start, pos - start);
        return true;
-     } else {
-       pos = start;
-     }
+     }
+     // Otherwise, fall through to pattern 7 for general whitespace
+     pos = start;
    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/models/qwen3_5/tokenization_qwen3_5.hpp` around lines 103 - 113, The
lookahead logic for pattern 6 is wrong: after consuming all whitespace in the
while loop inside the tokenization code (the block that sets start and advances
pos), the subsequent check should only accept the match when we reached
end-of-string (trailing whitespace), because next char after the loop will never
be whitespace; update the condition in that block to test only for pos >=
str.size() (remove the std::iswspace(str[pos]) check) so matched =
str.substr(start, pos-start) is returned only for trailing whitespace.
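The intended behavior can be checked in isolation. `matchTrailingWhitespace` is a hypothetical helper mirroring the fixed pattern-6 block, extracted from the tokenizer context:

```cpp
#include <cassert>
#include <cwctype>
#include <string>

// Match \s+(?!\S): a whitespace run is accepted only when nothing follows it,
// i.e. it is trailing whitespace. Otherwise pos is restored so a later,
// more general \s+ rule can take over.
bool matchTrailingWhitespace(const std::wstring& str, size_t& pos, std::wstring& matched) {
  if (pos >= str.size() || !std::iswspace(str[pos])) return false;
  size_t start = pos;
  while (pos < str.size() && std::iswspace(str[pos])) ++pos;
  if (pos >= str.size()) {
    matched = str.substr(start, pos - start);
    return true;
  }
  pos = start;  // interior whitespace: decline the match
  return false;
}
```

Trailing whitespace matches and advances `pos`; interior whitespace declines and restores `pos` for the next rule.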

Comment on lines +206 to +211
  std::wstring detokenize(int64_t pos_idx) override {
    auto str = _detokenize(pos_idx);
    std::string utf_8_str;
    for (wchar_t c : str) { utf_8_str.push_back((unsigned char)(bytes_2_unicode_dict_inverse_[c])); }
    return {mllm::preprocessor::utf8string2WideString(utf_8_str)};
  }

⚠️ Potential issue | 🟡 Minor

Missing error handling for unknown characters in detokenize.

If a wchar_t from the BPE output is not found in bytes_2_unicode_dict_inverse_, std::unordered_map::operator[] inserts and returns a value-initialized entry (0), which puts a null byte into the UTF-8 string. This can silently truncate or corrupt the decoded text.

🛡️ Proposed fix with validation
   std::wstring detokenize(int64_t pos_idx) override {
     auto str = _detokenize(pos_idx);
     std::string utf_8_str;
-    for (wchar_t c : str) { utf_8_str.push_back((unsigned char)(bytes_2_unicode_dict_inverse_[c])); }
+    for (wchar_t c : str) {
+      auto it = bytes_2_unicode_dict_inverse_.find(c);
+      if (it != bytes_2_unicode_dict_inverse_.end()) {
+        utf_8_str.push_back(static_cast<unsigned char>(it->second));
+      }
+      // Silently skip unmapped characters, or optionally log a warning
+    }
     return {mllm::preprocessor::utf8string2WideString(utf_8_str)};
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/models/qwen3_5/tokenization_qwen3_5.hpp` around lines 206 - 211, In
detokenize, avoid using bytes_2_unicode_dict_inverse_[c] (operator[]) which
inserts defaults for unknown wchar_t; instead, validate lookups from
bytes_2_unicode_dict_inverse_ for each wchar_t returned by _detokenize and
handle misses explicitly (e.g., use find() or at() with try/catch) — on missing
mapping append a safe UTF-8 replacement sequence (such as the UTF-8 bytes for
U+FFFD or a question mark) to utf_8_str or return an error; update the function
around _detokenize, utf_8_str, and bytes_2_unicode_dict_inverse_ to perform this
check before converting with mllm::preprocessor::utf8string2WideString.
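A defensive lookup along the lines of the proposed fix, substituting U+FFFD for unmapped characters; `mapBytesBack` and its map parameter are stand-ins for the tokenizer's member `bytes_2_unicode_dict_inverse_`:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Map BPE byte-level characters back to raw bytes; unknown characters become
// the UTF-8 encoding of U+FFFD instead of a silent null byte from operator[].
std::string mapBytesBack(const std::wstring& str,
                         const std::unordered_map<wchar_t, unsigned char>& inverse) {
  std::string out;
  for (wchar_t c : str) {
    auto it = inverse.find(c);  // find() never inserts, unlike operator[]
    if (it != inverse.end()) {
      out.push_back(static_cast<char>(it->second));
    } else {
      out += "\xEF\xBF\xBD";  // U+FFFD REPLACEMENT CHARACTER
    }
  }
  return out;
}
```

Using `find()` also keeps the method logically const, since `operator[]` mutates the map on a miss.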

Comment on lines +226 to +229
  ARGenerationOutputPast convertMessage(const Qwen3_5Message& message) {
    auto applied_string = Qwen3_5Message::message_template;
    size_t pos = applied_string.find("{{{prompt}}}");
    applied_string.replace(pos, 12, message.prompt);

⚠️ Potential issue | 🟡 Minor

Potential undefined behavior if template placeholder is missing.

If message_template does not contain "{{{prompt}}}", find() returns std::string::npos, and calling replace(npos, 12, ...) throws std::out_of_range, aborting message conversion with an unhandled exception.

🛡️ Proposed fix with validation
   ARGenerationOutputPast convertMessage(const Qwen3_5Message& message) {
     auto applied_string = Qwen3_5Message::message_template;
     size_t pos = applied_string.find("{{{prompt}}}");
+    if (pos == std::string::npos) {
+      throw std::runtime_error("Message template missing {{{prompt}}} placeholder");
+    }
     applied_string.replace(pos, 12, message.prompt);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  ARGenerationOutputPast convertMessage(const Qwen3_5Message& message) {
    auto applied_string = Qwen3_5Message::message_template;
    size_t pos = applied_string.find("{{{prompt}}}");
+   if (pos == std::string::npos) {
+     throw std::runtime_error("Message template missing {{{prompt}}} placeholder");
+   }
    applied_string.replace(pos, 12, message.prompt);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/models/qwen3_5/tokenization_qwen3_5.hpp` around lines 226 - 229, In
convertMessage, guard against std::string::npos returned by
Qwen3_5Message::message_template.find("{{{prompt}}}") before calling replace:
check the found position (pos) and if it is npos handle the missing placeholder
(for example, log/throw/return an error or append the prompt into the template)
instead of calling applied_string.replace(pos, 12, message.prompt); this
prevents undefined behavior when the "{{{prompt}}}" token is absent.
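The guard can live in a small helper; `applyPrompt` is hypothetical and just demonstrates the npos check (it also derives the replace length from the placeholder instead of the magic constant 12):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Substitute {{{prompt}}} into a chat template, failing loudly when the
// placeholder is absent instead of calling replace() with npos.
std::string applyPrompt(std::string tmpl, const std::string& prompt) {
  static const std::string kPlaceholder = "{{{prompt}}}";
  size_t pos = tmpl.find(kPlaceholder);
  if (pos == std::string::npos) {
    throw std::runtime_error("template missing {{{prompt}}} placeholder");
  }
  tmpl.replace(pos, kPlaceholder.size(), prompt);
  return tmpl;
}
```

A malformed template now produces a clear error at the call site rather than an unexplained std::out_of_range from deep inside the tokenizer.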

Ryan Yu and others added 2 commits March 21, 2026 22:24
…ect inference

Replace simplified linear attention placeholder with complete Gated Delta Network:
- Manual causal depthwise Conv1D for prefill and decode (framework Conv1D
  had stride issues with transposed input)
- Sequential scan with delta rule: decay/beta gating from A_log/dt_bias,
  L2 normalization on q/k, query scaling by 1/sqrt(d_k)
- Conv state and recurrent state management across decode steps
- Standard RMSNorm (no add_unit_offset) matching Qwen3_5RMSNormGated
- EOS set to im_end_token_id for proper chat termination

Verified output matches HuggingFace reference for "Hello world" and math prompts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
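The delta-rule scan this commit describes can be illustrated with a heavily simplified single-step recurrence. This is a generic gated delta update on plain vectors, not the model's actual implementation; `gdnStep`, `readOut`, and the scalar `alpha`/`beta` gates are illustrative stand-ins for the per-head decay and beta derived from A_log/dt_bias:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<std::vector<float>>;  // state: d_v rows x d_k cols

// One step of a simplified gated delta rule: decay the state by alpha, then
// write the prediction error (v - S k) along direction k, scaled by beta.
void gdnStep(Mat& S, const Vec& k, const Vec& v, float alpha, float beta) {
  size_t dv = S.size(), dk = S[0].size();
  Vec Sk(dv, 0.f);  // current prediction for key k
  for (size_t i = 0; i < dv; ++i)
    for (size_t j = 0; j < dk; ++j) Sk[i] += S[i][j] * k[j];
  for (size_t i = 0; i < dv; ++i)
    for (size_t j = 0; j < dk; ++j)
      S[i][j] = alpha * S[i][j] + beta * (v[i] - Sk[i]) * k[j];
}

// Read out the recurrent state for a query: o = S q.
Vec readOut(const Mat& S, const Vec& q) {
  Vec o(S.size(), 0.f);
  for (size_t i = 0; i < S.size(); ++i)
    for (size_t j = 0; j < q.size(); ++j) o[i] += S[i][j] * q[j];
  return o;
}
```

The fixed-size state matrix `S` is what the commit's "recurrent state management across decode steps" carries forward, in place of a growing KV cache.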
…d model

Implements the full QNN ahead-of-time compilation and quantization pipeline
for Qwen3.5's hybrid architecture (18 GDN + 6 full attention layers).
Only the 6 full attention layers are compiled to QNN; GDN layers stay on CPU.

Python quantization pipeline:
- modeling_qwen3_5.py: PyTorch model with QDQ nodes for full attention layers
- runner.py: Qwen3_5Quantizer with calibrate/convert/save workflow
- train.py: CLI entry point for end-to-end quantization

C++ AOT compile & runtime:
- compile.cpp: traces 6 attention layers x 2 seq lengths into 12 QNN graphs
- modeling_qwen3_5_qnn_aot.hpp: per-layer QNN trace model with partial RoPE,
  output gating, and Conv2D LPBQ w4a16 quantization
- aot_run.cpp: hybrid CPU+QNN runtime with per-layer dispatch
- convert_weights.py: pre-bakes +1.0 RMSNorm offset and RoPE tables

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>