[quantization] Introduce wrappers for Qwen3VLTextDecoderLayer and Qwen3VLTextModel #535
dayo09 wants to merge 4 commits into Samsung:main
Conversation
…n3VLTextModel

- Add `QuantQwen3VLTextDecoderLayer`: wraps the attention, MLP, and layernorm blocks; pre-builds static causal mask and RoPE templates to avoid dynamic ops in the forward pass
- Add `QuantQwen3VLTextModel`: pre-computes the shared causal mask and RoPE once and passes them to every decoder layer, so they are quantized exactly once rather than independently in each layer
- Register both wrappers in `_CORE_MODULES`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
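The "pre-built static causal mask" idea from the description can be sketched as follows. This is a minimal illustration, not the PR's actual wrapper: the class name, `MAX_SEQ_LEN` cap, and mask construction are assumptions; only the slice-in-forward pattern mirrors the code under review.

```python
import torch

MAX_SEQ_LEN = 128  # assumed cap; the real wrapper's limit may differ


class CausalMaskTemplate:
    """Build a full additive causal mask once at init, so forward only
    slices it instead of creating tensors dynamically (which a static
    graph capture would otherwise specialize per call)."""

    def __init__(self, max_seq_len: int = MAX_SEQ_LEN) -> None:
        mask = torch.full((max_seq_len, max_seq_len), float("-inf"))
        # -inf strictly above the diagonal, 0 elsewhere; add batch/head dims.
        self.causal_mask_template = torch.triu(mask, diagonal=1)[None, None]

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        seq_len = hidden_states.size(1)
        # Static slice of the pre-built template, as in the wrapped layer.
        return self.causal_mask_template[..., :seq_len, :seq_len].to(
            hidden_states.device
        )
```

The slicing keeps every op in the forward pass shape-only, which is what makes the mask friendly to quantization and export.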
    self._fq(cos, self.obs_cos),
    self._fq(sin, self.obs_sin),
@dayo09
Sorry for the disturbance, but this will disable the dependence on the size of the inputs (it proved to be useful for Llama). It's similar to `self.causal_mask_template[..., :seq_len, :seq_len].to(device)` above (Ln127), IMHO.
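The reviewer's suggestion, applied to the quoted `cos`/`sin` lines, would mean pre-computing RoPE tables once for a maximum length and slicing them per call. A rough sketch, assuming a standard RoPE construction with a hypothetical `MAX_SEQ_LEN`, `HEAD_DIM`, and base of 10000 (the PR's actual template construction and MRoPE handling are not shown here):

```python
import torch

MAX_SEQ_LEN = 128  # assumed cap
HEAD_DIM = 64      # assumed head dimension

# Pre-compute cos/sin once for the maximum sequence length.
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, HEAD_DIM, 2).float() / HEAD_DIM))
angles = torch.arange(MAX_SEQ_LEN).float()[:, None] * inv_freq[None, :]
cos_template = torch.cat([angles.cos(), angles.cos()], dim=-1)  # (S_max, D)
sin_template = torch.cat([angles.sin(), angles.sin()], dim=-1)


def rope_for(seq_len: int, device: torch.device):
    # Slice the templates instead of recomputing, so the forward pass
    # does not depend on the size of the inputs.
    return (
        cos_template[:seq_len].to(device),
        sin_template[:seq_len].to(device),
    )
```

The sliced tensors would then be fed through the fake-quant observers (`self._fq(cos, self.obs_cos)` etc.) exactly as in the quoted hunk.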
4829fa4 to e71a9b1
    print(f"│ Mean |diff|: {(q_out - fp_out).abs().mean().item():.6f}")
    print(f"│ PEIR        : {compute_peir(fp_out, q_out) * 100:.6f} %")
    print("└──────────────────────────────────────────────────────")
    print(plot_two_outputs(fp_out, q_out))
┌───────────── Quantization Error Summary ─────────────
│ Mean |diff|: 0.071578
│ PEIR : 9.253764 %
└──────────────────────────────────────────────────────
┌────────────────────────────────────────────┐
5.1┤ • │
3.4┤ • •••• • │
1.7┤ •••••••••• │
0.0┤ •••••••••• │
-1.7┤ • •••••• │
-3.4┤ •••••••• │
-5.1┤ • │
└┬──────────┬──────────┬─────────┬──────────┬┘
  -5.1       -2.5        0.0        2.5       5.1

    print(f"│ Mean |diff|: {(q_out - fp_out).abs().mean().item():.6f}")
    print(f"│ PEIR        : {compute_peir(fp_out, q_out) * 100:.6f} %")
    print("└──────────────────────────────────────────────────────")
    print(plot_two_outputs(fp_out, q_out))
python3 tico/quantization/wrapq/examples/qwen/quantize_text_model.py
┌───────────── Quantization Error Summary ─────────────
│ Mean |diff|: 0.904804
│ PEIR : 351.709125 %
└──────────────────────────────────────────────────────
┌──────────────────────────────────────────┐
28.2┤ │
│ │
│ •• •• • │
4.7┤ ••••• │
│ ••••• │
│ •••• │
-18.7┤ ••• │
│ │
│ │
-42.2┤ │
│ │
│ │
│ │
-65.6┤ │
│ │
│ │
-89.1┤ │
│ │
│ • │
-112.5┤ │
└┬─────────┬──────────┬─────────┬─────────┬┘
 -112.5    -77.4     -42.2      -7.0      28.2
There is one big outlier. 😢
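The two summary metrics in these outputs can be reproduced with a small sketch. The mean-absolute-difference line is exactly the quoted expression; for PEIR I assume a "peak error to interval ratio" definition (largest absolute error divided by the float output's value range) — the repo's actual `compute_peir` may differ, so treat this as illustrative:

```python
import torch


def mean_abs_diff(fp_out: torch.Tensor, q_out: torch.Tensor) -> float:
    # Matches the quoted print statement: mean of |quantized - float|.
    return (q_out - fp_out).abs().mean().item()


def peir(fp_out: torch.Tensor, q_out: torch.Tensor) -> float:
    # Assumed definition: peak absolute error relative to the float
    # output's dynamic range. A single large outlier (as in the plot
    # above) inflates this metric even when the mean error is small.
    peak_error = (q_out - fp_out).abs().max()
    interval = fp_out.max() - fp_out.min()
    return (peak_error / interval).item()
```

Under this definition, one extreme outlier drives PEIR far past 100 % while barely moving the mean, which is consistent with the 351 % figure next to a 0.90 mean diff.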
    for m in (self.embed_tokens, self.norm):
        yield from m._all_observers()
    for m in self.layers:
        yield from m._all_observers()
🤔 I'm wondering about the purpose of yielding the observers of submodules' PTQWrappers since they return nothing anyway... (see also #494).
    yield from self.self_attn._all_observers()
    yield from self.mlp._all_observers()
    yield from self.input_layernorm._all_observers()
    yield from self.post_attention_layernorm._all_observers()
🤔 I'm wondering about the purpose of yielding the observers of submodules' PTQWrappers since they return nothing anyway... (see also #494).
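For context on the pattern being questioned: the parent wrapper re-yields every observer owned by its child wrappers, so one iteration over the top-level module visits all of them. A minimal stand-alone sketch (class names and the `Observer` stand-in are hypothetical, not the repo's actual types):

```python
from typing import Iterator


class Observer:
    """Hypothetical stand-in for a quantization observer."""

    def __init__(self, name: str) -> None:
        self.name = name


class LeafWrapper:
    """A leaf PTQ wrapper that owns one observer."""

    def __init__(self, name: str) -> None:
        self.obs = Observer(name)

    def _all_observers(self) -> Iterator[Observer]:
        yield self.obs


class LayerWrapper:
    """Parent wrapper delegating to its children, as in the quoted hunk."""

    def __init__(self) -> None:
        self.self_attn = LeafWrapper("attn")
        self.mlp = LeafWrapper("mlp")

    def _all_observers(self) -> Iterator[Observer]:
        yield from self.self_attn._all_observers()
        yield from self.mlp._all_observers()
```

The reviewer's point stands: if a child's `_all_observers()` yields nothing (as apparently discussed in #494), the `yield from` is a no-op and the delegation adds nothing.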
    def get_position_embeddings_for(
        self, hidden_states: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Delegate to the model's actual Qwen3VLTextRotaryEmbedding so that
        # MRoPE frequencies are split correctly by mrope_section.
        S = hidden_states.size(1)
        position_ids = torch.arange(S, device=hidden_states.device).unsqueeze(0)
        return self.rotary_emb(hidden_states, position_ids)
Note to self
When `tico.convert` is called, it uses `torch.export.export` to capture the model as a static computation graph. At that point, the dynamic `self.rotary_emb(hidden_states, position_ids)` call is executed with concrete inputs, so `cos` and `sin` end up as static tensor values.
Hi @mhs4670go
Let's add wrappers for the upper-level Qwen3VL layers.
TICO-DCO-1.0-Signed-off-by: Dayoung Lee dayoung.lee@samsung.com