[quantization] Introduce wrappers for Qwen3VLTextDecoderLayer and Qwen3VLTextModel#535

Draft
dayo09 wants to merge 4 commits into Samsung:main from dayo09:0303-text-models

Conversation

dayo09 (Contributor) commented Mar 5, 2026

Let's add wrappers for upper level qwen3vl layers.

TICO-DCO-1.0-Signed-off-by: Dayoung Lee dayoung.lee@samsung.com

…n3VLTextModel

- Add `QuantQwen3VLTextDecoderLayer`: wraps attention, MLP, and layernorm
  blocks; pre-builds static causal mask and RoPE templates to avoid
  dynamic ops in forward pass
- Add `QuantQwen3VLTextModel`: pre-computes shared causal mask and RoPE
  once and passes them to every decoder layer, so they are quantized
  exactly once rather than independently in each layer
- Register both wrappers in `_CORE_MODULES`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
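For context, a minimal sketch of the pre-built-template pattern the commit describes (hypothetical names and sizes, not the PR's actual code): the causal mask is built once at construction time, and the forward pass only slices it, so no mask-building ops appear in the exported graph.

```python
import torch

# Hypothetical maximum length; the wrapper would use the model's config.
max_seq_len = 8

# Additive causal mask template, built once: -inf strictly above the
# diagonal, 0 on and below it.
causal_mask_template = torch.full((max_seq_len, max_seq_len), float("-inf")).triu(1)

def causal_mask_for(seq_len: int) -> torch.Tensor:
    # The forward pass only slices the template; no dynamic construction.
    return causal_mask_template[:seq_len, :seq_len]

print(causal_mask_for(3))
```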
Comment on lines +190 to +191
self._fq(cos, self.obs_cos),
self._fq(sin, self.obs_sin),
@dayo09 (Contributor)
Sorry for the disturbance, but

self._fq(cos[:, : hidden_states.size(1), :], self.obs_cos),

will remove the dependence on the input size (this proved useful for Llama). It's similar to the `self.causal_mask_template[..., :seq_len, :seq_len].to(device)` above (Ln127). IMHO.
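A quick sketch of what the suggested slice does (toy shapes, not the PR's code): the template is precomputed for the maximum length, and slicing by `hidden_states.size(1)` makes the tensor fed into fake-quantization track the current sequence length rather than the template's full length.

```python
import torch

# Toy RoPE-style template precomputed for a maximum length (hypothetical shapes).
max_seq_len, head_dim = 16, 4
cos_template = torch.randn(1, max_seq_len, head_dim)

hidden_states = torch.randn(1, 5, 8)  # (batch, seq, hidden)

# Slicing ties the tensor to the current sequence length, analogous to the
# causal_mask_template slice mentioned above.
cos = cos_template[:, : hidden_states.size(1), :]
print(cos.shape)  # torch.Size([1, 5, 4])
```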

dayo09 force-pushed the 0303-text-models branch from 4829fa4 to e71a9b1 on March 11, 2026, 07:15
print(f"│ Mean |diff|: {(q_out - fp_out).abs().mean().item():.6f}")
print(f"│ PEIR : {compute_peir(fp_out, q_out) * 100:.6f} %")
print("└──────────────────────────────────────────────────────")
print(plot_two_outputs(fp_out, q_out))
dayo09 (Contributor, Author)
┌───────────── Quantization Error Summary ─────────────
│ Mean |diff|: 0.071578
│ PEIR       : 9.253764 %
└──────────────────────────────────────────────────────
    ┌────────────────────────────────────────────┐
 5.1┤                                         •  │
 3.4┤                              • ••••    •   │
 1.7┤                        ••••••••••          │
 0.0┤                 ••••••••••                 │
-1.7┤            • ••••••                        │
-3.4┤   ••••••••                                 │
-5.1┤  •                                         │
    └┬──────────┬──────────┬─────────┬──────────┬┘
   -5.1       -2.5        0.0       2.5       5.1 

print(f"│ Mean |diff|: {(q_out - fp_out).abs().mean().item():.6f}")
print(f"│ PEIR : {compute_peir(fp_out, q_out) * 100:.6f} %")
print("└──────────────────────────────────────────────────────")
print(plot_two_outputs(fp_out, q_out))
dayo09 (Contributor, Author)

python3 tico/quantization/wrapq/examples/qwen/quantize_text_model.py 
┌───────────── Quantization Error Summary ─────────────
│ Mean |diff|: 0.904804
│ PEIR       : 351.709125 %
└──────────────────────────────────────────────────────
      ┌──────────────────────────────────────────┐
  28.2┤                                          │
      │                                          │
      │                              •• •• •     │
   4.7┤                                •••••     │
      │                               •••••      │
      │                              ••••        │
 -18.7┤                             •••          │
      │                                          │
      │                                          │
 -42.2┤                                          │
      │                                          │
      │                                          │
      │                                          │
 -65.6┤                                          │
      │                                          │
      │                                          │
 -89.1┤                                          │
      │                                          │
      │                                       •  │
-112.5┤                                          │
      └┬─────────┬──────────┬─────────┬─────────┬┘
    -112.5     -77.4      -42.2     -7.0     28.2 

dayo09 (Contributor, Author)

There is one big outlier. 😢

Comment on lines +253 to +256
for m in (self.embed_tokens, self.norm):
    yield from m._all_observers()
for m in self.layers:
    yield from m._all_observers()
Contributor
🤔 I'm wondering about the purpose of yielding the observers of submodules' PTQWrappers since they return nothing anyway... (see also #494).
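For reference, this is just how generator delegation behaves at the language level (a toy sketch, unrelated to PTQWrapper internals): `yield from` over a generator that yields nothing contributes nothing to the caller, which is what makes the delegation look like a no-op.

```python
def child_observers():
    # A generator that yields no items (like an _all_observers that has
    # nothing of its own to report).
    return
    yield  # unreachable; only makes this function a generator

def parent_observers():
    yield from child_observers()  # contributes zero items
    yield "own-observer"

print(list(parent_observers()))  # ['own-observer']
```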

Comment on lines +214 to +217
yield from self.self_attn._all_observers()
yield from self.mlp._all_observers()
yield from self.input_layernorm._all_observers()
yield from self.post_attention_layernorm._all_observers()
Contributor
🤔 I'm wondering about the purpose of yielding the observers of submodules' PTQWrappers since they return nothing anyway... (see also #494).

Comment on lines +133 to +140
def get_position_embeddings_for(
    self, hidden_states: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Delegate to the model's actual Qwen3VLTextRotaryEmbedding so that
    # MRoPE frequencies are split correctly by mrope_section.
    S = hidden_states.size(1)
    position_ids = torch.arange(S, device=hidden_states.device).unsqueeze(0)
    return self.rotary_emb(hidden_states, position_ids)
Contributor
Note for me

When tico.convert is called, it uses torch.export.export to capture the model as a static computation graph. At this point, the dynamic self.rotary_emb(hidden_states, position_ids) call will be executed with concrete inputs, resulting in static tensor values for cos and sin.

mhs4670go (Contributor)
@dvsav Thanks for the review! Actually, @dayo09 will be working in another department for a year. Would you be able to continue working on this PR?

dvsav (Contributor) commented Mar 18, 2026

> @dvsav Thanks for the review! Actually, @dayo09 will be working in another department for a year. Would you be able to continue working on this PR?

Hi @mhs4670go
Sure, I'll take over this PR. Thanks for letting me know.

dvsav (Contributor) commented Mar 24, 2026

> @dvsav Thanks for the review! Actually, @dayo09 will be working in another department for a year. Would you be able to continue working on this PR?

4 participants