Hi, thanks for releasing the code.
I have a question about the implementation of the two RoPE variants described in the paper: Identity-RoPE and Ge-RoPE.
According to the paper, Identity-RoPE seems to use mask/bounding-rectangle based local normalized coordinates for object regions, while Ge-RoPE uses displacement / flow and confidence to build geometry-aware warped positional encodings.
However, when checking the current public code, I could not locate where these two modules are actually implemented.
In diffsynth/models/wan_video_dit.py, the main RoPE frequencies appear to be constructed from the standard (f, h, w) token grid:
    freqs = torch.cat([
        self.freqs[0][:f]...
        self.freqs[1][:h]...
        self.freqs[2][:w]...
    ], dim=-1).reshape(f * h * w, 1, -1)
Then the block is called as:
    x = block(x, context, t_mod, freqs, vggt_tensor=kwargs.get("vggt_tensor", None))
So it looks like flow_tensor is not passed into the DiT blocks here.
Inside DiTBlock.forward, I noticed the signature includes:
    def forward(self, x, context, t_mod, freqs, vggt_tensor=None, freqs_vggt=None, flow_tensor=None):
but I could not find where flow_tensor is used to generate warped RoPE frequencies, or where the mask/bbox-based Identity-RoPE positional remapping is computed.
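For reference, this is how I read that frequency construction, written as a plain-Python sketch (no torch) so the arithmetic is explicit. The helper names, the per-axis split `dims`, and `theta` are placeholders I picked for illustration. I have not tried to reproduce the elided parts of the snippet above.

```python
def grid_positions(f, h, w):
    """(frame, row, col) index of every token, in the flattened f*h*w order."""
    return [(fi, hi, wi) for fi in range(f) for hi in range(h) for wi in range(w)]


def rope_angles(pos, dim, theta=10000.0):
    """Standard RoPE rotation angles for one scalar position (dim // 2 pairs)."""
    return [pos / (theta ** (2 * k / dim)) for k in range(dim // 2)]


def token_angles(f, h, w, dims=(4, 6, 6)):
    """Concatenate frame, row, and column angles per token.

    `dims` is a hypothetical per-axis channel split, not the model's real one.
    """
    angles = []
    for fi, hi, wi in grid_positions(f, h, w):
        angles.append(
            rope_angles(fi, dims[0])
            + rope_angles(hi, dims[1])
            + rope_angles(wi, dims[2])
        )
    return angles


angles = token_angles(f=2, h=3, w=4)
# Flat index of token (1, 2, 3) is (1 * 3 + 2) * 4 + 3 = 23.
last = angles[23]
```

If this reading is right, each token's angles depend only on its integer (f, h, w) index, which is why I could not see where a mask or a flow field would enter.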
I also saw that VGGT tokens are appended to the latent tokens before self-attention, which is clear. My question is specifically about the RoPE variants:
- Where is Identity-RoPE implemented in the released code?
  - Is there a mask → bbox → local normalized coordinate → RoPE frequency remapping step?
  - If yes, could you point to the relevant file/function?
- Where is Ge-RoPE implemented?
  - Is flow_tensor used to warp the spatial grid / positional frequencies?
  - If yes, where is the displacement resized, normalized by patch size, smoothed, and added to the original grid?
- How are Identity-RoPE and Ge-RoPE combined?
  - Does Identity-RoPE define the base positional coordinates and Ge-RoPE further warps them?
  - Or are they used in separate attention heads / separate frequency channels?
  - If object-local normalized coordinates are used for Identity-RoPE, how are they made compatible with the standard token-grid coordinates used by Ge-RoPE?
- Is the current public inference code a simplified version that only uses VGGT token concatenation, while the full Identity-RoPE / Ge-RoPE implementation is in the unreleased training code?
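To make the Identity-RoPE and Ge-RoPE questions concrete, here is a rough sketch of the coordinate steps I expected to find, based only on my reading of the paper. It is plain Python rather than torch, and every function name is hypothetical. The 3x3 box filter and the way confidence scales the displacement are my own guesses, not something I found in the paper or the code.

```python
def bbox_from_mask(mask):
    """Tight inclusive bounding box (top, left, bottom, right) of a binary mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None
    return rows[0], cols[0], rows[-1], cols[-1]


def local_normalized_coords(mask):
    """Identity-RoPE guess: (y, x) in [0, 1] inside the bbox, None elsewhere."""
    h, w = len(mask), len(mask[0])
    coords = [[None] * w for _ in range(h)]
    box = bbox_from_mask(mask)
    if box is None:
        return coords
    top, left, bottom, right = box
    span_y, span_x = max(bottom - top, 1), max(right - left, 1)
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                coords[r][c] = ((r - top) / span_y, (c - left) / span_x)
    return coords


def warped_grid(flow, conf, patch):
    """Ge-RoPE guess: token-grid (y, x) positions displaced by the flow.

    flow: H x W list of (dy, dx) in pixels; conf: H x W values in [0, 1].
    """
    h, w = len(flow) // patch, len(flow[0]) // patch
    n = patch * patch
    # Resize to the token grid by averaging each patch (confidence-weighted,
    # an assumption), then divide by the patch size to get token units.
    disp = [[(0.0, 0.0)] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            dy = dx = 0.0
            for i in range(r * patch, (r + 1) * patch):
                for j in range(c * patch, (c + 1) * patch):
                    dy += conf[i][j] * flow[i][j][0]
                    dx += conf[i][j] * flow[i][j][1]
            disp[r][c] = (dy / n / patch, dx / n / patch)
    # Smooth with a 3x3 box filter (assumed), then add to the integer grid.
    grid = [[None] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            nb = [disp[i][j]
                  for i in range(max(r - 1, 0), min(r + 2, h))
                  for j in range(max(c - 1, 0), min(c + 2, w))]
            sy = sum(d[0] for d in nb) / len(nb)
            sx = sum(d[1] for d in nb) / len(nb)
            grid[r][c] = (r + sy, c + sx)
    return grid


mask = [[0, 0, 0, 0],
        [0, 1, 1, 1],
        [0, 1, 1, 1]]
local = local_normalized_coords(mask)

# A uniform 2-pixel shift to the right on a 4x4 image with 2x2 patches.
flow = [[(0.0, 2.0)] * 4 for _ in range(4)]
conf = [[1.0] * 4 for _ in range(4)]
grid = warped_grid(flow, conf, patch=2)
```

The two functions produce coordinates in different ranges (object-local values in [0, 1] versus token-grid units), which is exactly the compatibility question above.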
Thanks in advance. I may have missed the relevant implementation, so a pointer to the exact file/function would be very helpful.