
Question about Identity-RoPE / Ge-RoPE implementation and their interaction #3

@PengJingchao

Description

Hi, thanks for releasing the code.

I have a question about the implementation of the two RoPE variants described in the paper: Identity-RoPE and Ge-RoPE.

According to the paper, Identity-RoPE appears to derive local normalized coordinates for object regions from masks/bounding rectangles, while Ge-RoPE uses displacement/flow and confidence to build geometry-aware, warped positional encodings.
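To make the question concrete, here is how I currently imagine the mask → bbox → local-normalized-coordinate step of Identity-RoPE. This is purely my own sketch of my reading of the paper; the function name and every detail below are my guesses, not code from this repo:

```python
import torch

def identity_rope_coords(mask: torch.Tensor) -> torch.Tensor:
    """Sketch (my guess, not repo code): map an object mask (H, W)
    to local coordinates normalized inside its bounding rectangle,
    so every object shares the same canonical [0, 1] coordinate frame."""
    ys, xs = torch.nonzero(mask, as_tuple=True)
    y0, y1 = ys.min(), ys.max()
    x0, x1 = xs.min(), xs.max()
    H, W = mask.shape
    grid_y, grid_x = torch.meshgrid(
        torch.arange(H), torch.arange(W), indexing="ij"
    )
    # Coordinates relative to the bbox, normalized by its extent
    local_y = (grid_y - y0).float() / max((y1 - y0).item(), 1)
    local_x = (grid_x - x0).float() / max((x1 - x0).item(), 1)
    coords = torch.stack([local_y, local_x], dim=-1)  # (H, W, 2)
    return coords * mask.unsqueeze(-1)  # zero outside the object
```

Is something along these lines happening anywhere before the RoPE frequencies are computed?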

However, when checking the current public code, I could not locate where these two modules are actually implemented.

In diffsynth/models/wan_video_dit.py, the main RoPE frequencies appear to be constructed from the standard (f, h, w) token grid:

freqs = torch.cat([
    self.freqs[0][:f]...
    self.freqs[1][:h]...
    self.freqs[2][:w]...
], dim=-1).reshape(f * h * w, 1, -1)
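(For context, my reading of this snippet is the standard axial 3D construction: one frequency table per axis, broadcast over the (f, h, w) grid and concatenated on the channel dimension. A self-contained reconstruction — my own paraphrase with a made-up per-axis dimension, not the repo's exact code — would be:)

```python
import torch

def axial_3d_freqs(f: int, h: int, w: int, dim: int = 16) -> torch.Tensor:
    """My reconstruction of standard axial 3D RoPE frequency assembly
    (not the repo's exact code). `dim` is the per-axis rotary
    sub-dimension, chosen arbitrarily here."""
    def table(n: int, d: int) -> torch.Tensor:
        inv = 1.0 / (10000 ** (torch.arange(0, d, 2).float() / d))
        return torch.outer(torch.arange(n).float(), inv)  # (n, d // 2)
    tf, th, tw = table(f, dim), table(h, dim), table(w, dim)
    # Broadcast each axis table over the full (f, h, w) token grid
    return torch.cat([
        tf.view(f, 1, 1, -1).expand(f, h, w, -1),
        th.view(1, h, 1, -1).expand(f, h, w, -1),
        tw.view(1, 1, w, -1).expand(f, h, w, -1),
    ], dim=-1).reshape(f * h * w, 1, -1)
```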

Then the block is called as:

x = block(x, context, t_mod, freqs, vggt_tensor=kwargs.get("vggt_tensor", None))

So it looks like flow_tensor is not passed into the DiT blocks here.

Inside DiTBlock.forward, I noticed the signature includes:

def forward(self, x, context, t_mod, freqs, vggt_tensor=None, freqs_vggt=None, flow_tensor=None):

but I could not find where flow_tensor is used to generate warped RoPE frequencies, or where the mask/bbox-based Identity-RoPE positional remapping is computed.

I also saw that VGGT tokens are appended to the latent tokens before self-attention, which is clear. My question is specifically about the RoPE variants:

  1. Where is Identity-RoPE implemented in the released code?
  • Is there a mask → bbox → local normalized coordinate → RoPE frequency remapping step?
  • If yes, could you point to the relevant file/function?
  2. Where is Ge-RoPE implemented?
  • Is flow_tensor used to warp the spatial grid / positional frequencies?
  • If yes, where is the displacement resized, normalized by patch size, smoothed, and added to the original grid?
  3. How are Identity-RoPE and Ge-RoPE combined?
  • Does Identity-RoPE define the base positional coordinates, with Ge-RoPE further warping them?
  • Or are they used in separate attention heads / separate frequency channels?
  • If object-local normalized coordinates are used for Identity-RoPE, how are they made compatible with the standard token-grid coordinates used by Ge-RoPE?
  4. Is the current public inference code a simplified version that only uses VGGT token concatenation, with the full Identity-RoPE / Ge-RoPE implementation kept in the unreleased training code?
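Regarding question 2, this is the kind of flow-warping I expected to find somewhere in the blocks — again purely my own sketch under assumptions from the paper (the names, the (dx, dy) channel order, and the smoothing choice are all mine):

```python
import torch
import torch.nn.functional as F

def gerope_warped_grid(flow: torch.Tensor, h: int, w: int,
                       patch_size: int = 2) -> torch.Tensor:
    """My sketch of Ge-RoPE warping (not repo code): resize the flow
    to the token grid, convert pixel displacement to token units,
    smooth it, and add it to the base positional grid.
    flow: (2, H, W) pixel displacement, assumed (dx, dy) order."""
    flow_tok = F.interpolate(flow.unsqueeze(0), size=(h, w),
                             mode="bilinear", align_corners=False)[0]
    flow_tok = flow_tok / patch_size  # pixels -> token units
    # rough smoothing to suppress flow noise (3x3 average; my choice)
    flow_tok = F.avg_pool2d(flow_tok.unsqueeze(0), 3, stride=1,
                            padding=1)[0]
    gy, gx = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32),
                            indexing="ij")
    # warped continuous positions that would feed the RoPE frequencies
    return torch.stack([gy + flow_tok[1], gx + flow_tok[0]], dim=-1)
```

Is something like this present, or is the warping meant to happen elsewhere (e.g. inside the attention computation)?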

Thanks in advance. I may have missed the relevant implementation, so a pointer to the exact file/function would be very helpful.
