Question about Identity-RoPE / Ge-RoPE implementation and their interaction

Hi, thanks for releasing the code.

I have a question about the implementation of the two RoPE variants described in the paper: Identity-RoPE and Ge-RoPE.

According to the paper, Identity-RoPE seems to use local normalized coordinates derived from each object's mask and bounding rectangle, while Ge-RoPE uses displacement (flow) and confidence to build geometry-aware warped positional encodings.

However, when checking the current public code, I could not locate where these two modules are actually implemented.

In `diffsynth/models/wan_video_dit.py`, the main RoPE frequencies appear to be constructed from the standard `(f, h, w)` token grid:

```python
freqs = torch.cat([
    self.freqs[0][:f]...
    self.freqs[1][:h]...
    self.freqs[2][:w]...
], dim=-1).reshape(f * h * w, 1, -1)
```
Then the block is called as:
```python
x = block(x, context, t_mod, freqs, vggt_tensor=kwargs.get("vggt_tensor", None))
```
So it looks like `flow_tensor` is not passed into the DiT blocks here.

Inside `DiTBlock.forward`, I noticed the signature includes:
```python
def forward(self, x, context, t_mod, freqs, vggt_tensor=None, freqs_vggt=None, flow_tensor=None):
```
but I could not find where `flow_tensor` is used to generate warped RoPE frequencies, or where the mask/bbox-based Identity-RoPE positional remapping is computed.
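To make concrete what I was looking for, here is a minimal sketch of the kind of flow-based grid warp I expected from my reading of the paper. It is not taken from the repository: the helper name `warp_token_grid`, the `(dx, dy)` channel layout of the flow, the `patch_size` argument, and the average-pool smoothing are all my own assumptions.

```python
import torch
import torch.nn.functional as F


def warp_token_grid(flow, conf, h, w, patch_size, smooth_kernel=3):
    """Hypothetical helper: warped (y, x) token coordinates of shape (h, w, 2).

    flow: (2, H, W) pixel displacement; channel 0 = dx, channel 1 = dy (assumed).
    conf: (1, H, W) confidence in [0, 1].
    smooth_kernel: odd kernel size for the (assumed) average-pool smoothing.
    """
    # Resize displacement and confidence to the token grid.
    flow_tok = F.interpolate(flow[None], size=(h, w), mode="bilinear", align_corners=False)
    conf_tok = F.interpolate(conf[None], size=(h, w), mode="bilinear", align_corners=False)

    # Convert pixel displacement to token units.
    flow_tok = flow_tok / patch_size

    # Smooth the displacement field; output keeps the (h, w) size.
    flow_tok = F.avg_pool2d(
        flow_tok, smooth_kernel, stride=1,
        padding=smooth_kernel // 2, count_include_pad=False,
    )

    # Down-weight unreliable displacements, then drop the batch dim: (2, h, w).
    flow_tok = (flow_tok * conf_tok)[0]

    # Standard integer token grid, as float so it can hold fractional positions.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=flow.dtype),
        torch.arange(w, dtype=flow.dtype),
        indexing="ij",
    )
    return torch.stack([ys + flow_tok[1], xs + flow_tok[0]], dim=-1)
```

With zero flow this reduces to the standard integer `(h, w)` grid. Since the released code appears to slice precomputed frequency tables by integer index, fractional positions like these would need the rotary angles evaluated directly from the coordinates, and that is the step I could not find.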

I also saw that VGGT tokens are appended to the latent tokens before self-attention, which is clear. My question is specifically about the RoPE variants:

1. Where is Identity-RoPE implemented in the released code?
   - Is there a mask → bbox → local normalized coordinate → RoPE frequency remapping step?
   - If yes, could you point to the relevant file/function?
2. Where is Ge-RoPE implemented?
   - Is `flow_tensor` used to warp the spatial grid / positional frequencies?
   - If yes, where is the displacement resized, normalized by patch size, smoothed, and added to the original grid?
3. How are Identity-RoPE and Ge-RoPE combined?
   - Does Identity-RoPE define the base positional coordinates and Ge-RoPE further warps them?
   - Or are they used in separate attention heads / separate frequency channels?
   - If object-local normalized coordinates are used for Identity-RoPE, how are they made compatible with the standard token-grid coordinates used by Ge-RoPE?
4. Is the current public inference code a simplified version that only uses VGGT token concatenation, while the full Identity-RoPE / Ge-RoPE implementation is in the unreleased training code?
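
For question 1, this is roughly the mask → bbox → local normalized coordinate step I had in mind. It is only a sketch of my understanding, not code from the repository: `bbox_local_coords` is a hypothetical name, and I am assuming a single boolean mask already at token resolution.

```python
import torch


def bbox_local_coords(mask):
    """Hypothetical helper. mask: (h, w) bool tensor at token resolution.

    Returns:
        coords: (h, w, 2) float tensor of (y, x) normalized to the mask's
                bounding rectangle (0 at the top/left edge, 1 at the bottom/right).
        inside: (h, w) bool tensor, True for tokens inside the bounding rectangle.
    """
    h, w = mask.shape
    ys, xs = torch.nonzero(mask, as_tuple=True)
    if ys.numel() == 0:
        raise ValueError("mask is empty; no bounding rectangle to normalize against")

    # Bounding rectangle of the mask, in token indices (inclusive).
    y0, y1 = ys.min().item(), ys.max().item()
    x0, x1 = xs.min().item(), xs.max().item()

    gy, gx = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )

    # max(..., 1) avoids division by zero for a one-token-wide rectangle.
    ly = (gy - y0) / max(y1 - y0, 1)
    lx = (gx - x0) / max(x1 - x0, 1)

    inside = (gy >= y0) & (gy <= y1) & (gx >= x0) & (gx <= x1)
    return torch.stack([ly, lx], dim=-1), inside
```

Coordinates like these live in `[0, 1]` inside the rectangle, whereas the token grid runs over `[0, h)` and `[0, w)`, which is why I am asking in question 3 how the two coordinate systems are reconciled.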

Thanks in advance. I may have missed the relevant implementation, so a pointer to the exact file/function would be very helpful.
