Why is RoPE not applied on T2V/T2A cross attention?

Hello, thanks for your great work!

While reading the code, I had a quick question. According to [the implementation here](https://github.com/character-ai/Ovi/blob/main/ovi/modules/fusion.py#L103), the cross-attention between the audio/video tokens and the text does not call `rope_apply` on `q` and `k`. Why is RoPE not applied here?
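
To make the question concrete, here is a minimal pure-Python sketch of the difference as I understand it. The helper names (`rope_rotate`, `self_attn_score`, `cross_attn_score`) are hypothetical and not taken from the Ovi code, and the rotation uses the standard pairwise RoPE form, which may differ in layout from what `rope_apply` actually does.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles.

    Assumption: textbook RoPE on (even, odd) pairs; not Ovi's `rope_apply`.
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # one frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attn_score(q, k, q_pos, k_pos):
    # Both q and k are rotated, so the score depends only on q_pos - k_pos.
    return dot(rope_rotate(q, q_pos), rope_rotate(k, k_pos))

def cross_attn_score(q, k_text):
    # My reading of the linked cross-attention: no rotation on either side,
    # so the score carries no positional term from RoPE.
    return dot(q, k_text)

q = [0.5, -1.0, 0.25, 2.0]
k = [1.5, 0.5, -0.75, 1.0]
# Same relative offset (2) at different absolute positions gives the same score.
same_offset = math.isclose(self_attn_score(q, k, 3, 1),
                           self_attn_score(q, k, 7, 5))
```

In this sketch the self-attention score encodes the relative offset between two positions on a shared axis, while the cross-attention score is a plain dot product. My guess is that this is related to the text tokens not sharing a positional axis with the audio/video tokens, but I would appreciate confirmation.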