Hello,
Thank you for your excellent research.
I have a question regarding the implementation of Dora-VAE.
Stacking multiple layers generally helps extract richer features, yet I noticed that only a single cross-attention layer is used in the encoding process.
Is there a particular reason for this design choice?
Thank you in advance.