Hello,
Thank you for your excellent research.
I have a question regarding the implementation of Dora-VAE.
Stacking multiple layers generally helps extract richer features, yet I noticed that only a single cross-attention layer is used in the encoding process.
Is there a particular reason for this design choice?
Thank you in advance.