Hi V-JEPA team,
Thanks for the great paper and release.
I’m especially interested in the navigation results in:
- Section 3.4: Navigation Planning
- Figure 9
- Table 7: Open Loop Navigation Planning
Could you clarify or release the exact setup used for these results?
The main points that seem unclear are:
- Frozen or finetuned encoder?
In the first paragraph of Section 3 (Results), the paper says V-JEPA 2.1 is used as a frozen encoder, but the Table 7 caption says “we finetune V-JEPA 2.1 on robot navigation datasets”.
- Exact world model recipe
Section 3.4 says you train a CDiT on top of V-JEPA 2.1, predict clean representations instead of noise, and use DDIM.
Could you share the exact config / architecture changes relative to NWM?
- Reproducibility
If possible, could you release the config / checkpoint / eval code for the navigation experiment?
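
To make the world model question concrete, here is a minimal sketch of how I currently read "predict clean representations instead of noise" combined with DDIM. This is my own assumption, not code from the paper or from NWM: the names `ddim_step_x0`, `ddim_sample`, and `denoiser` are hypothetical, the noise schedule is a generic linear-beta one, and the action / context-frame conditioning of the CDiT is left out entirely. It only shows a deterministic DDIM update (eta = 0) where the network output is treated as the clean sample and the noise is derived from it.

```python
import numpy as np


def ddim_step_x0(x_t, x0_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0) for a clean-sample-predicting network."""
    # Recover the implied noise from the current sample and the predicted clean sample.
    eps = (x_t - np.sqrt(abar_t) * x0_pred) / np.sqrt(1.0 - abar_t)
    # Re-noise the predicted clean sample to the previous (less noisy) level.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps


def ddim_sample(denoiser, shape, abar, steps, rng):
    """Run `steps` DDIM updates from pure noise.

    `denoiser(x, t)` is a stand-in for the world model and must return a
    prediction of the clean representation. `abar` holds the cumulative
    alpha products, with every entry strictly below 1.
    """
    ts = np.linspace(len(abar) - 1, 0, steps).round().astype(int)
    x = rng.standard_normal(shape)
    for i, t in enumerate(ts):
        x0_pred = denoiser(x, t)
        # After the last step there is no noise left, so abar_prev = 1.
        abar_prev = abar[ts[i + 1]] if i + 1 < len(ts) else 1.0
        x = ddim_step_x0(x, x0_pred, abar[t], abar_prev)
    return x


# Generic linear-beta schedule (an assumption, not the paper's schedule).
betas = np.linspace(1e-4, 0.02, 1000)
abar = np.cumprod(1.0 - betas)
```

If the actual recipe differs from this (for example a different parameterization, schedule, or number of DDIM steps), those are exactly the details I am hoping the config would settle.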
This part of the paper is very interesting, and having the exact setup would make reproduction much easier.
Thanks again.