This VL-JEPA implementation takes direct inspiration from the original VL-JEPA paper:
- X-Encoder (video encoder): a frozen DINOv3 ViT-S model
- QueryEmbedding: the embedding layer of google/gemma-3-270m-it
- Predictor: also google/gemma-3-270m-it, where I took the last 4 layers of the model (the first and last layers), without the embedding layer
- Y-Encoder: the same EmbeddingGemma model from the paper, which I have also frozen
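Freezing the two pretrained encoders boils down to disabling gradients on their parameters. A minimal sketch of that pattern, using a small stand-in module instead of the real DINOv3/EmbeddingGemma checkpoints (which would be loaded from their own repos):

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained encoder (DINOv3 or EmbeddingGemma);
# in the real setup this would be the loaded checkpoint.
encoder = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 32))

# Freeze: no gradients flow into the encoder's weights,
# so only the predictor and projection layers are trained.
encoder.requires_grad_(False)
encoder.eval()

frozen = all(not p.requires_grad for p in encoder.parameters())
print(frozen)  # True
```

With the encoders frozen, the optimizer only needs to be given the predictor's and projection layers' parameters.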
The goal is to train a good predictor for the two pretrained models (DINOv3 & EmbeddingGemma)
- I assume the pretrained models already have a unique representation of the world
- Each model has a different internal representation
- [x] I used projection layers to help map from one model's representation to the other's
- [x] This also comes in handy to map the models' different output dimensions [D] to each other
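Such a projection layer can be as simple as a single linear map between the two embedding spaces. A sketch, where the dimensions are assumptions (DINOv3 ViT-S patch tokens are 384-d; 640 is used here as the Gemma-3-270M hidden size, so check the actual model config):

```python
import torch
import torch.nn as nn

# Assumed dims: DINOv3 ViT-S -> 384, Gemma-3-270M hidden size -> 640.
D_VIDEO, D_TEXT = 384, 640

# Linear projection from the video encoder's representation space
# into the predictor's (text model's) hidden space.
video_to_text = nn.Linear(D_VIDEO, D_TEXT)

tokens = torch.randn(2, 196, D_VIDEO)   # [batch, patches, D_VIDEO]
projected = video_to_text(tokens)
print(projected.shape)  # torch.Size([2, 196, 640])
```

A deeper MLP could be used instead, but a single linear layer keeps the frozen encoders doing most of the representational work.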
```bash
mkdir -p ./msrvtt_videos

# Download the full zip (~8-10 GB compressed)
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip -P ./msrvtt_videos/
unzip ./msrvtt_videos/MSRVTT.zip -d ./msrvtt_videos/
```
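After extraction, a quick sanity check is to count the clips on disk. A small helper, assuming the videos are `.mp4` files somewhere under the extraction directory (the exact subfolder layout inside the zip may differ):

```python
from pathlib import Path

def count_videos(root: str) -> int:
    """Recursively count extracted .mp4 clips under the dataset directory."""
    return sum(1 for _ in Path(root).rglob("*.mp4"))

# e.g. after unzipping:
# print(count_videos("./msrvtt_videos"))
```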