This VL-JEPA implementation takes direct inspiration from the original VL-JEPA paper:
- X-Encoder (video encoder): a frozen DINOv3 ViT-S model
- QueryEmbedding: the embedding layer of google/gemma-3-270m-it
- Predictor: also google/gemma-3-270m-it, where I took the last 4 layers of the model (the first and last layers), without the embedding layer
- Y-Encoder: the same EmbeddingGemma model from the paper, which I have also frozen
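Freezing the two pretrained encoders boils down to disabling gradients on their parameters. A minimal sketch of that pattern, using a small stand-in module instead of the real DINOv3/EmbeddingGemma checkpoints (which would be loaded from their own repos):

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained encoder (DINOv3 or EmbeddingGemma);
# in the real setup this would be the loaded checkpoint.
encoder = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 32))

# Freeze: no gradients flow into the encoder's weights,
# so only the predictor and projection layers are trained.
encoder.requires_grad_(False)
encoder.eval()

frozen = all(not p.requires_grad for p in encoder.parameters())
print(frozen)  # True
```

With the encoders frozen, the optimizer only needs to be given the predictor's and projection layers' parameters.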
The goal is to train a good predictor for the two pretrained models (DINOv3 & EmbeddingGemma)
- I assume the pretrained models already have a unique representation of the world
- Each model has a different internal representation
- [x] I used projection layers to help map from one model's representation to the other's
- [x] This also comes in handy to map the models' different output dimensions [D] to each other
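Such a projection layer can be as simple as a single linear map between the two embedding spaces. A sketch, where the dimensions are assumptions (DINOv3 ViT-S patch tokens are 384-d; 640 is used here as the Gemma-3-270M hidden size, so check the actual model config):

```python
import torch
import torch.nn as nn

# Assumed dims: DINOv3 ViT-S -> 384, Gemma-3-270M hidden size -> 640.
D_VIDEO, D_TEXT = 384, 640

# Linear projection from the video encoder's representation space
# into the predictor's (text model's) hidden space.
video_to_text = nn.Linear(D_VIDEO, D_TEXT)

tokens = torch.randn(2, 196, D_VIDEO)   # [batch, patches, D_VIDEO]
projected = video_to_text(tokens)
print(projected.shape)  # torch.Size([2, 196, 640])
```

A deeper MLP could be used instead, but a single linear layer keeps the frozen encoders doing most of the representational work.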
```bash
mkdir -p ./msrvtt_videos

# Download the full zip (~8-10 GB compressed)
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip -P ./msrvtt_videos/
unzip ./msrvtt_videos/MSRVTT.zip -d ./msrvtt_videos/
```
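After extraction, a quick sanity check is to count the clips on disk. A small helper, assuming the videos are `.mp4` files somewhere under the extraction directory (the exact subfolder layout inside the zip may differ):

```python
from pathlib import Path

def count_videos(root: str) -> int:
    """Recursively count extracted .mp4 clips under the dataset directory."""
    return sum(1 for _ in Path(root).rglob("*.mp4"))

# e.g. after unzipping:
# print(count_videos("./msrvtt_videos"))
```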