Hi, since DINOv2 and DINOv3 use the iBOT loss and learn the embeddings of masked patches from the surrounding patches, can we also consider them Joint Embedding Predictive Architectures (JEPAs)? Maybe we can think of the linear head used in the iBOT loss as their predictor, in which case they apply the predictor in both the student and the teacher. They also use two views, whereas I-JEPA and V-JEPA use the same image or video segment for both the student and the teacher.
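To make the comparison concrete, here is a rough sketch of the two masked-patch objectives as I understand them. It is not taken from either codebase. The function names, the temperatures, and the use of squared error for the JEPA-style term are my own assumptions, and teacher centering is left out for brevity. The iBOT-style term compares prototype distributions produced by a head on both the student and the teacher, while the JEPA-style term regresses the teacher embeddings from the output of a predictor that sits on the student side only.

```python
import math

def log_softmax(logits, temperature):
    # Numerically stable log-softmax over one patch's prototype logits.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def ibot_style_patch_loss(student_logits, teacher_logits, mask,
                          t_student=0.1, t_teacher=0.04):
    # Hypothetical helper: cross-entropy between teacher and student
    # prototype distributions, averaged over masked patches only.
    # Temperatures are illustrative; teacher centering is omitted.
    losses = []
    for s, t, masked in zip(student_logits, teacher_logits, mask):
        if not masked:
            continue
        p_teacher = [math.exp(v) for v in log_softmax(t, t_teacher)]
        log_p_student = log_softmax(s, t_student)
        losses.append(-sum(p * lp for p, lp in zip(p_teacher, log_p_student)))
    return sum(losses) / len(losses) if losses else 0.0

def jepa_style_patch_loss(predicted, target, mask):
    # Hypothetical helper: regression in embedding space on masked patches.
    # `predicted` stands for the output of a student-side predictor and
    # `target` for teacher embeddings; squared error is an assumption here.
    losses = []
    for p, t, masked in zip(predicted, target, mask):
        if not masked:
            continue
        losses.append(sum((a - b) ** 2 for a, b in zip(p, t)) / len(p))
    return sum(losses) / len(losses) if losses else 0.0

# Toy example: three patches, the first two masked for the student.
mask = [True, True, False]
teacher = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
student = [[2.0, 0.0], [0.0, 2.0], [5.0, -5.0]]
print(ibot_style_patch_loss(student, teacher, mask))
print(jepa_style_patch_loss(student, teacher, mask))
```

In both cases only the masked positions contribute, so the difference I am asking about is mainly where the extra head or predictor sits and whether the target is a distribution over prototypes or the embedding itself.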