DINOv2 or v3 - conceptual question

Hi, since DINOv2 and DINOv3 use the iBOT loss and learn the embeddings of masked patches from the surrounding patches, can we also consider them Joint Embedding Predictive Architectures (JEPAs)? Maybe we can think of the projection head used in the iBOT loss as their predictor, in which case they apply the predictor in both the student and the teacher. They also use two views, whereas I-JEPA and V-JEPA use the same image or video segment for both the student and the teacher.
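
To make the question concrete, here is a minimal sketch of how I understand the two kinds of objective. It is pure Python on toy lists, not the DINOv2, DINOv3, or I-JEPA code: the function names, the temperature values, and the data layout (one list of numbers per patch) are my own assumptions, and teacher centering is left out entirely.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of one patch's logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def ibot_style_loss(student_logits, teacher_logits, mask,
                    t_student=0.1, t_teacher=0.04):
    """Cross-entropy between teacher and student prototype distributions.

    Both inputs are assumed to be head outputs (one logit vector per patch),
    i.e. the head is applied on the student AND the teacher side.
    Only masked patches contribute. Temperatures are illustrative values.
    """
    per_patch = []
    for s, t, is_masked in zip(student_logits, teacher_logits, mask):
        if not is_masked:
            continue
        log_p_student = log_softmax(s, t_student)
        p_teacher = [math.exp(v) for v in log_softmax(t, t_teacher)]
        per_patch.append(-sum(p * lp for p, lp in zip(p_teacher, log_p_student)))
    return sum(per_patch) / len(per_patch) if per_patch else 0.0


def jepa_style_loss(predicted, target, mask):
    """Mean squared error between predictor outputs and target embeddings.

    Here the predictor is assumed to sit on the student (context) side only;
    `target` holds the raw teacher (target-encoder) embeddings.
    """
    per_patch = []
    for p, y, is_masked in zip(predicted, target, mask):
        if not is_masked:
            continue
        per_patch.append(sum((a - b) ** 2 for a, b in zip(p, y)) / len(p))
    return sum(per_patch) / len(per_patch) if per_patch else 0.0
```

In this sketch the first loss compares distributions over prototypes produced by a head on both branches, while the second regresses embeddings with a predictor on one branch only. That asymmetry is what my question is about.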

DINOv2 or v3 - conceptual question #158
