[VJEPA2.1] Crop size mismatch between uploaded checkpoint (384) and cooldown config (256) + NaN loss during continuous pretraining #163

@giulioalfa

Environment

  • Repository: facebookresearch/vjepa2
  • Model: VJEPA2.1 (ViT-L)
  • Hardware: 2× H200 NVL GPUs
  • Dataset (test): ~22k videos (custom domain)

Bug: crop size mismatch between checkpoint and cooldown config

The checkpoint uploaded by Meta for VJEPA2.1 was trained with a crop size of 384, but the accompanying cooldown config file specifies a crop size of 256.

Affected files

  • Checkpoint: trained and saved at resolution 384×384
  • Cooldown config: crop_size: 256

Reproducing the two-stage pretraining of VJEPA2.1 with the provided configs may therefore not match the released checkpoint. In addition, if I load the 384-pretrained checkpoint and continue training with the cooldown config as-is, will this raise a positional embedding shape mismatch, or will it silently train at the wrong resolution?
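
To make the mismatch concrete, here is a minimal sketch of the token-grid arithmetic at the two resolutions. The patch size of 16, tubelet size of 2 and 16-frame clip are assumptions for illustration, not values read from the released configs, and `token_grid` / `num_tokens` are hypothetical helpers. Whether the difference surfaces as a shape error depends on how the model stores or computes its positional information, which I have not verified.

```python
from __future__ import annotations


def token_grid(
    crop_size: int,
    num_frames: int,
    patch_size: int = 16,   # assumed, not taken from the repo configs
    tubelet_size: int = 2,  # assumed, not taken from the repo configs
) -> tuple[int, int, int]:
    """Return the (time, height, width) token grid for one video clip."""
    if crop_size % patch_size != 0:
        raise ValueError("crop_size must be divisible by patch_size")
    if num_frames % tubelet_size != 0:
        raise ValueError("num_frames must be divisible by tubelet_size")
    side = crop_size // patch_size
    return (num_frames // tubelet_size, side, side)


def num_tokens(crop_size: int, num_frames: int) -> int:
    """Total number of tokens the encoder sees for one clip."""
    t, h, w = token_grid(crop_size, num_frames)
    return t * h * w


if __name__ == "__main__":
    for crop in (384, 256):
        print(crop, token_grid(crop, num_frames=16), num_tokens(crop, num_frames=16))
```

Under these assumptions the spatial grid is 24×24 at crop size 384 and 16×16 at crop size 256, so the two settings differ by a factor of 2.25 in sequence length.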

Follow-up: NaN loss during continuous pretraining of VJEPA2.1

Context

I previously ran continuous pretraining of VJEPA2 (ViT-L) on a custom dataset. A linear probe on my domain-pretrained VJEPA2 checkpoint was more accurate than the same probe on the base VJEPA2 checkpoint. Given this result, I wanted to replicate the approach with VJEPA2.1, hoping for an even stronger encoder baseline. Unfortunately, the run diverges immediately.

Observed behaviour

  • Loss: NaN after the first few iterations.
  • Dataset size (test run): ~22k videos.
  • Target dataset size (full run): ~500k videos.

What I have tried

  • Checked for inf / NaN in the input tensors before the forward pass.
  • Lowered the learning rate significantly.
  • Suspected the crop size mismatch as a potential cause (positional embedding interpolation artefacts).
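
Checking the inputs alone does not say where the first non-finite value appears. Below is a framework-agnostic sketch of the kind of check that can localise it, assuming the values of interest (inputs, activations, loss, gradients) have been flattened to plain Python floats. `count_nonfinite` and `first_bad_stage` are hypothetical helpers, not part of the vjepa2 codebase. On PyTorch tensors the equivalent test is `torch.isfinite(t).all()`.

```python
import math
from typing import Dict, Iterable, Optional


def count_nonfinite(values: Iterable[float]) -> int:
    """Count entries that are NaN, +inf or -inf."""
    return sum(1 for v in values if not math.isfinite(v))


def first_bad_stage(stages: Dict[str, Iterable[float]]) -> Optional[str]:
    """Return the first stage (in insertion order) containing a non-finite value.

    `stages` maps a label such as "inputs" or "loss" to the flattened values
    captured at that point of the training step. Returns None if all are finite.
    """
    for name, values in stages.items():
        if count_nonfinite(values) > 0:
            return name
    return None


if __name__ == "__main__":
    # Toy values only: inputs are clean, the activations already overflow.
    step = {
        "inputs": [0.1, -0.3, 0.7],
        "activations": [1.5, float("inf")],
        "loss": [float("nan")],
    }
    print(first_bad_stage(step))
```

If the first bad stage is already the forward activations, the learning rate is less likely to be the cause than the loaded weights or the numeric precision.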

Questions

  • Are the pretrain and cooldown config files intended to be used with the released checkpoint or as a standalone stage-2 config from scratch? If the former, can the crop size be corrected to 384?
  • Are there known instability issues when continuously pretraining VJEPA2.1 on small datasets (~22k videos)?
  • Is there a recommended starting LR / warmup schedule for continuous pretraining with the released VJEPA2.1 checkpoint?
  • Does the crop size mismatch (384 checkpoint → 256 config) cause positional embedding interpolation that could lead to NaN gradients?
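
For the LR / warmup question, this is a minimal sketch of the schedule shape in question: linear warmup followed by cosine decay. It is a generic formulation, not necessarily the scheduler implemented in the repo. `lr_at` is a hypothetical helper, and the numbers in the usage lines are placeholders rather than recommended values.

```python
import math


def lr_at(
    step: int,
    total_steps: int,
    warmup_steps: int,
    peak_lr: float,
    start_lr: float = 0.0,
    final_lr: float = 0.0,
) -> float:
    """Learning rate at `step`: linear warmup to peak_lr, then cosine decay to final_lr."""
    if step < warmup_steps:
        # Linear ramp from start_lr to peak_lr.
        return start_lr + (peak_lr - start_lr) * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Placeholder values for illustration only.
    for s in (0, 50, 100, 550, 1000):
        print(s, lr_at(s, total_steps=1000, warmup_steps=100, peak_lr=1e-4))
```

A longer warmup keeps the early updates small, which is the part of the schedule most relevant to a run that diverges within the first few iterations.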

Additional info

The small-dataset test run (~22k videos) is intentional: I want to validate the approach before scaling to the full ~500k video corpus.
