xFormers not available → model imports OK, but infer_image returns all-zero depth map (min=0.0, max=0.0) #312

@Wasiq1123

Hi @heyoeyo, thanks for the help earlier. Uninstalling xFormers removed the import error, but the model now produces an all-zero depth map.

### Environment

  • OS: Ubuntu 22.04
  • Python: 3.10
  • Device: CPU only (no CUDA; torch.device -> 'cpu')
  • xFormers: uninstalled (or disabled via XFORMERS_DISABLE_MEMORY_EFFICIENT_ATTENTION=1)
  • Repo version: Depth-Anything-V2 (local copy)

### Encoder / checkpoint

  • Encoder used: vits (I also tested vitb, vitg variants)
  • Checkpoint: metric_depth/checkpoints/depth_anything_v2_metric_hypersim_vits.pth (loaded with map_location='cpu')

### Minimal repro (my code)

```python
import cv2
import torch
from depth_anything_v2.dpt import DepthAnythingV2

DEVICE = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'

model_configs = {
    'vits': {'encoder': 'vits', 'features': 64, 'out_channels': [48, 96, 192, 384]},
    'vitb': {'encoder': 'vitb', 'features': 128, 'out_channels': [96, 192, 384, 768]},
    'vitl': {'encoder': 'vitl', 'features': 256, 'out_channels': [256, 512, 1024, 1024]},
    'vitg': {'encoder': 'vitg', 'features': 384, 'out_channels': [1536, 1536, 1536, 1536]}
}

encoder = 'vits'
dataset = 'hypersim'
model = DepthAnythingV2(**model_configs[encoder])
checkpoint_path = f'/path/to/checkpoints/depth_anything_v2_metric_{dataset}_{encoder}.pth'
model.load_state_dict(torch.load(checkpoint_path, map_location='cpu'))
model = model.to(DEVICE).eval()

image_path = "/home/wasiq/Pictures/2m_Depth_Distance.jpeg"
raw_img = cv2.imread(image_path)
depth = model.infer_image(raw_img)
print(f"The minimum depth is {depth.min()}")
print(f"The maximum depth is {depth.max()}")
```

### Observed behavior

  • "xFormers not available" is printed twice in the logs
  • The minimum depth is 0.0
  • The maximum depth is 0.0

### Expected behavior

  • A non-constant depth map with varied values (for my test image, the ground-truth distance near a marked point is ~2.0 m)
  • depth.min() < depth.max(), with meaningful spatial variation
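The expectation above can be turned into a small check. This is only a sketch: `summarize_depth` and `DepthSummary` are hypothetical names (not part of the repo), and it assumes the depth map has been flattened into plain floats, for example with `depth.ravel().tolist()` if `infer_image` returns a NumPy array.

```python
import math
from typing import Iterable, NamedTuple


class DepthSummary(NamedTuple):
    minimum: float     # smallest finite value
    maximum: float     # largest finite value
    all_finite: bool   # False if any NaN or inf is present
    is_constant: bool  # True if max - min is within the tolerance


def summarize_depth(values: Iterable[float], tol: float = 1e-6) -> DepthSummary:
    """Summarize a flattened depth map (hypothetical helper)."""
    flat = [float(v) for v in values]
    if not flat:
        raise ValueError("depth map is empty")

    # Ignore NaN/inf when computing the range, but report that they exist.
    finite = [v for v in flat if math.isfinite(v)]
    all_finite = len(finite) == len(flat)
    if not finite:
        return DepthSummary(math.nan, math.nan, False, True)

    lo, hi = min(finite), max(finite)
    return DepthSummary(lo, hi, all_finite, (hi - lo) <= tol)
```

For the output reported here, `summarize_depth(...)` would flag the map as constant; a healthy prediction should give `is_constant == False` and `all_finite == True`.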

### Diagnostics I already tried

  • Confirmed that the checkpoint loads without raising an exception, but I haven't validated the state_dict key names yet.
  • Tried different encoder names (vits, vitb), each matched to its checkpoint filename.
  • Made sure xFormers was either uninstalled or disabled via XFORMERS_DISABLE_MEMORY_EFFICIENT_ATTENTION=1.
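The state_dict key validation that is still outstanding could look roughly like the sketch below. `compare_keys`, `strip_prefix` and `KeyReport` are hypothetical helpers, and the key names in the usage note are made up. The only prefix handled is `module.`, which `torch.nn.DataParallel` adds to parameter names.

```python
from typing import Iterable, NamedTuple


class KeyReport(NamedTuple):
    missing: list     # expected by the model, absent from the checkpoint
    unexpected: list  # present in the checkpoint, unknown to the model


def strip_prefix(key: str, prefix: str = "module.") -> str:
    """Drop a leading wrapper prefix such as the one DataParallel adds."""
    return key[len(prefix):] if key.startswith(prefix) else key


def compare_keys(model_keys: Iterable[str], checkpoint_keys: Iterable[str]) -> KeyReport:
    """Compare parameter names from a model against those in a checkpoint."""
    model_set = set(model_keys)
    ckpt_set = {strip_prefix(k) for k in checkpoint_keys}
    return KeyReport(
        missing=sorted(model_set - ckpt_set),
        unexpected=sorted(ckpt_set - model_set),
    )
```

With the repro above, it would be called as `compare_keys(model.state_dict().keys(), torch.load(checkpoint_path, map_location='cpu').keys())`. Note that `load_state_dict` defaults to `strict=True` and raises on missing or unexpected keys, so a load that completes without error already implies the names match. Matching names do not rule out differences in layers that have no parameters (activations, for example), so that may be worth checking separately.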
