Loaders anisotropically stretch non-square images, producing out-of-distribution K for wide-aspect cameras #6

@zzhang001

Summary

Every loader (scannet_loader, ca_loader, aria_loader) and the reference run_boxer.py resize input images to boxernet.hw × boxernet.hw with an anisotropic stretch and scale K to match: fx *= hw/orig_w and fy *= hw/orig_h. When the input image is not already square, this produces an anisotropic K. For a camera with square pixels (native fx ≈ fy), the result is fy/fx = orig_w/orig_h.

For ScanNet (1296×968, aspect 1.34), the resulting anisotropy (fy/fx ≈ 1.34) is within BoxerNet's training distribution and works fine. For wider cameras, notably modern phones (the iPhone main rear camera records 1920×1080, aspect 1.78), the resulting K has fy/fx ≈ 1.78, about 1.33× the ScanNet ratio and noticeably outside the anisotropy range BoxerNet saw during training.
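The ratios quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the loaders' actual code: `stretch_intrinsics` and `anisotropy_after_stretch` are hypothetical helper names, `hw=960` and `f_native=1000.0` are placeholder values, and the camera is assumed to have square pixels at native resolution.

```python
def stretch_intrinsics(fx, fy, cx, cy, orig_w, orig_h, hw):
    """Scale pinhole intrinsics for an anisotropic resize to hw x hw."""
    sx = hw / orig_w  # horizontal scale factor
    sy = hw / orig_h  # vertical scale factor
    return fx * sx, fy * sy, cx * sx, cy * sy


def anisotropy_after_stretch(orig_w, orig_h, hw=960, f_native=1000.0):
    """Return fy/fx after the stretch, assuming native fx == fy == f_native."""
    fx, fy, _, _ = stretch_intrinsics(
        f_native, f_native, orig_w / 2, orig_h / 2, orig_w, orig_h, hw
    )
    return fy / fx


for name, (w, h) in {"ScanNet": (1296, 968), "iPhone 1080p": (1920, 1080)}.items():
    print(f"{name}: fy/fx = {anisotropy_after_stretch(w, h):.2f}")
# ScanNet: fy/fx = 1.34
# iPhone 1080p: fy/fx = 1.78
```

The ratio depends only on the source aspect ratio, so the chosen `hw` and native focal length cancel out.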

Why this matters

BoxerNet's 2D→3D lifting relies on K to turn pixel coordinates into rays in camera frame. A K that's more anisotropic than anything in training means the angular footprint BoxerNet infers from a 2D bbox is biased along the "stretched" axis. The effect is systematic: all boxes get a consistent direction/distance bias.
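To make the role of K concrete, here is a small sketch of the standard pinhole relations between pixels, rays, and angular extent. It only illustrates the geometry; how BoxerNet consumes K internally is not shown here, and the focal lengths below are hypothetical values chosen to give fy/fx ≈ 1.78.

```python
import math


def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Unit ray in the camera frame for pixel (u, v) under a pinhole model."""
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)


def angular_extent_deg(pixels, focal):
    """Angle subtended by a span of `pixels` centered on the principal point."""
    return math.degrees(2.0 * math.atan(0.5 * pixels / focal))


# Hypothetical stretched intrinsics with fy/fx of roughly 1.78.
fx, fy = 500.0, 889.0

# A bbox that is square in pixels (100 x 100) covers very different angles
# along the two axes once K is this anisotropic.
horizontal_deg = angular_extent_deg(100, fx)
vertical_deg = angular_extent_deg(100, fy)
print(f"horizontal: {horizontal_deg:.2f} deg, vertical: {vertical_deg:.2f} deg")
```

With an isotropic K the two angles would be equal, which is the regime the pad-to-square proposal aims to restore.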

Proposed fix

Add an optional pad-to-square preprocessing mode to the loaders:

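  1. Pad the original image with zeros (or the dataset's background color) to a square of side max(orig_h, orig_w), keeping the original image centered.
  2. Offset cx by the left pad and cy by the top pad.
  3. Resize the padded square uniformly (isotropically) to hw × hw, scaling fx, fy, cx, and cy by the same factor hw / max(orig_h, orig_w).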

Result: fx = fy for any camera with square pixels, regardless of source aspect ratio. Black-bar padding is a well-represented augmentation in DINOv3 pretraining, so BoxerNet's backbone should handle it gracefully.

Reference implementation

The core of _build_datum in a downstream iPhone-video adapter where we've been running this:

```python
orig_h, orig_w = img_rgb.shape[:2]
side = max(orig_h, orig_w)
pad_top  = (side - orig_h) // 2
pad_left = (side - orig_w) // 2
img_square = cv2.copyMakeBorder(img_rgb, pad_top, side - orig_h - pad_top,
                                pad_left, side - orig_w - pad_left,
                                cv2.BORDER_CONSTANT, value=0)
img = cv2.resize(img_square, (hw, hw), interpolation=cv2.INTER_AREA)

# K: start from K at native resolution, shift cx/cy into the padded square,
# then uniform-resize side → hw. Result: fx == fy for square-pixel cameras.
K_padded = K_native.copy()
K_padded[0, 2] += pad_left
K_padded[1, 2] += pad_top
s = hw / side
K_boxer = K_padded.copy()
K_boxer[[0, 0, 1, 1], [0, 2, 1, 2]] *= s
```
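One way to sanity-check the intrinsics update is to project a 3D point through the native K, apply the same pad and scale to the resulting pixel, and confirm it matches projecting directly through the adjusted K. The sketch below does this in plain Python. The helper names, the 1920×1080 camera, its focal length, and `hw = 960` are all hypothetical, and pixel coordinates are treated as continuous (no half-pixel offset convention).

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (X, Y, Z) to pixel (u, v)."""
    X, Y, Z = point
    return fx * X / Z + cx, fy * Y / Z + cy


def pad_to_square_intrinsics(fx, fy, cx, cy, orig_w, orig_h, hw):
    """Mirror the K update above: shift cx/cy by the pad, then scale uniformly."""
    side = max(orig_w, orig_h)
    pad_left = (side - orig_w) // 2
    pad_top = (side - orig_h) // 2
    s = hw / side
    k_new = (fx * s, fy * s, (cx + pad_left) * s, (cy + pad_top) * s)
    return k_new, (pad_left, pad_top, s)


# Hypothetical 1920x1080 camera with square pixels.
native = (1500.0, 1500.0, 960.0, 540.0)
k_boxer, (pad_left, pad_top, s) = pad_to_square_intrinsics(*native, 1920, 1080, 960)

point = (0.3, -0.2, 2.0)  # arbitrary point in front of the camera
u, v = project(point, *native)
expected = ((u + pad_left) * s, (v + pad_top) * s)  # pixel after pad + resize
actual = project(point, *k_boxer)

assert all(abs(a - e) < 1e-9 for a, e in zip(actual, expected))
assert k_boxer[0] == k_boxer[1]  # fx == fy is preserved for square pixels
```

If the two projections disagree, the pad offsets and the scale were most likely applied in the wrong order.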

I'd gate it behind a `--pad-to-square` flag (default off, preserving current behavior). If the maintainers agree on the motivation, I can open a PR that threads the flag through each loader.
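As a rough illustration of the opt-in flag, the sketch below wires `--pad-to-square` into a standalone `argparse` parser. It assumes the scripts use `argparse`, which may not match how `run_boxer.py` actually parses its options; `build_parser` and the `mode` string are hypothetical names.

```python
import argparse


def build_parser():
    """Hypothetical parser showing only the proposed flag."""
    parser = argparse.ArgumentParser(description="Loader preprocessing options (sketch)")
    parser.add_argument(
        "--pad-to-square",
        action="store_true",  # default is False, so current behavior is preserved
        help="Pad to a square before resizing instead of stretching anisotropically.",
    )
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--pad-to-square"])
mode = "pad" if args.pad_to_square else "stretch"
print(mode)  # pad
```

Because `store_true` defaults to off, existing invocations would keep the stretch behavior unless the flag is passed explicitly.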

What I have / don't have

Have: geometric argument + anecdotal evidence on iPhone 16:9 footage that 3D boxes moved from noticeably-drifted to substantially-tighter when we switched from stretch to pad (mean center-to-cloud distance 0.57 m → 0.28 m; but that run also included a known-K override and a change in SDP source, so pad-to-square alone is not cleanly isolated).

Don't have: a clean A/B on a CA-1M or ScanNet sequence, which is what would actually convince me. Happy to run one if the maintainers point at a held-out eval split.
