Summary
Every loader (`scannet_loader`, `ca_loader`, `aria_loader`) and the reference `run_boxer.py` resize input images to `boxernet.hw × boxernet.hw` via anisotropic stretch, scaling `K` accordingly so that `fx *= hw/orig_w` and `fy *= hw/orig_h`. When the input image is not already square, this produces an anisotropic `K` with `fy/fx = orig_w/orig_h`.
For ScanNet (1296×968, aspect 1.34), the resulting anisotropy (`fy/fx ≈ 1.34`) is within BoxerNet's training distribution and works fine. For wider cameras — notably modern phones (the iPhone main rear camera records 1920×1080, aspect 1.78) — the resulting `K` has `fy/fx ≈ 1.78`, noticeably outside the anisotropy range BoxerNet saw during training.
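For concreteness, the anisotropy introduced by the stretch resize depends only on the source aspect ratio (the target side `hw` cancels out). A quick sketch:

```python
def stretch_anisotropy(orig_w, orig_h):
    """fy/fx after fx *= hw/orig_w and fy *= hw/orig_h (hw cancels),
    assuming square pixels (fx == fy) at native resolution."""
    return orig_w / orig_h

print(round(stretch_anisotropy(1296, 968), 2))   # ScanNet → 1.34
print(round(stretch_anisotropy(1920, 1080), 2))  # iPhone 16:9 → 1.78
```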
Why this matters
BoxerNet's 2D→3D lifting relies on `K` to turn pixel coordinates into rays in the camera frame. A `K` more anisotropic than anything seen in training biases the angular footprint BoxerNet infers from a 2D bbox along the "stretched" axis. The effect is systematic: all boxes get a consistent direction/distance bias.
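To make the mechanism concrete, here is a minimal sketch (not BoxerNet's actual lifting code) of the standard pixel-to-ray unprojection, `K⁻¹ [u, v, 1]ᵀ`. With a stretched `K` (`fy/fx ≈ 1.78`, values hypothetical), a bbox that is square in pixels subtends unequal angles along x and y:

```python
import numpy as np

def pixel_to_ray(K, u, v):
    """Unproject a pixel to a unit ray in camera frame via K^-1 [u, v, 1]^T."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return ray / np.linalg.norm(ray)

# Hypothetical stretched K with fy/fx = 534/300 = 1.78: a 100-px-square bbox
# centered on the principal point subtends a wider angle in x than in y.
K = np.array([[300.0,   0.0, 259.0],
              [  0.0, 534.0, 259.0],
              [  0.0,   0.0,   1.0]])
ang_x = np.arccos(pixel_to_ray(K, 209, 259) @ pixel_to_ray(K, 309, 259))
ang_y = np.arccos(pixel_to_ray(K, 259, 209) @ pixel_to_ray(K, 259, 309))
# ang_x > ang_y even though the bbox is square in pixel space
```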
Proposed fix
Add an optional pad-to-square preprocessing mode to the loaders:
- Pad the original image with zeros (or the dataset's background color) to a `max(orig_h, orig_w) × max(orig_h, orig_w)` square.
- Offset `cx`/`cy` by the pad amounts.
- Uniformly (isotropically) resize the padded square to `hw × hw`.
Result: `fx = fy` for any camera with square pixels, regardless of source aspect ratio. Black-bar padding is a well-represented augmentation in DINOv3 pretraining, so BoxerNet's backbone should handle it gracefully.
Reference implementation
The core of `_build_datum` in a downstream iPhone-video adapter where we've been running this:
```python
import cv2

orig_h, orig_w = img_rgb.shape[:2]
side = max(orig_h, orig_w)
pad_top = (side - orig_h) // 2
pad_left = (side - orig_w) // 2
img_square = cv2.copyMakeBorder(img_rgb,
                                pad_top, side - orig_h - pad_top,
                                pad_left, side - orig_w - pad_left,
                                cv2.BORDER_CONSTANT, value=0)
img = cv2.resize(img_square, (hw, hw), interpolation=cv2.INTER_AREA)

# K: start from K at native resolution, shift cx/cy into the padded square,
# then uniform-resize side → hw. Result: fx == fy for square-pixel cameras.
K_padded = K_native.copy()
K_padded[0, 2] += pad_left
K_padded[1, 2] += pad_top
s = hw / side
K_boxer = K_padded.copy()
K_boxer[[0, 0, 1, 1], [0, 2, 1, 2]] *= s
```
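A quick sanity check of the intrinsics arithmetic with hypothetical iPhone-like numbers (the `K_native` values and `hw = 518` are made up for illustration; only the `fx == fy` invariant matters):

```python
import numpy as np

K_native = np.array([[1450.0,    0.0, 960.0],   # hypothetical 1920×1080 intrinsics
                     [   0.0, 1450.0, 540.0],
                     [   0.0,    0.0,   1.0]])
orig_h, orig_w, hw = 1080, 1920, 518
side = max(orig_h, orig_w)
pad_top, pad_left = (side - orig_h) // 2, (side - orig_w) // 2
K_padded = K_native.copy()
K_padded[0, 2] += pad_left
K_padded[1, 2] += pad_top
s = hw / side
K_boxer = K_padded.copy()
K_boxer[[0, 0, 1, 1], [0, 2, 1, 2]] *= s
assert K_boxer[0, 0] == K_boxer[1, 1]   # isotropic, unlike the stretch path
```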
I'd gate it behind a `--pad-to-square` flag (default off, preserving current behavior). If the maintainers agree on the motivation, I can open a PR that threads the flag through each loader.
What I have / don't have
Have: a geometric argument plus anecdotal evidence on iPhone 16:9 footage that 3D boxes went from noticeably drifted to substantially tighter when we switched from stretch to pad (mean center-to-cloud distance 0.57 m → 0.28 m; but that run also included a known-K override and a change in SDP source, so pad-to-square alone is not cleanly isolated).
Don't have: a clean A/B on a CA-1M or ScanNet sequence, which is what would actually convince me. Happy to run one if the maintainers point at a held-out eval split.