iPhone video to 3D scene graph — no LiDAR, no ROS bag, no Cartographer.
Turn a casual iPhone video into a set of gravity-aligned metric 3D bounding boxes for every object in the scene. Designed for robotics scene understanding, spatial AI, and offline 3D mapping.
Boxer produces state-of-the-art 3D object detection, but expects camera poses, intrinsics, and (optionally) depth as input. At work, those come from a ROS bag with Cartographer poses. For personal iPhone footage, we need a different front-end.
VGGT-SLAM 2.0 (MIT-SPARK) extends VGGT (CVPR 2025 Best Paper) with optical-flow keyframe selection, submap processing, and SL(4) pose-graph optimization + loop closure. That means it scales past VGGT's ~32-frame chunk limit to full 30-second+ videos with globally consistent trajectories.
This pipeline connects them: video in → VGGT-SLAM → Boxer → 3D scene graph out.
iPhone (no LiDAR needed) Mac Mini / Apple Silicon / CUDA box
───────────────────── ──────────────────────────────────
┌────────────────────────────────┐
Record video ──transfer──► │ 1. ffmpeg: extract frames @12fps│
│ │
│ 2. VGGT-SLAM 2.0 │
│ ├─ optical-flow keyframes │
│ ├─ VGGT per-submap inference │
│ ├─ SL(4) pose-graph opt. │
│ ├─ loop closure (DINO+SALAD) │
│ └─ globally consistent poses │
│ │
│ 3. Boxer (per keyframe) │
│ ├─ OWLv2 open-vocab 2D det. │
│ ├─ BoxerNet 3D lift │
│ │ (uses DINOv3 backbone) │
│ └─ optional fusion/tracking │
│ │
│ 4. Export │
│ └─ scene_graph.json │
└────────────────────────────────┘
Frames on disk (frame_%05d.jpg) — written by ffmpeg at configured fps
│
▼
┌─────────────────────────────────────────────────────┐
│ VGGT-SLAM 2.0 (extern/vggt_slam) │
│ │
│ Per frame, decide: keyframe? (optical-flow disparity)
│ │ │
│ ▼ │
│ Accumulate keyframes into submaps (default 16) │
│ │ │
│ ▼ │
│ VGGT (MIT-SPARK fork) per submap → pose_enc, │
│ depth, depth_conf, world_points │
│ │ │
│ ▼ │
│ Image-retrieval (DINO+SALAD) → loop-closure edges │
│ │ │
│ ▼ │
│ Pose-graph on SL(4) manifold → globally consistent │
│ homographies; decompose into K, R, t per frame │
│ │
│ Exposed per keyframe S (via _extract_output): │
│ extrinsic [S, 3, 4] camera_from_world (SE3)│
│ intrinsic [S, 3, 3] pinhole K │
│ world_points [S, H, W, 3] in SL(4) world frame│
│ world_points_conf [S, H, W] │
│ image_names [S] source paths │
└────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ pipeline/gravity.py │
│ │
│ Floor-plane RANSAC on fused VGGT points → │
│ R_align: VGGT Y-down world → Boxer Z-down world │
│ Falls back to fixed Rx(-π/2) if fit is weak │
└────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ pipeline/run_boxer.py (per keyframe) │
│ │
│ 1. Load RGB → pad to square → resize to 960×960 │
│ 2. Build K (iPhone EXIF override if --override-k- │
│ from-mov; else VGGT's estimate scaled to 960) │
│ 3. Rescale VGGT world_points for K consistency │
│ 4. Build datum: img0, cam0, T_world_rig0, sdp_w │
│ sdp_mode=global → fused cloud cropped to │
│ frustum (default; avoids per-submap scale drift)│
│ 5. OWLv2: text prompts → bb2d │
│ 6. BoxerNet: datum + bb2d → 3D OBBs in world │
│ Writes: boxer_3dbbs.csv, gravity.json │
└────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ pipeline/diagnostic.py (per-box quality filter) │
│ │
│ Computes 3 orthogonal metrics per OBB: │
│ A. dist_to_cloud (box center vs scene cloud) │
│ B. iou2d (project 3D box vs OWL bbox) │
│ C. depth_gap (box Z vs VGGT depth in bbox) │
│ │
│ --filter drops boxes exceeding any threshold │
│ Output: boxer_3dbbs_filtered.csv │
└────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ pipeline/fuse.py (per-frame → canonical objects) │
│ │
│ 1. 3D-IoU graph clustering + connected components │
│ 2. Weighted-mean per cluster (conf + logvar) │
│ 3. Post-filter against cloud: drop fused boxes │
│ whose corners are mostly in empty space │
│ │
│ Output: scene_graph_filtered_fused.json │
│ { │
│ "objects": [ │
│ { "label": "chair", "center": [x,y,z], │
│ "size": [w,h,d], "yaw": 1.23, ... }, ... │
│ ], │
│ "coordinate_system": "gravity_aligned_metric", │
│ "up_axis": "Z" │
│ } │
└────────────┬────────────────────────────────────────┘
│
▼
pipeline/ui.py (interactive rerun viewer)
or pipeline/visualize.py (static PLY)
| Step | From | To | Operation |
|---|---|---|---|
| VGGT-SLAM pose-graph homographies | SL(4) projective | SE(3) per-frame | decompose_camera (inside submap.get_all_poses_world) |
| VGGT-SLAM extrinsic → Boxer pose | camera_from_world [3,4] | T_world_camera [4,4] | Invert: R.T, -R.T @ t |
| VGGT-SLAM intrinsic → Boxer camera | Pinhole K [3,3] at VGGT's 294×518 | CameraTW at 960×960 | Scale fx,cx by 960/518, fy,cy by 960/294 |
| VGGT-SLAM points → Boxer SDP | Dense [S,H,W,3] | Semi-dense [N,3] | Confidence-percentile filter + random subsample + NaN pad |
| World-frame alignment | VGGT world (OpenCV Y-down) | Boxer world (Z-down) | Rx(-π/2) on all poses + world points |
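
To make those conversions concrete, here is a minimal numpy sketch of the three non-trivial rows (illustrative only; the real adapter lives in `pipeline/run_boxer.py` and these function names are placeholders):

```python
import numpy as np

def invert_extrinsic(cam_from_world_34: np.ndarray) -> np.ndarray:
    """camera_from_world [3,4] -> T_world_camera [4,4] via R.T, -R.T @ t."""
    R, t = cam_from_world_34[:, :3], cam_from_world_34[:, 3]
    T = np.eye(4)
    T[:3, :3] = R.T
    T[:3, 3] = -R.T @ t
    return T

def rescale_K(K: np.ndarray, src_wh=(518, 294), dst_wh=(960, 960)) -> np.ndarray:
    """Scale a pinhole K from VGGT's processing resolution to Boxer's 960x960 input."""
    sx, sy = dst_wh[0] / src_wh[0], dst_wh[1] / src_wh[1]
    K2 = K.copy()
    K2[0, 0] *= sx; K2[0, 2] *= sx   # fx, cx scale with width
    K2[1, 1] *= sy; K2[1, 2] *= sy   # fy, cy scale with height
    return K2

# Rx(-pi/2): rotate VGGT's Y-down world into Boxer's Z-down convention.
# Applied to every T_world_camera and to all world points: p_boxer = R_ALIGN @ p_vggt.
R_ALIGN = np.array([[1.0, 0.0,  0.0],
                    [0.0, 0.0,  1.0],
                    [0.0, -1.0, 0.0]])
```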
- Non-Euclidean world frame. SL(4) optimization can include a small scale/shear drift relative to a strict metric world. Boxer box sizes may inherit that drift. Plan: add post-hoc Euclidean rectification via plane/gravity priors.
- Gravity is assumed, not estimated. If the iPhone starts tilted, Z-up won't hold exactly. Plan: pull gravity from the iPhone IMU (CMMotionManager) or estimate it via RANSAC floor plane on the fused point cloud (a minimal sketch of that floor-plane fit follows this list).
- CUDA hard-wired upstream. `loop_closure.py` has `device = 'cuda'` at module scope; our wrapper patches it at import time (`_patch_vggt_slam_for_device`). `solver.run_predictions` also calls `torch.cuda.get_device_capability()`, which will need upstream attention for MPS paths.
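
As referenced in the gravity bullet above, the floor-plane fit is ordinary RANSAC on the fused cloud. A minimal sketch of the idea (not the `pipeline/gravity.py` source; thresholds and names here are illustrative):

```python
import numpy as np

def estimate_floor_normal(points: np.ndarray, iters: int = 500,
                          inlier_thresh: float = 0.03, min_inlier_frac: float = 0.15,
                          seed: int = 0):
    """RANSAC a dominant plane (assumed to be the floor) from the fused VGGT cloud.
    Returns the unit plane normal, or None when the fit is too weak, in which case
    the pipeline falls back to the fixed Rx(-pi/2) alignment."""
    rng = np.random.default_rng(seed)
    best_n, best_inliers = None, 0
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-8:
            continue                        # degenerate (collinear) sample
        n /= norm
        inliers = int((np.abs((points - p0) @ n) < inlier_thresh).sum())
        if inliers > best_inliers:
            best_inliers, best_n = inliers, n
    if best_inliers < min_inlier_frac * len(points):
        return None                         # weak fit -> caller uses the fixed fallback rotation
    return best_n                           # R_align rotates this normal onto the gravity axis
```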
Boxer's inference pipeline runs three separate neural nets doing three distinct jobs. Understanding what each does, where its weights live, and how they hand data off makes extern/boxer/run_boxer.py readable and explains why our adapter (pipeline/run_boxer.py) looks the way it does.
| Model | Role | Input | Output | Weights |
|---|---|---|---|---|
| OWLv2 | Open-vocabulary 2D detector | RGB image + text prompts (e.g. `chair`, `table`) | 2D boxes [M, 4] + scores + label indices | extern/boxer/ckpts/owlv2-base-patch16-ensemble.pt (121 MB) |
| DINOv3 | Vision foundation model (frozen backbone) | RGB image | Dense per-patch feature map | extern/boxer/ckpts/dinov3_vits16plus_pretrain_lvd1689m-4057cbaa.pth (115 MB) |
| BoxerNet | Category-agnostic 3D lifter | Image + 2D boxes + pose + K + semi-dense world points | 3D oriented bounding boxes (ObbTW [M, 165]) | extern/boxer/ckpts/boxernet_hw960in4x6d768-wssxpf9p.ckpt (400 MB) |
Key insight: DINOv3 is not a detector — it's a pure image encoder (frozen weights from Meta's self-supervised ViT-S/16+). BoxerNet calls it internally and never trains it. OWLv2 is an independent external model; its output is just data flowing into BoxerNet.
┌─────────────────────────────────────────────┐
Text labels ────────┐ │ OWLv2 (open-vocabulary 2D detection) │
"chair, table, │──►│ • CLIP-style text encoder (run once) │
sofa, ..." │ │ • ViT vision encoder │
│ • Class + box heads │
│ Output: bb2d [M, 4] in (x1,x2,y1,y2) + │
│ scores [M] + label_ints [M] │
└──────────────────┬──────────────────────────┘
│ bb2d, labels
▼
RGB frame [1,3,H,W] ─────┬────────────► ┌─────────────────────────────────┐
Pose T_world_cam ────────┤ │ BoxerNet │
Intrinsics K (CameraTW) ─┤ │ │
Semi-dense world pts ────┤ │ 1. DINOv3 backbone │
│ │ image → [num_patches, dim] │
│ │ (frozen ViT-S/16+) │
│ │ │
│ │ 2. Project world points into │
│ │ DINOv3 feature grid (via K │
│ │ and pose) → per-patch depth │
│ │ │
│ │ 3. For each bb2d, pool patch │
│ │ features inside box → query │
│ │ │
│ │ 4. Cross-attend query against │
│ │ scene tokens (img patches + │
│ │ depth patches) │
│ │ │
│ │ 5. Regression heads: │
│ │ • center in voxel frame │
│ │ • size (w, h, d) │
│ │ • yaw around gravity axis │
│ │ • log-variance (uncertainty) │
│ │ │
│ │ 6. Voxel → world coords using │
│ │ T_world_voxel (gravity- │
│ │ aligned from pose) │
│ │ │
│ │ Output: obbs_pr_w [1, M, 165] │
│ │ (ObbTW: bb3_object, T_world_ │
│ │ object, prob, sem_id, text) │
│ └─────────────────┬────────────────┘
│
┌──────────────────────────────────────┤
│ │
▼ ▼
Write CSV per frame Optional: offline fusion
(utils/file_io.ObbCsvWriter2) (utils/fuse_3d_boxes.py)
or online tracking
(utils/track_3d_boxes.py)
OWLv2 — the "what and roughly where in 2D"
- Open-vocabulary: you change the category list per run without retraining. Each text prompt is encoded once at init via a CLIP text encoder and cached; at inference only image features + cached text embeddings are consumed.
- Output convention gotcha: `OwlWrapper.forward` returns `bb2d` in `(x1, x2, y1, y2)` order — not the standard `(x1, y1, x2, y2)`. BoxerNet expects the same Boxer-internal order. `extern/boxer/run_boxer.py:558` reformats to standard order only when writing out a viewer-friendly CSV.
- Alternatives: Boxer also accepts `--gt2d` (ground-truth boxes from dataset annotations) or `--cache2d` (replay detections from a previous CSV). These bypass OWLv2.
DINOv3 — the "visual features" inside BoxerNet
- Loaded once inside `BoxerNet.__init__` as `self.dino = DinoV3Wrapper("dinov3_vits16plus")`.
- Vision-only; no text, no queries — just image in, feature map out.
- Self-supervised on 1.7 B unlabeled images (`lvd1689m` in the weight filename).
- Frozen weights. BoxerNet training only updates the heads and attention layers on top of DINOv3 features.
- Why DINOv3 over OWL's internal CLIP ViT? DINOv3 features are denser and geometrically richer; OWL's ViT is tuned for language-grounded classification, not spatial reasoning. Using two different ViTs for two different jobs is intentional.
BoxerNet — the "lift 2D to metric 3D"
- Category-agnostic. It doesn't know "chair" from "sofa"; it just gets a 2D box and learns how boxes look in 3D given camera context.
- Two crucial scene inputs fuse the geometry:
  - Camera pose + intrinsics (`T_world_rig0`, `cam0`) → lets BoxerNet place predictions in world coords.
  - Semi-dense world points (`sdp_w`) → a cloud of known 3D points visible in the frame. BoxerNet projects these into the DINOv3 feature grid so each image patch gets a rough depth prior. This is the single biggest signal that makes monocular 3D boxes metrically plausible.
- The output is gravity-aligned: yaw is around the gravity axis; the 6-DoF pose of each box is split into gravity-aligned rotation + 3D center + axis-aligned half-extents in the object frame.
- Outputs aleatoric uncertainty (log-variance) per box, used downstream for fusion weighting.
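
To make step 2 of the diagram concrete, here is a conceptual numpy sketch of turning a semi-dense cloud into a per-patch depth prior (not Boxer's actual code; the grid handling and names are assumptions):

```python
import numpy as np

def per_patch_depth_prior(points_w, T_cam_world, K, img_hw=(960, 960), patch=16):
    """Project world points into the image, bucket them by ViT patch, and take the
    median camera-frame depth per patch. Empty patches stay NaN (no prior)."""
    H, W = img_hw
    p_cam = (T_cam_world[:3, :3] @ points_w.T + T_cam_world[:3, 3:4]).T   # [N,3] camera frame
    p_cam = p_cam[p_cam[:, 2] > 0]                                        # keep points in front
    uv = (K @ p_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                                           # pixel coordinates
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    uv, z = uv[valid], p_cam[valid, 2]
    gh, gw = H // patch, W // patch
    prior = np.full((gh, gw), np.nan)
    ix, iy = (uv[:, 0] // patch).astype(int), (uv[:, 1] // patch).astype(int)
    for gy in range(gh):
        for gx in range(gw):
            sel = (iy == gy) & (ix == gx)
            if sel.any():
                prior[gy, gx] = np.median(z[sel])
    return prior                                                          # [gh, gw] depth prior
```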
Our VGGT-SLAM output supplies exactly the geometric inputs BoxerNet needs:
| BoxerNet input | Source in VGGT-SLAM output |
|---|---|
| `img0` RGB | `VGGTSLAMOutput.image_names[i]` loaded from disk |
| `cam0` CameraTW | Built from `VGGTSLAMOutput.intrinsic[i]` (pinhole K, scaled to 960×960) |
| `T_world_rig0` PoseTW | Invert `VGGTSLAMOutput.extrinsic[i]`, then Rx(-π/2) for Z-down gravity |
| `sdp_w` | `VGGTSLAMOutput.world_points[i]` flattened, rotated, confidence-filtered |
| `bb2d` | Computed on-the-fly by OWLv2 per frame |
| `rotated0` | `False` for iPhone landscape clips |
| `time_ns0` | Derived from frame index |
All three checkpoints land in extern/boxer/ckpts/ via extern/boxer/scripts/download_ckpts.sh. BoxerNet.load_from_checkpoint wires BoxerNet and DINOv3 together; OWLv2 is loaded independently by OwlWrapper.
BoxerNet runs independently on every keyframe — so if the same physical chair is visible in 20 frames, you get 20 slightly-different 3D boxes for it (viewpoint noise, partial occlusion, OWL bbox jitter). Boxer ships two post-processing tools to collapse those redundant predictions:
Offline fusion — utils/fuse_3d_boxes.py. The right tool for static scenes (rooms, furniture, our iPhone clip). Runs after all frames are processed, reads the per-frame CSV, and:
- Builds a 3D IoU graph — boxes are nodes; an edge connects any two whose 3D IoU exceeds `iou_threshold` (default 0.3).
- Extracts connected components — each component is one physical object.
- Filters components with fewer than `min_detections` frames (default 4) to drop one-shot false positives.
- Within each component, computes a confidence- and variance-weighted mean of center, size, and yaw (quaternion slerp for orientation).
- Writes one fused row per cluster.
Net effect: 20 noisy chair detections → 1 crisper chair box with tighter estimates.
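
A minimal sketch of that fusion logic, assuming an `iou3d(a, b)` helper exists (the real `utils/fuse_3d_boxes.py` additionally weights by log-variance and slerps quaternions; this only shows the graph clustering and a confidence-weighted mean):

```python
import numpy as np
import networkx as nx

def fuse_detections(boxes, iou3d, iou_threshold=0.3, min_detections=4):
    """boxes: list of dicts with 'center', 'size', 'yaw', 'conf'. iou3d(a, b) -> float.
    Cluster by 3D-IoU overlap, then take a confidence-weighted mean per cluster."""
    g = nx.Graph()
    g.add_nodes_from(range(len(boxes)))
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou3d(boxes[i], boxes[j]) > iou_threshold:
                g.add_edge(i, j)
    fused = []
    for comp in nx.connected_components(g):
        comp = list(comp)
        if len(comp) < min_detections:
            continue                                   # one-shot false positive
        w = np.array([boxes[i]["conf"] for i in comp])
        w = w / w.sum()
        fused.append({
            "center": sum(wi * np.asarray(boxes[i]["center"]) for wi, i in zip(w, comp)),
            "size":   sum(wi * np.asarray(boxes[i]["size"]) for wi, i in zip(w, comp)),
            # circular weighted mean for yaw to avoid wrap-around artifacts
            "yaw": float(np.arctan2(sum(wi * np.sin(boxes[i]["yaw"]) for wi, i in zip(w, comp)),
                                    sum(wi * np.cos(boxes[i]["yaw"]) for wi, i in zip(w, comp)))),
            "num_detections": len(comp),
        })
    return fused
```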
Online tracking — utils/track_3d_boxes.py. The right tool for videos with moving objects (robot navigation). Processes frames in temporal order, maintaining a set of active Kalman-filtered tracks. Each new 3D box is matched to an existing track by 3D IoU; unmatched boxes start new tracks; tracks that disappear for too long are terminated. Think SORT extended to 3D.
Why can't fusion do tracking's job? Offline fusion ignores time — it clusters boxes by spatial overlap regardless of when they were seen. That's wrong for moving objects (a chair pushed across the room would look like two separate clusters). Conversely, tracking ignores non-active tracks — if a chair leaves the frame and reappears 200 frames later, online tracking treats it as two different chairs.
We default to fusion for the iPhone room-scan pipeline; --track is planned for the ROS2 robot path.
TL;DR. VGGT is trained to regress K from RGB along with depth and pose — this lets it run on any camera without calibration. But for iPhone footage it systematically overestimates fx/fy by ~26%, which propagates into monocular-depth scale errors and 3D-box drift.
The data point. Our iPhone Air test clip (test/IMG_6826.MOV):
ffprobe/exiftool metadata from the .MOV:
Lens model : iPhone Air back camera 5.96mm f/1.6
Focal length (35mm equiv): 26 mm
Clean aperture : 1920×1080
This gives a real intrinsic matrix at native resolution:
| | fx (px) | fy (px) | cx | cy | HFOV |
|---|---|---|---|---|---|
| iPhone EXIF (ground truth) | 1386 | 1386 | 960 | 540 | 69.4° |
| VGGT predicted (scaled from 518×294 to 1920×1080) | 1750 | 1735 | 960 | 540 | 57.4° |
VGGT's fx is +26% larger than reality. Equivalent statement: VGGT thinks the iPhone sees a 57° HFOV when the real lens is 69° (wide-angle). This is a training-distribution mismatch — VGGT's training mix skews toward narrower indoor camera rigs (ScanNet, CO3D, Habitat renderings) than a 26mm-equiv smartphone wide-angle.
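
The EXIF row in that table follows from the standard 35mm-equivalent pinhole relation (full-frame width 36 mm). A small sketch of the arithmetic behind the 1386 px and 69.4° numbers (the actual parser is `pipeline/iphone_k.py`; this only reproduces the math):

```python
import math

def k_from_35mm_equiv(focal_35mm_equiv_mm: float, width_px: int, height_px: int):
    """Pinhole K from a 35mm-equivalent focal length (full-frame width = 36 mm),
    so fx = width_px * f_equiv / 36, assuming square pixels and a centered principal point."""
    fx = width_px * focal_35mm_equiv_mm / 36.0
    fy = fx
    cx, cy = width_px / 2.0, height_px / 2.0
    hfov_deg = math.degrees(2 * math.atan(width_px / (2 * fx)))
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]], hfov_deg

K, hfov = k_from_35mm_equiv(26.0, 1920, 1080)
# fx ≈ 1386.7 px, HFOV ≈ 69.4°, matching the "iPhone EXIF (ground truth)" row above
```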
Why it matters for 3D boxes. BoxerNet's monocular 3D lifting fundamentally relies on the image-to-ray mapping defined by K. An overestimated fx means BoxerNet thinks a given 2D-bbox subtends a smaller angular footprint than it really does → compensates by pushing the box farther away along its ray → the 3D box drifts behind the actual object. The error is systematic and multiplicative: at 1.26× K overestimate, a box at true 1m depth gets placed at roughly 1.26m. That matches the remaining ~20-30 cm residual drift we observe after fixing K anisotropy and SDP confidence.
Multi-view reasoning doesn't automatically fix this. You might expect VGGT-SLAM's SL(4) bundle-ish adjustment or the global-fused SDP to "average out" per-frame K errors. But the error is systematic — every frame uses the same overestimated K, so every frame's reconstruction is uniformly stretched and the submaps align consistently at the wrong scale. No amount of averaging corrects a population-wide bias.
The fix we're prototyping. pipeline.iphone_k.extract_K_from_mov(<path>.MOV) parses MOV EXIF (lens model, 35mm equivalent) and constructs the real pinhole K at the native image resolution. run_boxer.py --override-k-from-mov substitutes this into Boxer's datum (keeping VGGT's depth + pose, but consistently rescaling VGGT's world points so they re-project correctly through the corrected K). This is a cheap, honest experiment: if boxes tighten up, known-K is necessary and sufficient; if not, the remaining drift is VGGT-depth-metric-scale which needs a depth-only replacement (e.g., Apple Depth Pro, which accepts K as input).
Why we didn't just do this from the start. VGGT-SLAM's upstream assumes full end-to-end RGB → (K, depth, pose). Pulling K out of EXIF breaks that symmetry for only one camera family. It's the right thing for our iPhone path, but not a general replacement — the override is gated behind a CLI flag.
We have no ground-truth 3D annotations for an iPhone scan, so to judge box quality we compute three metrics per OBB that each catch a different failure mode (implemented in pipeline/diagnostic.py):
For each 3D box, distance from its center to the nearest point in the VGGT-SLAM fused cloud.
dist = min_{p ∈ scene_cloud} ‖ box_center − p ‖
- Proxy for "is the box near any real surface?" Correct boxes land inside or very close to the actual room geometry, because real objects have surface points captured by VGGT.
- Cheap: a brute-force nearest-neighbor search over the 50-80k-point cloud for our ~1k detections.
- Catches: boxes floating in empty space (screenshot-visible outliers — cyan box far from room).
- Misses: boxes at roughly-right angular direction but wrong depth (can still be close to some cloud point).
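
A minimal sketch of metric A using a SciPy KD-tree for the nearest-neighbor query (the concrete `pipeline/diagnostic.py` implementation may differ):

```python
import numpy as np
from scipy.spatial import cKDTree

def dist_to_cloud(box_centers: np.ndarray, scene_cloud: np.ndarray) -> np.ndarray:
    """For each box center [M,3], distance to the nearest point in the fused cloud [N,3]."""
    tree = cKDTree(scene_cloud)
    dists, _ = tree.query(box_centers, k=1)
    return dists          # flag a box when dists[i] exceeds the max-dist-to-cloud threshold
```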
Project the 3D box's 8 corners into the native image using the same K and pose BoxerNet saw. Take the axis-aligned 2D bounding box of the projected corners. Compute IoU with the OWL 2D bbox that prompted BoxerNet.
corners_cam = T_cw · corners_world # 8 × 3
uv = project(corners_cam, K) # 8 × 2
projected_bb = [uv.min(0), uv.max(0)] flattened # axis-aligned xyxy
iou2d = IoU(projected_bb, owl_prompt_bbox) # scalar in [0, 1]
- Purely geometric, no ground truth needed. Tests the self-consistency of (K, pose, 3D position).
- Catches: K overestimation, pose errors, or box placed at wrong 3D direction — all of which make the reprojection miss the 2D bbox region.
- Misses: a box at the right angular direction but scaled too large/small along the ray (the 2D footprint still overlaps well).
- Requires recovering the OWL-bbox ↔ BoxerNet-output correspondence that's lost after `thresh_3d` filtering. We reconstruct it by nearest-projected-center within the same label class.
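
Filling in the `project` and `IoU` steps of the pseudocode above, a self-contained sketch of metric B (again illustrative, not the `pipeline/diagnostic.py` source):

```python
import numpy as np

def reprojection_iou(corners_world, T_cam_world, K, owl_bbox_xyxy):
    """Project the OBB's 8 world-frame corners with the same K/pose BoxerNet saw,
    take the axis-aligned bbox of the projections, and IoU it against the OWL 2D box."""
    pc = (T_cam_world[:3, :3] @ corners_world.T + T_cam_world[:3, 3:4]).T   # [8,3] camera frame
    uv = (K @ pc.T).T
    uv = uv[:, :2] / uv[:, 2:3]                                             # [8,2] pixel coords
    px1, py1 = uv.min(axis=0)
    px2, py2 = uv.max(axis=0)
    ox1, oy1, ox2, oy2 = owl_bbox_xyxy
    iw = max(0.0, min(px2, ox2) - max(px1, ox1))
    ih = max(0.0, min(py2, oy2) - max(py1, oy1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (ox2 - ox1) * (oy2 - oy1) - inter
    return inter / union if union > 0 else 0.0
```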
Project VGGT's semi-dense world points into the native image. Keep those landing inside the OWL 2D bbox. Compute their camera-frame Z distribution (q10/q50/q90). Compare with the camera-frame Z range spanned by the box's 8 corners.
sdp_z_in_bbox = [z for p ∈ VGGT_points
if owl_bbox contains project(p, K, pose)]
depth_gap = max(0, min_box_z − q90(sdp_z_in_bbox), # box in front of cloud
q10(sdp_z_in_bbox) − max_box_z) # box behind cloud
- Zero if the box's depth interval overlaps with where VGGT sees the object's surface. Positive = gap in meters.
- Catches exactly the "angular direction right, distance wrong" failure mode that A and B can both miss. This is the specific failure expected from systematic K bias or SL(4) scale drift, so it's the most sensitive metric for our current hypothesis.
- Requires enough VGGT points inside the bbox (default ≥30 points after 70th-percentile confidence filter) to be meaningful.
We flag a box as bad if any metric exceeds its threshold, e.g.:
dist_to_cloud > 0.8 m OR iou2d < 0.05 OR depth_gap > 0.5 m
pipeline.diagnostic --filter writes a filtered CSV that pipeline.fuse can then consume to produce a cleaner scene graph. The thresholds above are defaults; tune via CLI flags.
None of these metrics verify absolute metric scale (e.g., is the box actually 47 cm wide in the real world?). For that we'd need either a ruler/checkerboard in the scene or an Apple Depth Pro-style metric depth reference. The current metrics only verify self-consistency between BoxerNet's output and the (K, pose, SDP) geometry we fed it.
box/
├── README.md
├── pyproject.toml
├── setup.sh # One-command install (chains VGGT-SLAM's setup.sh)
│
├── extern/
│ ├── vggt_slam/ # submodule → MIT-SPARK/VGGT-SLAM 2.0
│ └── boxer/ # submodule → facebookresearch/boxer
│ └── ckpts/ # boxernet + dinov3 + owlv2 weights
│
├── pipeline/
│ ├── __init__.py
│ ├── _common.py # Shared constants: R_ALIGN_FALLBACK, detect_device
│ ├── config.py # PipelineConfig dataclass
│ ├── extract_frames.py # ffmpeg: video → frames
│ ├── run_vggt_slam.py # VGGT-SLAM 2.0 inference wrapper
│ ├── gravity.py # Floor-plane RANSAC → Z-down world alignment
│ ├── iphone_k.py # Parse iPhone MOV EXIF → real camera K
│ ├── save_owl_bbs.py # OWLv2 only, dumps per-frame 2D bboxes JSON
│ ├── run_boxer.py # Boxer (OWLv2 + BoxerNet) inference wrapper
│ ├── diagnostic.py # Three geometric quality metrics + filter
│ ├── fuse.py # Offline 3D-box fusion + corner-check post-filter
│ ├── export.py # 3D boxes → scene_graph.json
│ ├── visualize.py # VGGT-SLAM output → fused PLY (for MeshLab)
│ ├── ui.py # Rerun-based live viewer (2D + 3D synced)
│ └── run.py # End-to-end orchestrator
│
└── test/ # Local iPhone footage (gitignored — never committed)
- Python 3.11 (preferred) or 3.12 — VGGT-SLAM's upstream pins 3.11
- Mac Apple Silicon or NVIDIA GPU (Mac mini works for short clips; CUDA is still fastest)
- ffmpeg installed (`brew install ffmpeg`)
- ~15 GB disk for model weights (VGGT ~5 GB, DINO+SALAD ~1 GB, Boxer ~640 MB, optional PE/SAM3 larger)
- 16 GB+ unified memory (Mac) or 12 GB+ VRAM (CUDA)
# Clone with submodules
git clone --recursive https://github.com/zzhang001/box.git
cd box
# One-command setup (creates venv, chains into extern/vggt_slam/setup.sh,
# then fetches Boxer checkpoints)
./setup.sh
(cd extern/boxer && bash scripts/download_ckpts.sh)

`setup.sh` delegates to VGGT-SLAM's own `setup.sh` for the SLAM stack (salad, MIT-SPARK VGGT fork, gtsam, perception-encoder, SAM3 installed into `extern/vggt_slam/third_party/`). Boxer's checkpoints (BoxerNet, DINOv3, OWLv2) are a separate download.
# Basic: 12 fps extraction, auto device, default Boxer labels
python -m pipeline.run --video test/IMG_6826.MOV --output output/
# Custom OWLv2 labels for a living-room scan
python -m pipeline.run \
--video test/IMG_6826.MOV \
--output output/ \
  --fps 12 \
  --submap-size 16 \
  --labels "chair,table,sofa,lamp,plant" \
  --device cpu
# Flags: --fps 12 extracts 12 frames/sec, --submap-size 16 sets the VGGT submap chunk,
# --labels sets the OWLv2 prompts, --device cpu skips CUDA on a Mac mini.

The one-liner `pipeline.run` does extraction → SLAM → Boxer → export, but for iPhone clips the best results come from running the full filter + fusion chain explicitly:
# 1. Extract frames (ffmpeg, 12 fps).
python -m pipeline.extract_frames --video input.mov --output output/frames/ --fps 12
# 2. VGGT-SLAM (slow; ~30-60 min on Mac mini CPU for a 30-sec clip).
python -m pipeline.run_vggt_slam --frames output/frames/ --output output/vggt_slam/
# 3. OWLv2 pass for the UI's 2D-bbox overlay (Boxer runs OWL internally,
# but it doesn't persist the boxes — this saves them for replay).
python -m pipeline.save_owl_bbs \
--vggt-slam output/vggt_slam/ \
--output output/boxer/owl_2dbbs.json \
--labels "chair,table,sofa,monitor,laptop,keyboard,lamp,plant,bookshelf,cabinet,fridge,oven,sink,microwave"
# 4. Boxer with iPhone K override + global SDP (best config on iPhone data).
python -m pipeline.run_boxer \
--vggt-slam output/vggt_slam/ \
--output output/boxer/ \
--labels "chair,table,sofa,monitor,laptop,keyboard,lamp,plant,bookshelf,cabinet,fridge,oven,sink,microwave" \
--sdp-mode global \
--override-k-from-mov input.mov
# 5. Diagnostic filter (drops boxes failing any of dist-to-cloud / IoU / depth-gap).
python -m pipeline.diagnostic \
--vggt-slam output/vggt_slam/ \
--boxer-csv output/boxer/boxer_3dbbs.csv \
--owl-json output/boxer/owl_2dbbs.json \
--out output/boxer/diagnostic \
--override-k-from-mov input.mov \
--filter --max-dist-to-cloud 0.5 --min-iou2d 0.3 --max-depth-gap 0.3
# 6. Fuse per-frame detections into canonical objects + drop off-cloud phantoms.
python -m pipeline.fuse \
--input output/boxer/boxer_3dbbs_filtered.csv \
--output-csv output/boxer/boxer_3dbbs_filtered_fused.csv \
--scene-graph output/boxer/scene_graph_filtered_fused.json \
--post-filter-vggt-slam output/vggt_slam/
# 7. Interactive viewer (opens a Rerun window).
python -m pipeline.ui \
--vggt-slam output/vggt_slam/ \
--boxer-csv output/boxer/boxer_3dbbs_filtered.csv \
--owl-json output/boxer/owl_2dbbs.json \
--scene-graph output/boxer/scene_graph_filtered_fused.json \
  --save-rrd output/box_ui.rrd && rerun output/box_ui.rrd

# For MeshLab / CloudCompare / 3dviewer.net
python -m pipeline.visualize --input output/vggt_slam/ --output output/vggt_slam/fused.ply
# Overlay the fused 3D boxes too (colored wireframes)
python -m pipeline.visualize \
--input output/vggt_slam/ \
--scene-graph output/boxer/scene_graph_filtered_fused.json \
  --output output/vggt_slam/fused_with_boxes.ply

- Record: walk slowly through a room, keep the phone roughly upright, avoid fast rotations.
- Transfer: AirDrop or USB to your Mac. Drop the `.MOV` into `test/` (gitignored).
- Run: `python -m pipeline.run --video test/my_room.MOV --output output/`
- Inspect: `output/scene_graph.json` for the 3D object list, or open `output/vggt_slam/fused.ply` in MeshLab.
What makes a clip work well:
- Keep FPS ≥ 10 when extracting. Below that, VGGT-SLAM's optical-flow keyframe selector loses temporal baseline. 12 fps is our default.
- Steady movement. Slow, smooth camera motion gives VGGT better depth estimates.
- Overlap. Consecutive keyframes need >60% visual overlap; the keyframe selector drops frames whose optical-flow disparity is below `--min-disparity` (default 50).
- Lighting. Well-lit scenes produce better 2D detections.
- Submap size. 16 frames per submap balances VGGT memory vs. pose-graph granularity. Drop to 8-12 if you OOM on MPS.
- Loop closures. `--max-loops 1` (default) is fine for room-scale scans. Longer walks with revisits are handled automatically by the DINOv2+SALAD image-retrieval backbone.
| File | Size | Location | Fetched by |
|---|---|---|---|
| VGGT-1B | ~5 GB | ~/.cache/torch/hub/checkpoints/model.pt | pipeline.run_vggt_slam at first inference |
| DINOv2 ViT-B/14 | ~330 MB | same cache dir | same |
| dino_salad.ckpt | ~335 MB | same cache dir | setup.sh |
| BoxerNet + DINOv3 + OWLv2 | ~640 MB total | extern/boxer/ckpts/ | extern/boxer/scripts/download_ckpts.sh |
Subsequent runs reuse the caches.
{
"metadata": {
"source_video": "living_room.mov",
"num_frames": 223,
"pipeline_version": "0.1.0",
"timestamp": "2026-04-17T08:00:00Z"
},
"coordinate_system": {
"type": "gravity_aligned_metric",
"up_axis": "Z",
"units": "meters"
},
"objects": [
{
"id": 0,
"label": "chair",
"center": [1.23, -0.45, 0.42],
"size": [0.55, 0.55, 0.88],
"yaw": 1.57,
"confidence": 0.92,
"uncertainty": 0.05,
"frame_idx": 17,
"time_ns": 17000000
}
]
}

| Platform | Device | VGGT-SLAM | Boxer | Notes |
|---|---|---|---|---|
| macOS 15 | Mac mini (Apple Silicon) | CPU fp32 | CPU / MPS | Dev target; full 30-sec clip in ~2 hrs including ~16 VGGT submaps |
| macOS 15 | M3 Max 36 GB | MPS / CPU | MPS | Headroom for longer footage |
| Linux | RTX 4090 | CUDA | CUDA | Fastest; matches upstream benchmarks |
We keep VGGT in fp32 on CPU (not bf16) because Apple Silicon NEON lacks hardware bf16 GEMM, and PyTorch CPU bf16 falls back to scalar code that's ~50× slower than fp32 (we verified this: same 5-frame submap took >13 min in bf16 vs 17 s in fp32). CUDA/MPS keep bf16.
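
A quick, hypothetical micro-benchmark to reproduce the fp32-vs-bf16 CPU gap on your own machine (not part of the pipeline):

```python
import time
import torch

def bench(dtype: torch.dtype, n: int = 2048, reps: int = 10) -> float:
    """Average seconds per n x n matmul on CPU for the given dtype."""
    a = torch.randn(n, n, dtype=dtype)
    b = torch.randn(n, n, dtype=dtype)
    t0 = time.perf_counter()
    for _ in range(reps):
        _ = a @ b
    return (time.perf_counter() - t0) / reps

print("fp32:", bench(torch.float32))    # dispatches to an optimized GEMM
print("bf16:", bench(torch.bfloat16))   # on CPU this may hit a much slower fallback path
```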
| Approach | LiDAR? | Global Consistency | Platform | Quality |
|---|---|---|---|---|
| This pipeline (VGGT-SLAM → Boxer) | No | Yes (SL(4) + loop closure) | Mac/Linux | High |
| ROS bag + Cartographer → Boxer | Depends | Yes | Robot rig | Production; needs bags |
| Boxer3D (iOS app) | Yes | Yes | iPhone Pro only | Highest |
| COLMAP + Boxer | No | Yes | Mac/Linux | High, but slow SfM |
Done (v0.4-post-filter):
- VGGT-SLAM 2.0 wrapper on Mac mini (CPU fp32; 471 frames → 223 keyframes, 16 submaps, 2 loop closures)
- OWLv2 + BoxerNet wired in via `pipeline.run_boxer` with pad-to-square K fix
- Floor-plane RANSAC gravity alignment (`pipeline.gravity`)
- iPhone K override from MOV EXIF (`pipeline.iphone_k`) — 26% VGGT K error corrected
- `sdp_mode=global` to sidestep per-submap SL(4) scale drift
- Three geometric diagnostic metrics + threshold filter (`pipeline.diagnostic`)
- Offline 3D-box fusion + corner-check post-filter (`pipeline.fuse`)
- Rerun-based interactive viewer (`pipeline.ui`)
- End-to-end validation on `test/IMG_6826.MOV`: 33 fused objects with plausible sizes + placements
Next:
- Try Apple Depth Pro (known K input + metric depth) to replace VGGT depth for Boxer's SDP, compare against v0.4 baseline
- Post-hoc Euclidean rectification of SL(4)-warped world
- Fix VGGT-SLAM MPS device mismatch for Apple Silicon acceleration (currently CPU fp32 only)
- Streaming / online mode
- ROS2 bridge
This pipeline code is MIT licensed. Submodules have their own licenses:
- VGGT-SLAM: see extern/vggt_slam/LICENSE
- Boxer: CC-BY-NC (see extern/boxer/LICENSE)