iPhone video to 3D scene graph — no LiDAR, no ROS bag, no Cartographer.
Turn a casual iPhone video into a set of gravity-aligned metric 3D bounding boxes for every object in the scene. Designed for robotics scene understanding, spatial AI, and offline 3D mapping.
Boxer produces state-of-the-art 3D object detection, but expects camera poses, intrinsics, and (optionally) depth as input. At work, those come from a ROS bag with Cartographer poses. For personal iPhone footage, we need a different front-end.
VGGT-SLAM 2.0 (MIT-SPARK) extends VGGT (CVPR 2025 Best Paper) with optical-flow keyframe selection, submap processing, and SL(4) pose-graph optimization + loop closure. That means it scales past VGGT's ~32-frame chunk limit to full 30-second+ videos with globally consistent trajectories.
This pipeline connects them: video in → VGGT-SLAM → Boxer → 3D scene graph out.
iPhone (no LiDAR needed) Mac Mini / Apple Silicon / CUDA box
───────────────────── ──────────────────────────────────
┌────────────────────────────────┐
Record video ──transfer──► │ 1. ffmpeg: extract frames @12fps│
│ │
│ 2. VGGT-SLAM 2.0 │
│ ├─ optical-flow keyframes │
│ ├─ VGGT per-submap inference │
│ ├─ SL(4) pose-graph opt. │
│ ├─ loop closure (DINO+SALAD) │
│ └─ globally consistent poses │
│ │
│ 3. Boxer (per keyframe) │
│ ├─ OWLv2 open-vocab 2D det. │
│ ├─ BoxerNet 3D lift │
│ │ (uses DINOv3 backbone) │
│ └─ optional fusion/tracking │
│ │
│ 4. Export │
│ └─ scene_graph.json │
└────────────────────────────────┘
Frames on disk (frame_%05d.jpg) — written by ffmpeg at configured fps
│
▼
┌─────────────────────────────────────────────────────┐
│ VGGT-SLAM 2.0 (extern/vggt_slam) │
│ │
│ Per frame, decide: keyframe? (optical-flow disparity)
│ │ │
│ ▼ │
│ Accumulate keyframes into submaps (default 16) │
│ │ │
│ ▼ │
│ VGGT (MIT-SPARK fork) per submap → pose_enc, │
│ depth, depth_conf, world_points │
│ │ │
│ ▼ │
│ Image-retrieval (DINO+SALAD) → loop-closure edges │
│ │ │
│ ▼ │
│ Pose-graph on SL(4) manifold → globally consistent │
│ homographies; decompose into K, R, t per frame │
│ │
│ Exposed per keyframe S (via _extract_output): │
│ extrinsic [S, 3, 4] camera_from_world (SE3)│
│ intrinsic [S, 3, 3] pinhole K │
│ world_points [S, H, W, 3] in SL(4) world frame│
│ world_points_conf [S, H, W] │
│ image_names [S] source paths │
└────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ pipeline/gravity.py │
│ │
│ Floor-plane RANSAC on fused VGGT points → │
│ R_align: VGGT Y-down world → Boxer Z-down world │
│ Falls back to fixed Rx(-π/2) if fit is weak │
└────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ pipeline/run_boxer.py (per keyframe) │
│ │
│ 1. Load RGB → pad to square → resize to 960×960 │
│ 2. Build K (iPhone EXIF override if --override-k- │
│ from-mov; else VGGT's estimate scaled to 960) │
│ 3. Rescale VGGT world_points for K consistency │
│ 4. Build datum: img0, cam0, T_world_rig0, sdp_w │
│ sdp_mode=global → fused cloud cropped to │
│ frustum (default; avoids per-submap scale drift)│
│ 5. OWLv2: text prompts → bb2d │
│ 6. BoxerNet: datum + bb2d → 3D OBBs in world │
│ Writes: boxer_3dbbs.csv, gravity.json │
└────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ pipeline/diagnostic.py (per-box quality filter) │
│ │
│ Computes 3 orthogonal metrics per OBB: │
│ A. dist_to_cloud (box center vs scene cloud) │
│ B. iou2d (project 3D box vs OWL bbox) │
│ C. depth_gap (box Z vs VGGT depth in bbox) │
│ │
│ --filter drops boxes exceeding any threshold │
│ Output: boxer_3dbbs_filtered.csv │
└────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ pipeline/fuse.py (per-frame → canonical objects) │
│ │
│ 1. 3D-IoU graph clustering + connected components │
│ 2. Weighted-mean per cluster (conf + logvar) │
│ 3. Post-filter against cloud: drop fused boxes │
│ whose corners are mostly in empty space │
│ │
│ Output: scene_graph_filtered_fused.json │
│ { │
│ "objects": [ │
│ { "label": "chair", "center": [x,y,z], │
│ "size": [w,h,d], "yaw": 1.23, ... }, ... │
│ ], │
│ "coordinate_system": "gravity_aligned_metric", │
│ "up_axis": "Z" │
│ } │
└────────────┬────────────────────────────────────────┘
│
▼
pipeline/ui.py (interactive rerun viewer)
or pipeline/visualize.py (static PLY)
| Step | From | To | Operation |
|---|---|---|---|
| VGGT-SLAM pose-graph homographies | SL(4) projective | SE(3) per-frame | decompose_camera (inside submap.get_all_poses_world) |
| VGGT-SLAM extrinsic → Boxer pose | camera_from_world [3,4] | T_world_camera [4,4] | Invert: R.T, -R.T @ t |
| VGGT-SLAM intrinsic → Boxer camera | Pinhole K [3,3] at VGGT's 294×518 | CameraTW at 960×960 | Scale fx,cx by 960/518, fy,cy by 960/294 |
| VGGT-SLAM points → Boxer SDP | Dense [S,H,W,3] | Semi-dense [N,3] | Confidence-percentile filter + random subsample + NaN pad |
| World-frame alignment | VGGT world (OpenCV Y-down) | Boxer world (Z-down) | Rx(-π/2) on all poses + world points |
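
To make those conversions concrete, here is a minimal numpy sketch of the three non-trivial rows (illustrative only; the real adapter lives in `pipeline/run_boxer.py` and these function names are placeholders):

```python
import numpy as np

def invert_extrinsic(cam_from_world_34: np.ndarray) -> np.ndarray:
    """camera_from_world [3,4] -> T_world_camera [4,4] via R.T, -R.T @ t."""
    R, t = cam_from_world_34[:, :3], cam_from_world_34[:, 3]
    T = np.eye(4)
    T[:3, :3] = R.T
    T[:3, 3] = -R.T @ t
    return T

def rescale_K(K: np.ndarray, src_wh=(518, 294), dst_wh=(960, 960)) -> np.ndarray:
    """Scale a pinhole K from VGGT's processing resolution to Boxer's 960x960 input."""
    sx, sy = dst_wh[0] / src_wh[0], dst_wh[1] / src_wh[1]
    K2 = K.copy()
    K2[0, 0] *= sx; K2[0, 2] *= sx   # fx, cx scale with width
    K2[1, 1] *= sy; K2[1, 2] *= sy   # fy, cy scale with height
    return K2

# Rx(-pi/2): rotate VGGT's Y-down world into Boxer's Z-down convention.
# Applied to every T_world_camera and to all world points: p_boxer = R_ALIGN @ p_vggt.
R_ALIGN = np.array([[1.0, 0.0,  0.0],
                    [0.0, 0.0,  1.0],
                    [0.0, -1.0, 0.0]])
```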
- Non-Euclidean world frame. SL(4) optimization can include a small scale/shear drift relative to a strict metric world. Boxer box sizes may inherit that drift. Plan: add post-hoc Euclidean rectification via plane/gravity priors.
- Gravity is assumed, not estimated. If the iPhone starts tilted, Z-up won't hold exactly. Plan: pull gravity from the iPhone IMU (CMMotionManager) or estimate it via RANSAC floor plane on the fused point cloud (a minimal sketch of that floor-plane fit follows this list).
- CUDA hard-wired upstream. `loop_closure.py` has `device = 'cuda'` at module scope; our wrapper patches it at import time (`_patch_vggt_slam_for_device`). `solver.run_predictions` also calls `torch.cuda.get_device_capability()`, which will need upstream attention for MPS paths.
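
As referenced in the gravity bullet above, the floor-plane fit is ordinary RANSAC on the fused cloud. A minimal sketch of the idea (not the `pipeline/gravity.py` source; thresholds and names here are illustrative):

```python
import numpy as np

def estimate_floor_normal(points: np.ndarray, iters: int = 500,
                          inlier_thresh: float = 0.03, min_inlier_frac: float = 0.15,
                          seed: int = 0):
    """RANSAC a dominant plane (assumed to be the floor) from the fused VGGT cloud.
    Returns the unit plane normal, or None when the fit is too weak, in which case
    the pipeline falls back to the fixed Rx(-pi/2) alignment."""
    rng = np.random.default_rng(seed)
    best_n, best_inliers = None, 0
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-8:
            continue                        # degenerate (collinear) sample
        n /= norm
        inliers = int((np.abs((points - p0) @ n) < inlier_thresh).sum())
        if inliers > best_inliers:
            best_inliers, best_n = inliers, n
    if best_inliers < min_inlier_frac * len(points):
        return None                         # weak fit -> caller uses the fixed fallback rotation
    return best_n                           # R_align rotates this normal onto the gravity axis
```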
Boxer's inference pipeline runs three separate neural nets doing three distinct jobs. Understanding what each does, where its weights live, and how they hand data off makes extern/boxer/run_boxer.py readable and explains why our adapter (pipeline/run_boxer.py) looks the way it does.
| Model | Role | Input | Output | Weights |
|---|---|---|---|---|
| OWLv2 | Open-vocabulary 2D detector | RGB image + text prompts (e.g. `chair`, `table`) | 2D boxes [M, 4] + scores + label indices | extern/boxer/ckpts/owlv2-base-patch16-ensemble.pt (121 MB) |
| DINOv3 | Vision foundation model (frozen backbone) | RGB image | Dense per-patch feature map | extern/boxer/ckpts/dinov3_vits16plus_pretrain_lvd1689m-4057cbaa.pth (115 MB) |
| BoxerNet | Category-agnostic 3D lifter | Image + 2D boxes + pose + K + semi-dense world points | 3D oriented bounding boxes (ObbTW [M, 165]) | extern/boxer/ckpts/boxernet_hw960in4x6d768-wssxpf9p.ckpt (400 MB) |
Key insight: DINOv3 is not a detector — it's a pure image encoder (frozen weights from Meta's self-supervised ViT-S/16+). BoxerNet calls it internally and never trains it. OWLv2 is an independent external model; its output is just data flowing into BoxerNet.
┌─────────────────────────────────────────────┐
Text labels ────────┐ │ OWLv2 (open-vocabulary 2D detection) │
"chair, table, │──►│ • CLIP-style text encoder (run once) │
sofa, ..." │ │ • ViT vision encoder │
│ • Class + box heads │
│ Output: bb2d [M, 4] in (x1,x2,y1,y2) + │
│ scores [M] + label_ints [M] │
└──────────────────┬──────────────────────────┘
│ bb2d, labels
▼
RGB frame [1,3,H,W] ─────┬────────────► ┌─────────────────────────────────┐
Pose T_world_cam ────────┤ │ BoxerNet │
Intrinsics K (CameraTW) ─┤ │ │
Semi-dense world pts ────┤ │ 1. DINOv3 backbone │
│ │ image → [num_patches, dim] │
│ │ (frozen ViT-S/16+) │
│ │ │
│ │ 2. Project world points into │
│ │ DINOv3 feature grid (via K │
│ │ and pose) → per-patch depth │
│ │ │
│ │ 3. For each bb2d, pool patch │
│ │ features inside box → query │
│ │ │
│ │ 4. Cross-attend query against │
│ │ scene tokens (img patches + │
│ │ depth patches) │
│ │ │
│ │ 5. Regression heads: │
│ │ • center in voxel frame │
│ │ • size (w, h, d) │
│ │ • yaw around gravity axis │
│ │ • log-variance (uncertainty) │
│ │ │
│ │ 6. Voxel → world coords using │
│ │ T_world_voxel (gravity- │
│ │ aligned from pose) │
│ │ │
│ │ Output: obbs_pr_w [1, M, 165] │
│ │ (ObbTW: bb3_object, T_world_ │
│ │ object, prob, sem_id, text) │
│ └─────────────────┬────────────────┘
│
┌──────────────────────────────────────┤
│ │
▼ ▼
Write CSV per frame Optional: offline fusion
(utils/file_io.ObbCsvWriter2) (utils/fuse_3d_boxes.py)
or online tracking
(utils/track_3d_boxes.py)
OWLv2 — the "what and roughly where in 2D"
- Open-vocabulary: you change the category list per run without retraining. Each text prompt is encoded once at init via a CLIP text encoder and cached; at inference only image features + cached text embeddings are consumed.
- Output convention gotcha: `OwlWrapper.forward` returns `bb2d` in `(x1, x2, y1, y2)` order — not the standard `(x1, y1, x2, y2)`. BoxerNet expects the same Boxer-internal order. `extern/boxer/run_boxer.py:558` reformats to standard order only when writing out a viewer-friendly CSV.
- Alternatives: Boxer also accepts `--gt2d` (ground-truth boxes from dataset annotations) or `--cache2d` (replay detections from a previous CSV). These bypass OWLv2.
DINOv3 — the "visual features" inside BoxerNet
- Loaded once inside `BoxerNet.__init__` as `self.dino = DinoV3Wrapper("dinov3_vits16plus")`.
- Vision-only; no text, no queries — just image in, feature map out.
- Self-supervised on 1.7 B unlabeled images (`lvd1689m` in the weight filename).
- Frozen weights. BoxerNet training only updates the heads and attention layers on top of DINOv3 features.
- Why DINOv3 over OWL's internal CLIP ViT? DINOv3 features are denser and geometrically richer; OWL's ViT is tuned for language-grounded classification, not spatial reasoning. Using two different ViTs for two different jobs is intentional.
BoxerNet — the "lift 2D to metric 3D"
- Category-agnostic. It doesn't know "chair" from "sofa"; it just gets a 2D box and learns how boxes look in 3D given camera context.
- Two crucial scene inputs fuse the geometry:
  - Camera pose + intrinsics (`T_world_rig0`, `cam0`) → lets BoxerNet place predictions in world coords.
  - Semi-dense world points (`sdp_w`) → a cloud of known 3D points visible in the frame. BoxerNet projects these into the DINOv3 feature grid so each image patch gets a rough depth prior. This is the single biggest signal that makes monocular 3D boxes metrically plausible.
- The output is gravity-aligned: yaw is around the gravity axis; the 6-DoF pose of each box is split into gravity-aligned rotation + 3D center + axis-aligned half-extents in the object frame.
- Outputs aleatoric uncertainty (log-variance) per box, used downstream for fusion weighting.
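
To make step 2 of the diagram concrete, here is a conceptual numpy sketch of turning a semi-dense cloud into a per-patch depth prior (not Boxer's actual code; the grid handling and names are assumptions):

```python
import numpy as np

def per_patch_depth_prior(points_w, T_cam_world, K, img_hw=(960, 960), patch=16):
    """Project world points into the image, bucket them by ViT patch, and take the
    median camera-frame depth per patch. Empty patches stay NaN (no prior)."""
    H, W = img_hw
    p_cam = (T_cam_world[:3, :3] @ points_w.T + T_cam_world[:3, 3:4]).T   # [N,3] camera frame
    p_cam = p_cam[p_cam[:, 2] > 0]                                        # keep points in front
    uv = (K @ p_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                                           # pixel coordinates
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    uv, z = uv[valid], p_cam[valid, 2]
    gh, gw = H // patch, W // patch
    prior = np.full((gh, gw), np.nan)
    ix, iy = (uv[:, 0] // patch).astype(int), (uv[:, 1] // patch).astype(int)
    for gy in range(gh):
        for gx in range(gw):
            sel = (iy == gy) & (ix == gx)
            if sel.any():
                prior[gy, gx] = np.median(z[sel])
    return prior                                                          # [gh, gw] depth prior
```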
Our VGGT-SLAM output supplies exactly the geometric inputs BoxerNet needs:
| BoxerNet input | Source in VGGT-SLAM output |
|---|---|
| `img0` RGB | `VGGTSLAMOutput.image_names[i]` loaded from disk |
| `cam0` CameraTW | Built from `VGGTSLAMOutput.intrinsic[i]` (pinhole K, scaled to 960×960) |
| `T_world_rig0` PoseTW | Invert `VGGTSLAMOutput.extrinsic[i]`, then Rx(-π/2) for Z-down gravity |
| `sdp_w` | `VGGTSLAMOutput.world_points[i]` flattened, rotated, confidence-filtered |
| `bb2d` | Computed on-the-fly by OWLv2 per frame |
| `rotated0` | `False` for iPhone landscape clips |
| `time_ns0` | Derived from frame index |
All three checkpoints land in extern/boxer/ckpts/ via extern/boxer/scripts/download_ckpts.sh. BoxerNet.load_from_checkpoint wires BoxerNet and DINOv3 together; OWLv2 is loaded independently by OwlWrapper.
BoxerNet runs independently on every keyframe — so if the same physical chair is visible in 20 frames, you get 20 slightly-different 3D boxes for it (viewpoint noise, partial occlusion, OWL bbox jitter). Boxer ships two post-processing tools to collapse those redundant predictions:
Offline fusion — utils/fuse_3d_boxes.py. The right tool for static scenes (rooms, furniture, our iPhone clip). Runs after all frames are processed, reads the per-frame CSV, and:
- Builds a 3D IoU graph — boxes are nodes; an edge connects any two whose 3D IoU exceeds `iou_threshold` (default 0.3).
- Extracts connected components — each component is one physical object.
- Filters components with fewer than `min_detections` frames (default 4) to drop one-shot false positives.
- Within each component, computes a confidence- and variance-weighted mean of center, size, and yaw (quaternion slerp for orientation).
- Writes one fused row per cluster.
Net effect: 20 noisy chair detections → 1 crisper chair box with tighter estimates.
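
A minimal sketch of that fusion logic, assuming an `iou3d(a, b)` helper exists (the real `utils/fuse_3d_boxes.py` additionally weights by log-variance and slerps quaternions; this only shows the graph clustering and a confidence-weighted mean):

```python
import numpy as np
import networkx as nx

def fuse_detections(boxes, iou3d, iou_threshold=0.3, min_detections=4):
    """boxes: list of dicts with 'center', 'size', 'yaw', 'conf'. iou3d(a, b) -> float.
    Cluster by 3D-IoU overlap, then take a confidence-weighted mean per cluster."""
    g = nx.Graph()
    g.add_nodes_from(range(len(boxes)))
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou3d(boxes[i], boxes[j]) > iou_threshold:
                g.add_edge(i, j)
    fused = []
    for comp in nx.connected_components(g):
        comp = list(comp)
        if len(comp) < min_detections:
            continue                                   # one-shot false positive
        w = np.array([boxes[i]["conf"] for i in comp])
        w = w / w.sum()
        fused.append({
            "center": sum(wi * np.asarray(boxes[i]["center"]) for wi, i in zip(w, comp)),
            "size":   sum(wi * np.asarray(boxes[i]["size"]) for wi, i in zip(w, comp)),
            # circular weighted mean for yaw to avoid wrap-around artifacts
            "yaw": float(np.arctan2(sum(wi * np.sin(boxes[i]["yaw"]) for wi, i in zip(w, comp)),
                                    sum(wi * np.cos(boxes[i]["yaw"]) for wi, i in zip(w, comp)))),
            "num_detections": len(comp),
        })
    return fused
```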
Online tracking — utils/track_3d_boxes.py. The right tool for videos with moving objects (robot navigation). Processes frames in temporal order, maintaining a set of active Kalman-filtered tracks. Each new 3D box is matched to an existing track by 3D IoU; unmatched boxes start new tracks; tracks that disappear for too long are terminated. Think SORT extended to 3D.
Why can't fusion do tracking's job? Offline fusion ignores time — it clusters boxes by spatial overlap regardless of when they were seen. That's wrong for moving objects (a chair pushed across the room would look like two separate clusters). Conversely, tracking ignores non-active tracks — if a chair leaves the frame and reappears 200 frames later, online tracking treats it as two different chairs.
We default to fusion for the iPhone room-scan pipeline; --track is planned for the ROS2 robot path.
TL;DR. VGGT is trained to regress K from RGB along with depth and pose — this lets it run on any camera without calibration. But for iPhone footage it systematically overestimates fx/fy by ~26%, which propagates into monocular-depth scale errors and 3D-box drift.
The data point. Our iPhone Air test clip (test/IMG_6826.MOV):
ffprobe/exiftool metadata from the .MOV:
Lens model : iPhone Air back camera 5.96mm f/1.6
Focal length (35mm equiv): 26 mm
Clean aperture : 1920×1080
This gives a real intrinsic matrix at native resolution:
| | fx (px) | fy (px) | cx | cy | HFOV |
|---|---|---|---|---|---|
| iPhone EXIF (ground truth) | 1386 | 1386 | 960 | 540 | 69.4° |
| VGGT predicted (scaled from 518×294 to 1920×1080) | 1750 | 1735 | 960 | 540 | 57.4° |
VGGT's fx is +26% larger than reality. Equivalent statement: VGGT thinks the iPhone sees a 57° HFOV when the real lens is 69° (wide-angle). This is a training-distribution mismatch — VGGT's training mix skews toward narrower indoor camera rigs (ScanNet, CO3D, Habitat renderings) than a 26mm-equiv smartphone wide-angle.
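
The EXIF row in that table follows from the standard 35mm-equivalent pinhole relation (full-frame width 36 mm). A small sketch of the arithmetic behind the 1386 px and 69.4° numbers (the actual parser is `pipeline/iphone_k.py`; this only reproduces the math):

```python
import math

def k_from_35mm_equiv(focal_35mm_equiv_mm: float, width_px: int, height_px: int):
    """Pinhole K from a 35mm-equivalent focal length (full-frame width = 36 mm),
    so fx = width_px * f_equiv / 36, assuming square pixels and a centered principal point."""
    fx = width_px * focal_35mm_equiv_mm / 36.0
    fy = fx
    cx, cy = width_px / 2.0, height_px / 2.0
    hfov_deg = math.degrees(2 * math.atan(width_px / (2 * fx)))
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]], hfov_deg

K, hfov = k_from_35mm_equiv(26.0, 1920, 1080)
# fx ≈ 1386.7 px, HFOV ≈ 69.4°, matching the "iPhone EXIF (ground truth)" row above
```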
Why it matters for 3D boxes. BoxerNet's monocular 3D lifting fundamentally relies on the image-to-ray mapping defined by K. An overestimated fx means BoxerNet thinks a given 2D-bbox subtends a smaller angular footprint than it really does → compensates by pushing the box farther away along its ray → the 3D box drifts behind the actual object. The error is systematic and multiplicative: at 1.26× K overestimate, a box at true 1m depth gets placed at roughly 1.26m. That matches the remaining ~20-30 cm residual drift we observe after fixing K anisotropy and SDP confidence.
Multi-view reasoning doesn't automatically fix this. You might expect VGGT-SLAM's SL(4) bundle-ish adjustment or the global-fused SDP to "average out" per-frame K errors. But the error is systematic — every frame uses the same overestimated K, so every frame's reconstruction is uniformly stretched and the submaps align consistently at the wrong scale. No amount of averaging corrects a population-wide bias.
The fix we're prototyping. pipeline.iphone_k.extract_K_from_mov(<path>.MOV) parses MOV EXIF (lens model, 35mm equivalent) and constructs the real pinhole K at the native image resolution. run_boxer.py --override-k-from-mov substitutes this into Boxer's datum (keeping VGGT's depth + pose, but consistently rescaling VGGT's world points so they re-project correctly through the corrected K). This is a cheap, honest experiment: if boxes tighten up, known-K is necessary and sufficient; if not, the remaining drift is VGGT-depth-metric-scale which needs a depth-only replacement (e.g., Apple Depth Pro, which accepts K as input).
Why we didn't just do this from the start. VGGT-SLAM's upstream assumes full end-to-end RGB → (K, depth, pose). Pulling K out of EXIF breaks that symmetry for only one camera family. It's the right thing for our iPhone path, but not a general replacement — the override is gated behind a CLI flag.
We have no ground-truth 3D annotations for an iPhone scan, so to judge box quality we compute three metrics per OBB that each catch a different failure mode (implemented in pipeline/diagnostic.py):
For each 3D box, distance from its center to the nearest point in the VGGT-SLAM fused cloud.
dist = min_{p ∈ scene_cloud} ‖ box_center − p ‖
- Proxy for "is the box near any real surface?" Correct boxes land inside or very close to the actual room geometry, because real objects have surface points captured by VGGT.
- Cheap: a brute-force nearest-neighbor search over the 50-80k-point cloud for our ~1k detections.
- Catches: boxes floating in empty space (screenshot-visible outliers — cyan box far from room).
- Misses: boxes at roughly-right angular direction but wrong depth (can still be close to some cloud point).
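
A minimal sketch of metric A using a SciPy KD-tree for the nearest-neighbor query (the concrete `pipeline/diagnostic.py` implementation may differ):

```python
import numpy as np
from scipy.spatial import cKDTree

def dist_to_cloud(box_centers: np.ndarray, scene_cloud: np.ndarray) -> np.ndarray:
    """For each box center [M,3], distance to the nearest point in the fused cloud [N,3]."""
    tree = cKDTree(scene_cloud)
    dists, _ = tree.query(box_centers, k=1)
    return dists          # flag a box when dists[i] exceeds the max-dist-to-cloud threshold
```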
Project the 3D box's 8 corners into the native image using the same K and pose BoxerNet saw. Take the axis-aligned 2D bounding box of the projected corners. Compute IoU with the OWL 2D bbox that prompted BoxerNet.
corners_cam = T_cw · corners_world # 8 × 3
uv = project(corners_cam, K) # 8 × 2
projected_bb = [uv.min(0), uv.max(0)] flattened # axis-aligned xyxy
iou2d = IoU(projected_bb, owl_prompt_bbox) # scalar in [0, 1]
- Purely geometric, no ground truth needed. Tests the self-consistency of (K, pose, 3D position).
- Catches: K overestimation, pose errors, or box placed at wrong 3D direction — all of which make the reprojection miss the 2D bbox region.
- Misses: a box at the right angular direction but scaled too large/small along the ray (the 2D footprint still overlaps well).
- Requires recovering the OWL-bbox ↔ BoxerNet-output correspondence that's lost after `thresh_3d` filtering. We reconstruct it by nearest-projected-center within the same label class.
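
Filling in the `project` and `IoU` steps of the pseudocode above, a self-contained sketch of metric B (again illustrative, not the `pipeline/diagnostic.py` source):

```python
import numpy as np

def reprojection_iou(corners_world, T_cam_world, K, owl_bbox_xyxy):
    """Project the OBB's 8 world-frame corners with the same K/pose BoxerNet saw,
    take the axis-aligned bbox of the projections, and IoU it against the OWL 2D box."""
    pc = (T_cam_world[:3, :3] @ corners_world.T + T_cam_world[:3, 3:4]).T   # [8,3] camera frame
    uv = (K @ pc.T).T
    uv = uv[:, :2] / uv[:, 2:3]                                             # [8,2] pixel coords
    px1, py1 = uv.min(axis=0)
    px2, py2 = uv.max(axis=0)
    ox1, oy1, ox2, oy2 = owl_bbox_xyxy
    iw = max(0.0, min(px2, ox2) - max(px1, ox1))
    ih = max(0.0, min(py2, oy2) - max(py1, oy1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (ox2 - ox1) * (oy2 - oy1) - inter
    return inter / union if union > 0 else 0.0
```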
Project VGGT's semi-dense world points into the native image. Keep those landing inside the OWL 2D bbox. Compute their camera-frame Z distribution (q10/q50/q90). Compare with the camera-frame Z range spanned by the box's 8 corners.
sdp_z_in_bbox = [z for p ∈ VGGT_points
if owl_bbox contains project(p, K, pose)]
depth_gap = max(0, min_box_z − q90(sdp_z_in_bbox), # box in front of cloud
q10(sdp_z_in_bbox) − max_box_z) # box behind cloud
- Zero if the box's depth interval overlaps with where VGGT sees the object's surface. Positive = gap in meters.
- Catches exactly the "angular direction right, distance wrong" failure mode that A and B can both miss. This is the specific failure expected from systematic K bias or SL(4) scale drift, so it's the most sensitive metric for our current hypothesis.
- Requires enough VGGT points inside the bbox (default ≥30 points after 70th-percentile confidence filter) to be meaningful.
We flag a box as bad if any metric exceeds its threshold, e.g.:
dist_to_cloud > 0.8 m OR iou2d < 0.05 OR depth_gap > 0.5 m
pipeline.diagnostic --filter writes a filtered CSV that pipeline.fuse can then consume to produce a cleaner scene graph. The thresholds above are defaults; tune via CLI flags.
None of these metrics verify absolute metric scale (e.g., is the box actually 47 cm wide in the real world?). For that we'd need either a ruler/checkerboard in the scene or an Apple Depth Pro-style metric depth reference. The current metrics only verify self-consistency between BoxerNet's output and the (K, pose, SDP) geometry we fed it.
box/
├── README.md
├── pyproject.toml
├── setup.sh # One-command install (chains VGGT-SLAM's setup.sh)
│
├── extern/
│ ├── vggt_slam/ # submodule → MIT-SPARK/VGGT-SLAM 2.0
│ └── boxer/ # submodule → facebookresearch/boxer
│ └── ckpts/ # boxernet + dinov3 + owlv2 weights
│
├── pipeline/
│ ├── __init__.py
│ ├── _common.py # Shared constants: R_ALIGN_FALLBACK, detect_device
│ ├── config.py # PipelineConfig dataclass
│ ├── extract_frames.py # ffmpeg: video → frames
│ ├── run_vggt_slam.py # VGGT-SLAM 2.0 inference wrapper
│ ├── gravity.py # Floor-plane RANSAC → Z-down world alignment
│ ├── iphone_k.py # Parse iPhone MOV EXIF → real camera K
│ ├── save_owl_bbs.py # OWLv2 only, dumps per-frame 2D bboxes JSON
│ ├── run_boxer.py # Boxer (OWLv2 + BoxerNet) inference wrapper
│ ├── diagnostic.py # Three geometric quality metrics + filter
│ ├── fuse.py # Offline 3D-box fusion + corner-check post-filter
│ ├── export.py # 3D boxes → scene_graph.json
│ ├── visualize.py # VGGT-SLAM output → fused PLY (for MeshLab)
│ ├── ui.py # Rerun-based live viewer (2D + 3D synced)
│ └── run.py # End-to-end orchestrator
│
└── test/ # Local iPhone footage (gitignored — never committed)
- Python 3.11 (preferred) or 3.12 — VGGT-SLAM's upstream pins 3.11
- Mac Apple Silicon or NVIDIA GPU (Mac mini works for short clips; CUDA is still fastest)
- ffmpeg installed (`brew install ffmpeg`)
- ~15 GB disk for model weights (VGGT ~5 GB, DINO+SALAD ~1 GB, Boxer ~640 MB, optional PE/SAM3 larger)
- 16 GB+ unified memory (Mac) or 12 GB+ VRAM (CUDA)
# Clone with submodules
git clone --recursive https://github.com/zzhang001/box.git
cd box
# One-command setup (creates venv, chains into extern/vggt_slam/setup.sh,
# then fetches Boxer checkpoints)
./setup.sh
(cd extern/boxer && bash scripts/download_ckpts.sh)

`setup.sh` delegates to VGGT-SLAM's own `setup.sh` for the SLAM stack (salad, MIT-SPARK VGGT fork, gtsam, perception-encoder, SAM3 installed into `extern/vggt_slam/third_party/`). Boxer's checkpoints (BoxerNet, DINOv3, OWLv2) are a separate download.
# Basic: 12 fps extraction, auto device, default Boxer labels
python -m pipeline.run --video test/IMG_6826.MOV --output output/
# Custom OWLv2 labels for a living-room scan
python -m pipeline.run \
--video test/IMG_6826.MOV \
--output output/ \
  --fps 12 \
  --submap-size 16 \
  --labels "chair,table,sofa,lamp,plant" \
  --device cpu
# Flags: --fps 12 extracts 12 frames/sec, --submap-size 16 sets the VGGT submap chunk,
# --labels sets the OWLv2 prompts, --device cpu skips CUDA on a Mac mini.

The one-liner `pipeline.run` does extraction → SLAM → Boxer → export, but for iPhone clips the best results come from running the full filter + fusion chain explicitly:
# 1. Extract frames (ffmpeg, 12 fps).
python -m pipeline.extract_frames --video input.mov --output output/frames/ --fps 12
# 2. VGGT-SLAM (slow; ~30-60 min on Mac mini CPU for a 30-sec clip).
python -m pipeline.run_vggt_slam --frames output/frames/ --output output/vggt_slam/
# 3. OWLv2 pass for the UI's 2D-bbox overlay (Boxer runs OWL internally,
# but it doesn't persist the boxes — this saves them for replay).
python -m pipeline.save_owl_bbs \
--vggt-slam output/vggt_slam/ \
--output output/boxer/owl_2dbbs.json \
--labels "chair,table,sofa,monitor,laptop,keyboard,lamp,plant,bookshelf,cabinet,fridge,oven,sink,microwave"
# 4. Boxer with iPhone K override + global SDP (best config on iPhone data).
python -m pipeline.run_boxer \
--vggt-slam output/vggt_slam/ \
--output output/boxer/ \
--labels "chair,table,sofa,monitor,laptop,keyboard,lamp,plant,bookshelf,cabinet,fridge,oven,sink,microwave" \
--sdp-mode global \
--override-k-from-mov input.mov
# 5. Diagnostic filter (drops boxes failing any of dist-to-cloud / IoU / depth-gap).
python -m pipeline.diagnostic \
--vggt-slam output/vggt_slam/ \
--boxer-csv output/boxer/boxer_3dbbs.csv \
--owl-json output/boxer/owl_2dbbs.json \
--out output/boxer/diagnostic \
--override-k-from-mov input.mov \
--filter --max-dist-to-cloud 0.5 --min-iou2d 0.3 --max-depth-gap 0.3
# 6. Fuse per-frame detections into canonical objects + drop off-cloud phantoms.
python -m pipeline.fuse \
--input output/boxer/boxer_3dbbs_filtered.csv \
--output-csv output/boxer/boxer_3dbbs_filtered_fused.csv \
--scene-graph output/boxer/scene_graph_filtered_fused.json \
--post-filter-vggt-slam output/vggt_slam/
# 7. Interactive viewer (opens a Rerun window).
python -m pipeline.ui \
--vggt-slam output/vggt_slam/ \
--boxer-csv output/boxer/boxer_3dbbs_filtered.csv \
--owl-json output/boxer/owl_2dbbs.json \
--scene-graph output/boxer/scene_graph_filtered_fused.json \
  --save-rrd output/box_ui.rrd && rerun output/box_ui.rrd

# For MeshLab / CloudCompare / 3dviewer.net
python -m pipeline.visualize --input output/vggt_slam/ --output output/vggt_slam/fused.ply
# Overlay the fused 3D boxes too (colored wireframes)
python -m pipeline.visualize \
--input output/vggt_slam/ \
--scene-graph output/boxer/scene_graph_filtered_fused.json \
  --output output/vggt_slam/fused_with_boxes.ply

- Record: walk slowly through a room, keep the phone roughly upright, avoid fast rotations.
- Transfer: AirDrop or USB to your Mac. Drop the `.MOV` into `test/` (gitignored).
- Run: `python -m pipeline.run --video test/my_room.MOV --output output/`
- Inspect: `output/scene_graph.json` for the 3D object list, or open `output/vggt_slam/fused.ply` in MeshLab.
What makes a clip work well:
- Keep FPS ≥ 10 when extracting. Below that, VGGT-SLAM's optical-flow keyframe selector loses temporal baseline. 12 fps is our default.
- Steady movement. Slow, smooth camera motion gives VGGT better depth estimates.
- Overlap. Consecutive keyframes need >60% visual overlap; the keyframe selector drops frames whose optical-flow disparity is below `--min-disparity` (default 50).
- Lighting. Well-lit scenes produce better 2D detections.
- Submap size. 16 frames per submap balances VGGT memory vs. pose-graph granularity. Drop to 8-12 if you OOM on MPS.
- Loop closures. `--max-loops 1` (default) is fine for room-scale scans. Longer walks with revisits are handled automatically by the DINOv2+SALAD image-retrieval backbone.
| File | Size | Location | Fetched by |
|---|---|---|---|
| VGGT-1B | ~5 GB | ~/.cache/torch/hub/checkpoints/model.pt | pipeline.run_vggt_slam at first inference |
| DINOv2 ViT-B/14 | ~330 MB | same cache dir | same |
| dino_salad.ckpt | ~335 MB | same cache dir | setup.sh |
| BoxerNet + DINOv3 + OWLv2 | ~640 MB total | extern/boxer/ckpts/ | extern/boxer/scripts/download_ckpts.sh |
Subsequent runs reuse the caches.
{
"metadata": {
"source_video": "living_room.mov",
"num_frames": 223,
"pipeline_version": "0.1.0",
"timestamp": "2026-04-17T08:00:00Z"
},
"coordinate_system": {
"type": "gravity_aligned_metric",
"up_axis": "Z",
"units": "meters"
},
"objects": [
{
"id": 0,
"label": "chair",
"center": [1.23, -0.45, 0.42],
"size": [0.55, 0.55, 0.88],
"yaw": 1.57,
"confidence": 0.92,
"uncertainty": 0.05,
"frame_idx": 17,
"time_ns": 17000000
}
]
}

| Platform | Device | VGGT-SLAM | Boxer | Notes |
|---|---|---|---|---|
| macOS 15 | Mac mini (Apple Silicon) | CPU fp32 | CPU / MPS | Dev target; full 30-sec clip in ~2 hrs including ~16 VGGT submaps |
| macOS 15 | M3 Max 36 GB | MPS / CPU | MPS | Headroom for longer footage |
| Linux | RTX 4090 | CUDA | CUDA | Fastest; matches upstream benchmarks |
We keep VGGT in fp32 on CPU (not bf16) because Apple Silicon NEON lacks hardware bf16 GEMM, and PyTorch CPU bf16 falls back to scalar code that's ~50× slower than fp32 (we verified this: same 5-frame submap took >13 min in bf16 vs 17 s in fp32). CUDA/MPS keep bf16.
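
A quick, hypothetical micro-benchmark to reproduce the fp32-vs-bf16 CPU gap on your own machine (not part of the pipeline):

```python
import time
import torch

def bench(dtype: torch.dtype, n: int = 2048, reps: int = 10) -> float:
    """Average seconds per n x n matmul on CPU for the given dtype."""
    a = torch.randn(n, n, dtype=dtype)
    b = torch.randn(n, n, dtype=dtype)
    t0 = time.perf_counter()
    for _ in range(reps):
        _ = a @ b
    return (time.perf_counter() - t0) / reps

print("fp32:", bench(torch.float32))    # dispatches to an optimized GEMM
print("bf16:", bench(torch.bfloat16))   # on CPU this may hit a much slower fallback path
```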
| Approach | LiDAR? | Global Consistency | Platform | Quality |
|---|---|---|---|---|
| This pipeline (VGGT-SLAM → Boxer) | No | Yes (SL(4) + loop closure) | Mac/Linux | High |
| ROS bag + Cartographer → Boxer | Depends | Yes | Robot rig | Production; needs bags |
| Boxer3D (iOS app) | Yes | Yes | iPhone Pro only | Highest |
| COLMAP + Boxer | No | Yes | Mac/Linux | High, but slow SfM |
Done (v0.4-post-filter):
- VGGT-SLAM 2.0 wrapper on Mac mini (CPU fp32; 471 frames → 223 keyframes, 16 submaps, 2 loop closures)
- OWLv2 + BoxerNet wired in via `pipeline.run_boxer` with pad-to-square K fix
- Floor-plane RANSAC gravity alignment (`pipeline.gravity`)
- iPhone K override from MOV EXIF (`pipeline.iphone_k`) — 26% VGGT K error corrected
- `sdp_mode=global` to sidestep per-submap SL(4) scale drift
- Three geometric diagnostic metrics + threshold filter (`pipeline.diagnostic`)
- Offline 3D-box fusion + corner-check post-filter (`pipeline.fuse`)
- Rerun-based interactive viewer (`pipeline.ui`)
- End-to-end validation on `test/IMG_6826.MOV`: 33 fused objects with plausible sizes + placements
Next:
- Try Apple Depth Pro (known K input + metric depth) to replace VGGT depth for Boxer's SDP, compare against v0.4 baseline
- Post-hoc Euclidean rectification of SL(4)-warped world
- Fix VGGT-SLAM MPS device mismatch for Apple Silicon acceleration (currently CPU fp32 only)
- Streaming / online mode
- ROS2 bridge
This pipeline code is MIT licensed. Submodules have their own licenses:
- VGGT-SLAM: see extern/vggt_slam/LICENSE
- Boxer: CC-BY-NC (see extern/boxer/LICENSE)