A Web-Scale 4D Hand-Object Interaction Data Engine for Any-View Robot Retargeting and Video-to-Action Robot Learning
Authors: Gaotian Wang¹, Kejia Ren¹, Andrew Morgan², Yiting Chen¹, Howard H. Qian¹, Podshara Chanrungmaneekul¹, Kaiyu Hang¹
¹ Rice University · ² Robotics and AI Institute
Metric 3D hand-object tracking from a single static-camera RGB video — no depth sensor, no calibration required. Outputs per-frame 21-joint hand poses, MANO meshes, per-object 6DoF pose sequences with reconstructed 3D mesh, and an interactive Viser 3D viewer.
| Stage | What | Backbone |
|---|---|---|
| A | Metric depth + focal length | MoGe-2 (ViT-L) |
| A-grav | World up direction | GeoCalib |
| B | Hand detection + MANO recon | YOLO + WiLoR (DINOv2-L) |
| C | Optical flow, depth stabilization, alignment, smoothing, joint clamps | MEMFOF, SavGol, biomech swing-twist |
| C+ | Missing-frame motion infiller | HaWoR Transformer |
| D-sam3 | Text-prompted object detection + tracking | SAM3.1 + SAM2 |
| D-sam3d | Single-image 3D object reconstruction (Gaussian splat PLY) | SAM 3D Objects |
| D-track | 6DoF mesh pose tracking (FGR + ICP anchor → flow + RANSAC PnP propagation → LBFGS smoothing) | Open3D + MEMFOF + custom |
| Doc | Covers |
|---|---|
| docs/PIPELINE.md | Pipeline architecture deep-dive — every phase (A → D-track), the canonical post-tracking sequence, PKL v2 format, data layout, hardware budget |
| docs/ARCHITECTURE.md | The egoinfinity/ orchestration layer — stage / registry / runner design, resume + clip×stage selectors, and swapping Phase-1 backbones (depth / gravity / hand) |
| docs/MANIFEST.md | manifest.json field reference — the per-clip input (video_uri, objects, …) |
| docs/THIRD_PARTY.md | Third-party deps, weights & licenses + Action100M dataset acquisition |
| docs/MULTI_HOST.md | Running across machines — e.g. offload SAM 3D Objects to a bigger GPU when a 16 GB card can't fit it |
# 1. Clone
git clone https://github.com/Rice-RobotPI-Lab/EgoInfinity.git
cd EgoInfinity
# 2. Conda env (single env for pipeline + retarget)
conda create -n egoinfinity python=3.10 -y
conda activate egoinfinity
# 3. PyTorch (match your CUDA — cu124 default)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
# 4. Repo + deps (editable install)
pip install -e . # pulls everything in pyproject.toml
pip install git+https://github.com/microsoft/MoGe.git
pip install git+https://github.com/cvg/GeoCalib.git
# 4b. Retarget deps (same env; skip if you don't need robot retargeting)
pip install mujoco==3.6.0 pytorch-kinematics==0.9.1 roma "PyOpenGL>=3.1.10"
# 5. Fetch WiLoR source (cloned + patched; CC-BY-NC-ND, not redistributed)
# + download pretrained weights (WiLoR detector + ckpt, SAM2, infiller)
bash scripts/setup_weights.sh
# 6. MANO model — manual download required (see "MANO model" below)
# 7. Environment setup
cp activate.sh.example ~/egoinfinity_env.sh # then edit paths
source ~/egoinfinity_env.sh
# 8. Run the pipeline on a clip directory (frames/ + manifest.json)
python -m egoinfinity process /path/to/clip_dir/
# add --robot g1 to also retarget to a robot (see "Retarget" below)

A clip directory needs `frames/` (extracted RGB) + `manifest.json` (`{"video_uri": ..., "objects": ["red mug", ...]}`). See docs/MANIFEST.md for the field reference and docs/PIPELINE.md for the full architecture.
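As a rough illustration of the clip layout described above, the following sketch writes a minimal `manifest.json` and checks that a directory looks usable before it is handed to the pipeline. The helper names `write_manifest` and `check_clip_dir` are hypothetical and not part of the egoinfinity package. Treating both `video_uri` and `objects` as required is an assumption; docs/MANIFEST.md is the authoritative field reference.

```python
import json
from pathlib import Path


def write_manifest(clip_dir, video_uri, objects):
    """Write a minimal manifest.json into clip_dir and return its path.

    Only the two fields shown in this README are written.
    """
    manifest = {"video_uri": video_uri, "objects": list(objects)}
    path = Path(clip_dir) / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path


def check_clip_dir(clip_dir):
    """Return a list of problems; an empty list means the layout looks usable."""
    clip_dir = Path(clip_dir)
    problems = []

    frames = clip_dir / "frames"
    if not frames.is_dir() or not any(frames.iterdir()):
        problems.append("frames/ is missing or empty")

    manifest_path = clip_dir / "manifest.json"
    if not manifest_path.is_file():
        problems.append("manifest.json is missing")
        return problems

    manifest = json.loads(manifest_path.read_text())
    # Assumption: both keys are expected to be present.
    for key in ("video_uri", "objects"):
        if key not in manifest:
            problems.append(f"manifest.json lacks '{key}'")
    return problems
```

Running `check_clip_dir` first is cheaper than discovering a missing `frames/` directory after the models have loaded.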
System dep: WiLoR's renderer uses `pyrender` with the EGL backend. On Debian/Ubuntu install `libegl1 libgles2 libglvnd0` once. Headless servers don't need an X display.
After running the pipeline once on a clip, you can resume / re-run any stage of the post-tracking sequence (anchor, depth smooth, FP++ bake, spurious detection, etc.) via the unified pipeline entry — no separate "refresh" CLI needed.
# Default: resume the pipeline. Skips stages already marked done in
# state.json, runs whatever's pending.
python -m egoinfinity process /path/to/artifacts/CLIP_ID/
# Force a specific stage and everything downstream.
python -m egoinfinity process /path/to/artifacts/CLIP_ID/ \
--force pose_track_p1 --cascade
# Run one or several stages explicitly (force-run; comma-separated).
python -m egoinfinity run bake_fp /path/to/artifacts/CLIP_ID/
python -m egoinfinity run grasp_veto,pose_track_p2 /path/to/artifacts/CLIP_ID/
# Inspect status (which stages are done, which are stale).
python -m egoinfinity status /path/to/artifacts/CLIP_ID/

`run` force-runs the named stage(s). A stage's upstream must already be done, otherwise the command errors (pass `--with-deps` to auto-run the missing upstream first).
The 8-stage canonical post-tracking sequence (see docs/PIPELINE.md §4.4):
pose_track_p1 → grasp_veto → pose_track_p2 → depth_align → depth_smooth → bake_fp → scale_sanity → spurious.
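The resume and `--force ... --cascade` behaviour over this sequence can be pictured with a small planning function. This is a simplified model, not the runner's implementation: it assumes a strictly linear chain, takes the set of done stages as a plain Python set rather than reading `state.json`, ignores staleness, and assumes `--force` without `--cascade` re-runs only the named stage. The function name `plan` is hypothetical.

```python
# Canonical post-tracking order, as listed in this README.
POST_TRACKING = [
    "pose_track_p1", "grasp_veto", "pose_track_p2", "depth_align",
    "depth_smooth", "bake_fp", "scale_sanity", "spurious",
]


def plan(done, force=None, cascade=False):
    """Return the stages to run, in canonical order.

    done    -- set of stage names already marked done
    force   -- stage to re-run even if it is done (models --force)
    cascade -- also re-run everything downstream of `force` (models --cascade)
    """
    if force is not None and force not in POST_TRACKING:
        raise ValueError(f"unknown stage: {force}")

    to_run = []
    for i, stage in enumerate(POST_TRACKING):
        forced = force is not None and (
            stage == force or (cascade and i > POST_TRACKING.index(force))
        )
        # Pending stages always run; done stages run only when forced.
        if forced or stage not in done:
            to_run.append(stage)
    return to_run
```

For example, with every stage done, `plan(set(POST_TRACKING), force="bake_fp", cascade=True)` yields `bake_fp`, `scale_sanity` and `spurious`, which is the "force a stage and everything downstream" case.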
You can also re-run a phase1-internal substep on an existing pkl (phase1
is monolithic, so these are the only fine-grained re-runs for Phase B/C/D-sam3d):
refresh_sam3d, refresh_hands, refresh_hand_scale, refresh_optical_flow.
E.g. cross-host SAM3D fill: egoinfinity run refresh_sam3d CLIP/ (see docs/MULTI_HOST.md).
Both process and run accept several clip dirs, or a single batch root
whose subdirs are clip dirs (filter with --only). Per-clip failures are
isolated (a summary is printed; --fail-fast stops at the first):
python -m egoinfinity process clipA/ clipB/ # several clips
python -m egoinfinity process favorites/ --only A,B # batch root, subset
python -m egoinfinity run scale_sanity favorites/ # one stage, all clips

The individual algorithm scripts (egoinfinity/pipeline/post_tracking/pose_tracking.py,
etc.) and tools/rerun/refresh_*.py still run standalone for back-compat,
but the pipeline runner is the recommended entry.
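The batch behaviour described above (a batch root filtered with `--only`, per-clip failure isolation, and `--fail-fast`) follows a common pattern that can be sketched as below. This is an illustration only: `find_clips` and `run_batch` are hypothetical names, the rule that a subdirectory counts as a clip when it contains `manifest.json` is an assumption, and `process_clip` stands in for whatever actually runs the pipeline on one clip.

```python
from pathlib import Path


def find_clips(root, only=None):
    """List clip dirs under a batch root, optionally filtered by name.

    Assumption: a subdirectory is a clip if it contains manifest.json.
    """
    clips = sorted(
        p for p in Path(root).iterdir()
        if p.is_dir() and (p / "manifest.json").is_file()
    )
    if only is not None:
        wanted = set(only)
        clips = [p for p in clips if p.name in wanted]
    return clips


def run_batch(clips, process_clip, fail_fast=False):
    """Run process_clip on each clip, collecting failures instead of aborting."""
    ok, failed = [], {}
    for clip in clips:
        try:
            process_clip(clip)
            ok.append(clip)
        except Exception as exc:  # isolate per-clip failures
            failed[clip] = exc
            if fail_fast:
                break  # stop at the first failure
    return ok, failed
```

Returning both lists lets the caller print a summary at the end, so one bad clip does not hide the results of the others.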
MANO is licensed for non-commercial research use (MANO license) — we cannot redistribute the weights. Follow these steps once:
- Register at https://mano.is.tue.mpg.de (free, requires email + agreement).
- After approval, download "Models & Code" → `mano_v1_2.zip`.
- Extract `models/MANO_RIGHT.pkl` to `third_party/wilor/mano_data/`:

      unzip mano_v1_2.zip
      cp mano_v1_2/models/MANO_RIGHT.pkl third_party/wilor/mano_data/
Without this file Phase B (hand reconstruction) cannot run. This is the same convention as HaWoR and WiLoR upstream.
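A quick pre-flight check can confirm that the file landed in the expected place before a long run starts. The sketch below is a convenience illustration, not something the repo ships; `mano_status` is a hypothetical name, and the path is the one given in the steps above.

```python
from pathlib import Path

# Location named in the MANO steps above, relative to the repo root.
MANO_RELATIVE = Path("third_party/wilor/mano_data/MANO_RIGHT.pkl")


def mano_status(repo_root):
    """Return (ok, message) describing whether MANO_RIGHT.pkl is in place."""
    path = Path(repo_root) / MANO_RELATIVE
    if path.is_file() and path.stat().st_size > 0:
        return True, f"found {path}"
    return False, (
        f"missing {path}: register at https://mano.is.tue.mpg.de, "
        "download mano_v1_2.zip, and copy models/MANO_RIGHT.pkl there"
    )
```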
SAM3.1 lives in a separate conda env (different Python / PyTorch / CUDA versions). Weights are gated; request access first at https://huggingface.co/facebook/sam3.1.
# Sibling repo (clone next to EgoInfinity/)
cd ..
git clone https://github.com/facebookresearch/sam3.git
cd EgoInfinity
# Create the dedicated env
conda create -n sam3 python=3.12 -y
SAM3_PIP=$(conda info --base)/envs/sam3/bin/pip
$SAM3_PIP install torch==2.10.0 torchvision --index-url https://download.pytorch.org/whl/cu128
$SAM3_PIP install -e ../sam3
$SAM3_PIP install 'setuptools<80' # pkg_resources compat
# Authenticate with HuggingFace once (after access is granted)
$(conda info --base)/envs/sam3/bin/hf auth login

The pipeline auto-detects SAM3 at ../sam3/. Override via env vars:
export SAM3_PYTHON=/path/to/sam3/env/bin/python
export SAM3_REPO=/path/to/sam3-repo

cd ..
git clone https://github.com/facebookresearch/sam-3d-objects.git
# follow that repo's installation guide (separate sam3d-objects conda env)
cd EgoInfinity
export SAM3D_PYTHON=/path/to/sam3d-objects/env/bin/python
export SAM3D_REPO=/path/to/sam-3d-objects

`sam3d_worker.py` is spawned automatically by the pipeline when the env vars are set.
SAM3 object prompts come from the `objects` list in `manifest.json` (one short noun phrase per target). See docs/MANIFEST.md.
To screen many videos for "static-camera + visible hands" candidates before running the pipeline, use the visual filter over an Action100M sqlite index:
python -m egoinfinity filter --from-db data/action100m_index.db \
--test_n_videos 500 --viz :8899

It runs YOLO hand detection + static-background + optical-flow checks and
serves an optional review UI. (Filtering an arbitrary --input-list of
video URLs is planned but not yet wired.) See action100m_filter/ and
docs/THIRD_PARTY.md for the dataset acquisition flow.
After tracking, retarget the recovered hand motion to a bilateral robot:
python -m egoinfinity process /path/to/clip_dir/ --robot g1
# robots: g1 | franka | robonaut2 | xlerobot
# output: <clip>/retarget/<robot>/{trajectory.npz, robot_sim.mp4,
# input_viz.mp4, metrics.npz}

This runs in the same egoinfinity env (needs the step-4b retarget
deps: mujoco 3.6 + pytorch-kinematics + roma + PyOpenGL≥3.1.10). Headless
rendering uses EGL automatically (MUJOCO_GL=egl). Pretrained policies
ship in retarget/ckpts/<robot>.pt. The retarget/ subtree is maintained
independently; training (not needed for inference) additionally requires
jax[cuda12] + mujoco-mjx (see retarget/README.md).
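To check that a retarget run produced everything listed above, the output layout `<clip>/retarget/<robot>/` can be expanded into concrete paths. This is a small sketch under the layout stated in this README; the helper names `retarget_outputs` and `missing_outputs` are hypothetical, and nothing is assumed about the contents of the `.npz` files.

```python
from pathlib import Path

# Robot names and output files as listed in this README.
ROBOTS = ("g1", "franka", "robonaut2", "xlerobot")
OUTPUT_FILES = ("trajectory.npz", "robot_sim.mp4", "input_viz.mp4", "metrics.npz")


def retarget_outputs(clip_dir, robot):
    """Map each expected output file name to its path for one robot."""
    if robot not in ROBOTS:
        raise ValueError(f"unknown robot {robot!r}; expected one of {ROBOTS}")
    out_dir = Path(clip_dir) / "retarget" / robot
    return {name: out_dir / name for name in OUTPUT_FILES}


def missing_outputs(clip_dir, robot):
    """Return the names of expected output files that do not exist yet."""
    return [
        name
        for name, path in retarget_outputs(clip_dir, robot).items()
        if not path.is_file()
    ]
```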
EgoInfinity/
├── egoinfinity/ # ★ Orchestration layer (python -m egoinfinity) ★
│ ├── cli/ # process / run / status / filter / import
│ ├── core/ # config, artifacts, state, stage, runner, registry
│ ├── stages/ # Stage wrappers (phase1, post_track.*, retarget, …)
│ └── viz/ # filter viz server
│
├── egoinfinity/pipeline/ # Algorithm layer (importable; called by stages/scripts)
│ ├── config.py # Paths + hyperparameters (env-var driven)
│ ├── moge2_estimator.py # Phase A
│ ├── gravity_estimator.py # Phase A-grav
│ ├── hand_detector.py # Phase B-1
│ ├── hand_reconstructor.py # Phase B-2
│ ├── depth_align.py / depth_stabilize.py # Phase C
│ ├── motion_infiller.py # Phase C+
│ ├── mano_smoothing.py / biomech_constraints.py
│ ├── object_tracker.py # Phase D / D-sam3
│ ├── sam3_client.py / sam3d_client.py # Unix-socket → workers
│ ├── flow3r_depth.py # opt-in Flow3R depth refinement
│ ├── export_retarget_samples.py # pkl → retarget SamplesSequence
│ ├── pose_tracker/ # ★ Phase D-track 6DoF mesh tracking ★
│ ├── post_tracking/ # canonical 8-stage post-tracking algorithms
│ ├── interfaces/ # backend Protocol contracts
│ └── backends/ # swappable-backend registry
│
├── scripts/
│ ├── exo_pipeline.py # Phase-1 subprocess entry (A→E)
│ ├── pipeline_utils.py # free_gpu / image / pkl helpers
│ ├── sam3_worker.py / sam3d_worker.py # SAM3 / SAM 3D Objects workers
│ ├── sam3_detect_cli.py # SAM3 single-image fallback (slow path)
│ └── setup_weights.sh # Download pretrained checkpoints
│
├── tools/ # batch_pipeline + standalone re-run helpers
├── action100m_filter/ # Generic static-cam + hand video filter
├── retarget/ # MuJoCo / IK retargeting to robot arms (independent)
├── third_party/ # sam2 (vendored); wilor (fetched at install, see setup_wilor.sh)
├── configs/ # defaults.yaml + wilor_model_config.yaml
├── docs/ # PIPELINE.md, ARCHITECTURE.md, MANIFEST.md, … (see "Documentation")
├── pretrained_models/ # Gitignored; populated by setup_weights.sh
├── activate.sh.example # Env template — copy + edit, then `source`
├── pyproject.toml / requirements.txt / license.txt
└── README.md
All paths are env-driven so the repo is portable. See
activate.sh.example for the template.
| Variable | Default | Purpose |
|---|---|---|
| `EGOINFINITY_CKPT_DIR` | `<repo>/pretrained_models` | Where WiLoR / SAM2 / infiller .pt files live |
| `HF_HOME` | `~/.cache/huggingface` | HuggingFace cache (MoGe-2, DINOv2, SAM3.1, MEMFOF) |
| `SAM3_PYTHON`, `SAM3_REPO` | `~/miniconda3/envs/sam3/bin/python`, `<repo>/../sam3` | SAM3 worker |
| `SAM3D_PYTHON`, `SAM3D_REPO` | `<repo>/../sam-3d-objects` | SAM 3D Objects worker |
| `YTDLP_COOKIES_FILE` | (empty) | Path to YouTube cookies (Netscape format) for Action100M downloader |
| `EGOINFINITY_RUN_SAM3D` | `1` | Set 0 on 4070 Ti to skip Phase D-sam3d (see docs/PIPELINE.md §4.2 multi-host note) |
| `EGOINFINITY_RUN_TRACK` | `1` | Set 0 to skip Phase D-track |
| `EGOINFINITY_LOW_VRAM` | `0` | Set 1 on small GPUs to lazy-load SAM3 / SAM3D workers and cap residency |
| `EGOINFINITY_HAND_TTA` | `0` | Set 1 for multi-scale (640/960/1280) YOLO TTA on hand detection; costs ~3× detector latency, boosts recall on fast-motion / oblique-angle clips |
| `SAM3D_SPAWN_TIMEOUT` | `240` | Bump to 600 on slow shared-FS clusters (Delta Lustre etc.) |
| `ACTION100M_CACHE` | `<repo>/cache` | Override the favorites cache root (used by batch_pipeline + curation tools) |
| `ACTION100M_DB` | `<repo>/data/action100m_index.db` | Override the Action100M SQLite index path (used by action100m_filter) |
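The env-driven pattern in this table can be sketched as a single function that reads each variable and falls back to the documented default. This is an illustration of the pattern, not the contents of `egoinfinity/pipeline/config.py`: the function name `resolve_settings`, the returned key names, and the rule that a flag is on only when its value is exactly `"1"` are all assumptions.

```python
import os
from pathlib import Path


def resolve_settings(repo_root, env=None):
    """Resolve a few settings from env vars, using this README's defaults."""
    env = os.environ if env is None else env
    repo = Path(repo_root)

    def flag(name, default):
        # Assumption: a flag is enabled only when its value is exactly "1".
        return env.get(name, default) == "1"

    return {
        "ckpt_dir": Path(env.get("EGOINFINITY_CKPT_DIR", repo / "pretrained_models")),
        # <repo>/../sam3 and <repo>/../sam-3d-objects are sibling checkouts.
        "sam3_repo": Path(env.get("SAM3_REPO", repo.parent / "sam3")),
        "sam3d_repo": Path(env.get("SAM3D_REPO", repo.parent / "sam-3d-objects")),
        "run_sam3d": flag("EGOINFINITY_RUN_SAM3D", "1"),
        "run_track": flag("EGOINFINITY_RUN_TRACK", "1"),
        "low_vram": flag("EGOINFINITY_LOW_VRAM", "0"),
        "hand_tta": flag("EGOINFINITY_HAND_TTA", "0"),
        "sam3d_spawn_timeout": int(env.get("SAM3D_SPAWN_TIMEOUT", "240")),
    }
```

Passing `env` explicitly keeps the function testable; in normal use it reads `os.environ`, which is what `source ~/egoinfinity_env.sh` populates.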
Designed for a single A100 (40 GB) or H100 (80 GB). When Phase D is enabled, two persistent GPU workers add ~14 GB residency:
| Worker | Process | Resident VRAM |
|---|---|---|
| SAM3.1 | `sam3_worker.py` (sam3 env subprocess) | ~4 GB |
| SAM 3D Objects | `sam3d_worker.py` (sam3d-objects env subprocess) | ~10 GB |
| Pipeline (active phase) | `exo_pipeline.py` (sequential model load) | 3-5 GB peak |
Smaller GPUs are not officially supported in this revision; SAM3 / SAM3D
would need lazy load + idle-unload (planned — see EGOINFINITY_LOW_VRAM).
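The budget above is simple addition: roughly 4 GB + 10 GB of resident workers plus a 3-5 GB active phase, so about 17-19 GB at peak with everything enabled. The sketch below reproduces that arithmetic using the approximate figures from the table. It is a back-of-the-envelope estimate that ignores CUDA context overhead and allocator fragmentation, and it assumes that disabling a worker removes its whole resident footprint; the function names are hypothetical.

```python
# Approximate resident VRAM per worker, in GB, from the table above.
WORKER_RESIDENT_GB = {"sam3": 4, "sam3d": 10}
# Approximate peak of the active pipeline phase, in GB (low, high).
PIPELINE_PEAK_GB = (3, 5)


def peak_vram_gb(run_sam3=True, run_sam3d=True):
    """Return the (low, high) estimated peak VRAM in GB."""
    resident = 0
    if run_sam3:
        resident += WORKER_RESIDENT_GB["sam3"]
    if run_sam3d:
        resident += WORKER_RESIDENT_GB["sam3d"]
    low, high = PIPELINE_PEAK_GB
    return resident + low, resident + high


def fits(gpu_gb, **kwargs):
    """True if the high estimate fits within gpu_gb (no safety margin)."""
    return peak_vram_gb(**kwargs)[1] <= gpu_gb
```

Under these assumptions a 16 GB card falls short with both workers resident, which matches the suggestion to offload SAM 3D Objects to another host in that case.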
This repo bundles or links code under several licenses:
| Component | License |
|---|---|
| Project source (Python, etc.) | MIT — see license.txt |
| WiLoR (Phase B; fetched at install, not redistributed) | CC-BY-NC-ND — research-only, no derivatives |
| SAM2 (`third_party/sam2/`) | Apache 2.0 |
| Ultralytics YOLO (`ultralytics` dep) | AGPL-3.0 (network copyleft) |
| MANO model (manual download) | Non-commercial research only |
| Robot assets (`retarget/robots/`) | Derived from MuJoCo Menagerie / ManiSkill / NASA; see retarget/robots/README.md |
| MoGe-2 / GeoCalib / SAM3.1 / SAM 3D Objects / MEMFOF | each upstream's license |
The project's own code is MIT, but commercial use of this repo as a whole is restricted by the WiLoR (CC-BY-NC-ND) and MANO (non-commercial) terms, and the bundled Ultralytics YOLO is AGPL-3.0 (its network-copyleft applies if you serve the pipeline over a network). See docs/THIRD_PARTY.md.
@misc{egoinfinity_2026,
title = {EgoInfinity: A Web-Scale 4D Hand-Object Interaction Data Engine for Any-View Robot Retargeting and Video-to-Action Robot Learning},
author = {Wang, Gaotian and Ren, Kejia and Morgan, Andrew and Chen, Yiting and Qian, Howard H. and Chanrungmaneekul, Podshara and Hang, Kaiyu},
year = {2026},
eprint = {2606.17385},
archivePrefix = {arXiv},
primaryClass = {cs.RO},
url = {https://arxiv.org/abs/2606.17385}
}

Plus the upstream backbones: please cite WiLoR / MoGe-2 / SAM2 / SAM3 / HaWoR / MANO as appropriate to the components you use.
Built on:
- WiLoR — Hand localization & MANO recon (Potamias et al., CVPR 2025)
- HaWoR — World-space hand motion + infiller (CVPR 2025)
- MoGe-2 — Monocular metric depth (Microsoft)
- GeoCalib — Gravity / camera calibration (Phase A-grav)
- SAM2 / SAM3 / SAM 3D Objects (Meta)
- MEMFOF — Optical flow (Phase C / D-track)
- Open3D — 6DoF pose tracking geometry (Phase D-track)
- HaMeR — Hand reconstruction backbone
- Ultralytics — YOLO hand detection