
Spatial-RAG Practical Usage Guide

A comprehensive guide to understanding and applying Spatial-RAG World Models in real-world applications.

Table of Contents

  1. What is Spatial-RAG?
  2. Why Spatial-RAG Matters
  3. Real-World Applications
  4. Getting Started
  5. Making It Practical
  6. Integration Examples

What is Spatial-RAG?

Spatial Retrieval-Augmented Generation (Spatial-RAG) is a memory-augmented AI system that:

┌─────────────────────────────────────────────────────────────────────────┐
│                    HOW SPATIAL-RAG WORKS                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. ENCODE          2. STORE              3. RETRIEVE                   │
│  ┌─────────┐        ┌─────────────┐       ┌─────────────┐              │
│  │ Camera  │───────►│ Memory Bank │◄──────│ Similarity  │              │
│  │ Image   │  z_t   │ + Location  │       │ Search      │              │
│  └─────────┘        └─────────────┘       └──────┬──────┘              │
│                                                   │                      │
│  4. PREDICT         5. ACT                        │                      │
│  ┌─────────────┐    ┌─────────────┐              │                      │
│  │ Transition  │◄───┤  Retrieved  │◄─────────────┘                      │
│  │ Model       │    │  Memories   │                                     │
│  └──────┬──────┘    └─────────────┘                                     │
│         │                                                                │
│         ▼                                                                │
│  ┌─────────────┐    ┌─────────────┐                                     │
│  │ z_{t+1}     │───►│ Better      │                                     │
│  │ Prediction  │    │ Decisions   │                                     │
│  └─────────────┘    └─────────────┘                                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

In simple terms:

  • 📸 Sees the world through camera images
  • 🧠 Encodes observations into compact "latent" representations
  • 💾 Remembers experiences with spatial context (where it was)
  • 🔍 Retrieves relevant past experiences when in similar situations
  • 🔮 Predicts what will happen next based on current state + memories
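
The store/retrieve part of this loop can be sketched in a few lines. The class below is a toy in-memory stand-in for the real vector store (the guide's stack uses Qdrant); the names `SpatialMemoryBank`, `store`, and `retrieve` are illustrative, not the project's API.

```python
import math

class SpatialMemoryBank:
    """Toy sketch of the store -> retrieve half of the Spatial-RAG loop."""

    def __init__(self):
        self.entries = []  # (latent, location, note)

    def store(self, z, location, note):
        # Remember an experience together with where it happened.
        self.entries.append((z, location, note))

    def retrieve(self, z_query, k=2):
        # Rank stored experiences by cosine similarity to the query latent.
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
            return dot / norm

        ranked = sorted(self.entries, key=lambda e: cosine(z_query, e[0]), reverse=True)
        return ranked[:k]

bank = SpatialMemoryBank()
bank.store([0.9, 0.1], (10, 4), "narrow alley ahead")
bank.store([0.1, 0.9], (2, 7), "open plaza")
memories = bank.retrieve([0.8, 0.2], k=1)
print(memories[0][2])  # prints the note from the most similar experience
```

In production the latents come from the trained encoder and the similarity search runs inside the vector database, but the shape of the operation is the same.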

Why Spatial-RAG Matters

Traditional AI vs Spatial-RAG

| Traditional AI | Spatial-RAG |
| --- | --- |
| ❌ Sees current state only | ✅ Remembers past experiences |
| ❌ No memory of past experiences | ✅ Retrieves relevant memories |
| ❌ Must learn everything from scratch | ✅ Reuses past knowledge |
| ❌ Slow to adapt | ✅ Fast adaptation |

Real-World Benefits

| Benefit | Description |
| --- | --- |
| Sample Efficiency | Learns from fewer examples by reusing past experiences |
| Better Predictions | Uses spatial context to improve accuracy by 15-30% |
| Faster Adaptation | Adapts to new environments using similar past experiences |
| Safety | Predicts outcomes before taking actions |
| Contextual Awareness | Remembers "I've been here before" |

Real-World Applications

1. 🤖 Autonomous Robots and Drones

Scenario: Delivery robot navigating a city

┌─────────────────────────────────────────────────────────────────────────┐
│                    DELIVERY ROBOT NAVIGATION                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Current Location: 📍 Intersection A                                    │
│                                                                          │
│  1. ENCODE: Camera → Latent z_t                                         │
│     "I see a crosswalk, building on left, traffic light"               │
│                                                                          │
│  2. RETRIEVE: "What happened last time I was here?"                     │
│     Memory: "Pedestrians often cross at 5pm"                            │
│     Memory: "Turn left leads to narrow alley"                           │
│                                                                          │
│  3. PREDICT: "If I turn left..."                                        │
│     z_{t+1} → Decoded: "I'll see narrow alley, hard to navigate"       │
│                                                                          │
│  4. DECIDE: Go straight instead                                         │
│                                                                          │
│  Result: ✅ Safer route, avoided narrow alley                           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Benefits:

  • Fewer collisions with obstacles
  • Faster learning of new routes
  • Remembers which paths work best

2. 🚗 Self-Driving Cars

Scenario: Car approaching an intersection

┌─────────────────────────────────────────────────────────────────────────┐
│                    SELF-DRIVING CAR SCENARIO                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  [Car Camera View]                                                       │
│  ┌──────────────────────────────────┐                                   │
│  │    🚶 ←pedestrian area           │                                   │
│  │         🚗                        │                                   │
│  │    ═══════════════               │                                   │
│  │         crosswalk                │                                   │
│  └──────────────────────────────────┘                                   │
│                                                                          │
│  SPATIAL-RAG PROCESS:                                                    │
│                                                                          │
│  1. Current view → Encode → z_t = [0.12, -0.34, ...]                   │
│                                                                          │
│  2. Query memory: "Similar intersection views?"                         │
│     Retrieved: "Last time, pedestrian crossed here unexpectedly"       │
│     Retrieved: "This intersection has 40% pedestrian crossing rate"    │
│                                                                          │
│  3. Predict next frames:                                                │
│     Frame +1: Pedestrian starts crossing                                │
│     Frame +2: Pedestrian in crosswalk                                   │
│     Frame +3: Pedestrian clears                                         │
│                                                                          │
│  4. Action: SLOW DOWN, wait for pedestrian                              │
│                                                                          │
│  Result: ✅ Safer driving, context-aware decisions                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Benefits:

  • Predicts pedestrian behavior based on past experience
  • Location-aware risk assessment
  • Learns from near-misses
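
The "retrieve, then decide" step in this scenario reduces to a small policy. The sketch below assumes retrieved memories carry a hypothetical `crossing_rate` payload field (the actual payload schema depends on how you seed the memory store), and the 0.3 threshold is arbitrary.

```python
def choose_action(retrieved_memories, risk_threshold=0.3):
    """Slow down if any retrieved memory reports a high pedestrian crossing rate.

    `crossing_rate` is a hypothetical payload field on each memory dict.
    """
    if not retrieved_memories:
        return "proceed"  # no relevant experience: default behavior
    worst = max(m.get("crossing_rate", 0.0) for m in retrieved_memories)
    return "slow_down" if worst >= risk_threshold else "proceed"

memories = [
    {"note": "pedestrian crossed unexpectedly", "crossing_rate": 0.4},
    {"note": "clear at night", "crossing_rate": 0.05},
]
print(choose_action(memories))  # prints: slow_down
```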

3. 📦 Warehouse Robots

Scenario: Robot picking items in a warehouse

┌─────────────────────────────────────────────────────────────────────────┐
│                    WAREHOUSE ROBOT SCENARIO                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  [Warehouse Layout]                                                      │
│  ┌────┬────┬────┬────┐                                                  │
│  │ A1 │ A2 │ A3 │ A4 │  ← Shelves                                       │
│  ├────┼────┼────┼────┤                                                  │
│  │ B1 │ B2 │ B3 │ B4 │                                                  │
│  ├────┼────┼────┼────┤                                                  │
│  │ C1 │ C2 │🤖 │ C4 │  ← Robot at C3                                   │
│  └────┴────┴────┴────┘                                                  │
│                                                                          │
│  Task: Find item #12345                                                 │
│                                                                          │
│  SPATIAL-RAG PROCESS:                                                    │
│                                                                          │
│  1. Encode current view of C3 shelf                                     │
│                                                                          │
│  2. Retrieve: "Where have I seen similar items?"                        │
│     Memory: "Item #12345 last seen at A2, top shelf"                   │
│     Memory: "Similar items grouped in row A"                            │
│                                                                          │
│  3. Predict: "If I go to A2..."                                         │
│     z_{t+1} decoded: Shows A2 shelf layout                              │
│                                                                          │
│  4. Navigate directly to A2                                             │
│                                                                          │
│  Result: ✅ 3x faster item location                                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Benefits:

  • Learns warehouse layout over time
  • Remembers where items are typically located
  • Faster path planning
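
The warehouse example above boils down to a memory lookup plus a cheap travel estimate. A minimal sketch, assuming cells are labeled row-letter/column-number as in the diagram and that item sightings are kept in a simple dict (the real system would query the memory bank instead):

```python
def plan_pick(robot_cell, item_id, item_memory):
    """Look up where an item was last seen and estimate travel cost in grid steps."""

    def to_coords(cell):
        # 'A2' -> (row 0, col 1); matches the shelf grid in the diagram.
        return (ord(cell[0]) - ord("A"), int(cell[1:]) - 1)

    target = item_memory.get(item_id)
    if target is None:
        return None, None  # no memory: fall back to exhaustive search

    r0, c0 = to_coords(robot_cell)
    r1, c1 = to_coords(target)
    steps = abs(r0 - r1) + abs(c0 - c1)  # Manhattan distance on the grid
    return target, steps

memory = {"12345": "A2"}  # hypothetical "last seen" record
cell, steps = plan_pick("C3", "12345", memory)
print(cell, steps)  # prints: A2 3
```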

4. 🏠 Embodied AI Assistants (Home Robots)

Scenario: Home robot learning to navigate

┌─────────────────────────────────────────────────────────────────────────┐
│                    HOME ROBOT SCENARIO                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  [Floor Plan]                                                           │
│  ┌─────────────┬───────────────┐                                       │
│  │   Kitchen   │   Living Room │                                       │
│  │   🍳        │     🛋️  📺    │                                       │
│  │         🤖 ←──── Robot      │                                       │
│  ├─────────────┼───────────────┤                                       │
│  │   Bedroom   │   Bathroom    │                                       │
│  │    🛏️       │     🚿        │                                       │
│  └─────────────┴───────────────┘                                       │
│                                                                          │
│  Command: "Get me a glass of water"                                     │
│                                                                          │
│  SPATIAL-RAG PROCESS:                                                    │
│                                                                          │
│  1. Current location: Kitchen doorway                                   │
│                                                                          │
│  2. Retrieve spatial memories:                                          │
│     "Glasses are in cabinet above sink"                                 │
│     "Water dispenser is on refrigerator"                                │
│     "This angle shows path to sink"                                     │
│                                                                          │
│  3. Predict: "If I go to sink area..."                                  │
│     z_{t+1} decoded: Shows sink and cabinet                             │
│                                                                          │
│  4. Execute: Navigate → Open cabinet → Get glass → Fill water           │
│                                                                          │
│  Result: ✅ Learns house layout, remembers object locations             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Benefits:

  • Learns home layout over time
  • Remembers where things are kept
  • Adapts to furniture changes

5. 👓 Augmented Reality (AR)

Scenario: AR glasses showing navigation

┌─────────────────────────────────────────────────────────────────────────┐
│                    AR NAVIGATION SCENARIO                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  [AR View Through Glasses]                                              │
│  ┌────────────────────────────────────────┐                            │
│  │                                        │                            │
│  │     🏢 Building A    ➡️ 50m            │                            │
│  │                                        │                            │
│  │  ══════════════════════════           │                            │
│  │        (street view)                  │                            │
│  │                                        │                            │
│  │  [Predicted next view overlay]        │                            │
│  │  "In 10 steps, you'll see..."        │                            │
│  │                                        │                            │
│  └────────────────────────────────────────┘                            │
│                                                                          │
│  SPATIAL-RAG PROCESS:                                                    │
│                                                                          │
│  1. Encode current street view                                          │
│                                                                          │
│  2. Retrieve: "What do I know about this street?"                       │
│     "Coffee shop 30m ahead on right"                                   │
│     "Subway entrance at next corner"                                   │
│                                                                          │
│  3. Predict future views → Display as AR overlay                        │
│                                                                          │
│  Result: ✅ Predictive navigation, contextual information               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Benefits:

  • Predictive UI showing what's ahead
  • Location-aware information overlays
  • Better user experience

Getting Started

Step 1: Set Up the System

```bash
# Build base image
docker compose build base

# Start core services
docker compose --profile core up -d

# Verify
curl http://localhost:8080/health
```

Step 2: Train with Synthetic Data

```bash
# Generate synthetic data
docker compose run --rm generate-data python scripts/simulate_env.py \
    --out data/trajectories --n 500

# Train models
docker compose run --rm train

# Restart API with trained model
docker compose restart api
```

Step 3: Test with the UI

```bash
# Start UI
docker compose --profile ui up -d

# Open browser
# http://localhost:3000
```

Step 4: Visualize Predictions

  1. Open http://localhost:3000
  2. Click "Generate Random Latent"
  3. Click "Start Rollout"
  4. Watch predicted frames stream in real-time
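
Under the hood, "Start Rollout" applies the transition model autoregressively: each predicted latent is fed back in to produce the next one. A minimal sketch, with a toy linear transition standing in for the trained model:

```python
def rollout(z0, transition, horizon=4):
    """Autoregressive rollout: feed each predicted latent back into the model."""
    zs = [z0]
    for _ in range(horizon):
        zs.append(transition(zs[-1]))
    return zs

# Stand-in transition (the real one is the trained model behind the API):
def toy_transition(z):
    return [0.9 * v for v in z]  # decay toward the origin

frames = rollout([1.0, -2.0], toy_transition, horizon=3)
print(len(frames))  # prints 4: the initial latent + 3 predicted steps
```

Each predicted latent is then decoded into a frame for display, which is what streams into the UI.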

Making It Practical

What You Need

| Requirement | Description |
| --- | --- |
| Train Models | Run training script to learn meaningful representations |
| Add Real Data | Feed in real camera images from robots/drones |
| Seed Memory | Populate Qdrant with real trajectory data |
| Deploy | Use ROS2 node for real robot integration |

Production Workflow

┌─────────────────────────────────────────────────────────────────────────┐
│                    PRODUCTION DEPLOYMENT WORKFLOW                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. DATA COLLECTION                                                      │
│     ┌─────────────┐                                                     │
│     │ Real Robot  │──► collect_robot_data.py                           │
│     │ Camera Data │    → data/robot_trajectories/                       │
│     └─────────────┘                                                     │
│                                                                          │
│  2. TRAINING                                                             │
│     ┌─────────────┐                                                     │
│     │ Docker      │──► train_synthetic.py                              │
│     │ Training    │    → checkpoints/model.pt                           │
│     └─────────────┘                                                     │
│                                                                          │
│  3. DEPLOYMENT                                                           │
│     ┌─────────────┐                                                     │
│     │ Export      │──► export_torchscript.py                           │
│     │ Model       │    → artifacts/encoder_transition.pt                │
│     └─────────────┘                                                     │
│                                                                          │
│  4. INTEGRATION                                                          │
│     ┌─────────────┐     ┌─────────────┐                                │
│     │ ROS2 Node   │◄───►│ Real Robot  │                                │
│     │ /latent     │     │ Navigation  │                                │
│     └─────────────┘     └─────────────┘                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Integration Examples

ROS2 Robot Integration

```bash
# Start ROS2 node
docker compose --profile ros2 up -d

# Verify topics
docker compose exec ros2 bash -c "source /opt/ros/humble/setup.bash && ros2 topic list"
```

Topics:

  • /camera/image_raw → Subscribe to camera
  • /latent ← Published 32-dim latent @ 20Hz
  • /actions → Subscribe to action commands
  • /latent_next ← Predicted next latent

Python API Integration

```python
import requests
import numpy as np

API_URL = "http://localhost:8080"

# Encode an image
with open("image.jpg", "rb") as f:
    response = requests.post(f"{API_URL}/encode", files={"file": f})
    latent = response.json()["z"]

# Retrieve similar memories
response = requests.post(f"{API_URL}/retrieve", json={
    "z": latent,
    "k": 5
})
memories = response.json()

# Predict next state
response = requests.post(f"{API_URL}/predict", json={
    "z": latent,
    "action": [0.5, -0.3],
    "use_retrieval": True
})
z_next = response.json()["z_next"]
```

Edge Deployment (Jetson/Raspberry Pi)

```bash
# Export optimized model
python scripts/export_torchscript.py \
    --out artifacts/model.pt \
    --z-dim 32

# Run on edge device
python scripts/runner_inference.py \
    --model artifacts/model.pt \
    --target-ms 10
```
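
To check whether a device actually meets the `--target-ms 10` budget, it is worth timing the forward pass directly. A minimal sketch with a stand-in forward function; a real check would load the exported TorchScript model instead:

```python
import time

def measure_latency_ms(fn, arg, n_runs=50, warmup=5):
    """Average wall-clock latency of one inference call, in milliseconds."""
    for _ in range(warmup):
        fn(arg)  # warm caches before timing
    start = time.perf_counter()
    for _ in range(n_runs):
        fn(arg)
    return (time.perf_counter() - start) / n_runs * 1000.0

# Stand-in for the exported model's forward pass:
def fake_forward(z):
    return [v * 0.5 for v in z]

avg_ms = measure_latency_ms(fake_forward, [0.0] * 32)
print(f"avg latency: {avg_ms:.3f} ms (budget: 10 ms)")
```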

Key Takeaways

| Concept | What It Means |
| --- | --- |
| Latent Space | Compressed representation of observations (32 numbers instead of millions of pixels) |
| Spatial Memory | Store experiences with location info ("I saw X at position Y") |
| Retrieval | Find relevant past experiences when in similar situations |
| Prediction | Forecast future states using current observation + retrieved memories |
| World Model | Internal simulation of how the world works |
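
The compression in "Latent Space" is easy to quantify. The guide only states that the latent is 32-dimensional, so the 64x64 RGB frame size below is an assumption for illustration:

```python
# Size of one raw frame vs. its latent representation.
height, width, channels = 64, 64, 3      # assumed frame shape
raw_values = height * width * channels   # 12,288 numbers per frame
latent_values = 32                       # z_dim from the guide
print(raw_values // latent_values)       # prints 384 (a 384x reduction)
```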

Next Steps

  1. Run the Demo: Start services and use the UI
  2. Train on Your Data: Collect real images and train
  3. Integrate with Robot: Use ROS2 node on real hardware
  4. Customize: Adjust z_dim, topk, and training parameters

For detailed instructions, see: