A comprehensive guide to understanding and applying Spatial-RAG World Models in real-world applications.
- What is Spatial-RAG?
- Why Spatial-RAG Matters
- Real-World Applications
- Getting Started
- Making It Practical
- Integration Examples
Spatial Retrieval-Augmented Generation (Spatial-RAG) is a memory-augmented AI system that pairs a learned world model with a spatially indexed memory of past experiences:
┌─────────────────────────────────────────────────────────────────────────┐
│ HOW SPATIAL-RAG WORKS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. ENCODE 2. STORE 3. RETRIEVE │
│ ┌─────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Camera │───────►│ Memory Bank │◄──────│ Similarity │ │
│ │ Image │ z_t │ + Location │ │ Search │ │
│ └─────────┘ └─────────────┘ └──────┬──────┘ │
│ │ │
│ 4. PREDICT 5. ACT │ │
│ ┌─────────────┐ ┌─────────────┐ │ │
│ │ Transition │◄───┤ Retrieved │◄─────────────┘ │
│ │ Model │ │ Memories │ │
│ └──────┬──────┘ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ z_{t+1} │───►│ Better │ │
│ │ Prediction │ │ Decisions │ │
│ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
In simple terms:
- 📸 Sees the world through camera images
- 🧠 Encodes observations into compact "latent" representations
- 💾 Remembers experiences with spatial context (where it was)
- 🔍 Retrieves relevant past experiences when in similar situations
- 🔮 Predicts what will happen next based on current state + memories
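The five steps in the diagram can be sketched as a minimal loop. Everything below is a stand-in: the `encode`, `retrieve`, and `predict` functions and the 32-dim latent are illustrative placeholders, not the project's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image):
    """Stand-in encoder (step 1): compress an observation to a 32-dim latent."""
    return rng.standard_normal(32)  # a trained CNN/VAE would go here

memory_bank = []  # step 2: latents stored together with spatial context

def store(z, location):
    memory_bank.append({"z": z, "location": location})

def retrieve(z_query, k=3):
    """Step 3: cosine-similarity search over stored latents."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    ranked = sorted(memory_bank, key=lambda m: cos(z_query, m["z"]), reverse=True)
    return ranked[:k]

def predict(z, action, memories):
    """Step 4: toy transition model blending the current latent with memories."""
    context = np.mean([m["z"] for m in memories], axis=0)
    return 0.8 * z + 0.1 * context + 0.1 * float(np.mean(action))

# One cycle: encode, store, retrieve, predict; step 5 would act on z_next
z_t = encode("camera_frame.jpg")
store(z_t, location=(12.5, 3.1))
z_next = predict(z_t, action=[0.5, -0.3], memories=retrieve(z_t))
print(z_next.shape)  # → (32,)
```

In a real deployment the encoder and transition model are trained networks and the memory bank is a vector database; the control flow, however, is exactly this loop.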
| Traditional AI | Spatial-RAG |
|---|---|
| ❌ Sees current state only | ✅ Remembers past experiences |
| ❌ No memory of past experiences | ✅ Retrieves relevant memories |
| ❌ Must learn everything from scratch | ✅ Reuses past knowledge |
| ❌ Slow to adapt | ✅ Fast adaptation |
| Benefit | Description |
|---|---|
| Sample Efficiency | Learns from fewer examples by reusing past experiences |
| Better Predictions | Uses spatial context to improve accuracy by 15-30% |
| Faster Adaptation | Adapts to new environments using similar past experiences |
| Safety | Predicts outcomes before taking actions |
| Contextual Awareness | Remembers "I've been here before" |
Scenario: Delivery robot navigating a city
┌─────────────────────────────────────────────────────────────────────────┐
│ DELIVERY ROBOT NAVIGATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Current Location: 📍 Intersection A │
│ │
│ 1. ENCODE: Camera → Latent z_t │
│ "I see a crosswalk, building on left, traffic light" │
│ │
│ 2. RETRIEVE: "What happened last time I was here?" │
│ Memory: "Pedestrians often cross at 5pm" │
│ Memory: "Turn left leads to narrow alley" │
│ │
│ 3. PREDICT: "If I turn left..." │
│ z_{t+1} → Decoded: "I'll see narrow alley, hard to navigate" │
│ │
│ 4. DECIDE: Go straight instead │
│ │
│ Result: ✅ Safer route, avoided narrow alley │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Benefits:
- Fewer collisions with obstacles
- Faster learning of new routes
- Remembers which paths work best
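For scenarios like this, the retrieval score can combine appearance similarity with spatial proximity. A hedged sketch: the `spatial_score` function and its `alpha` weighting are illustrative assumptions, not the system's actual scoring rule.

```python
import numpy as np

def spatial_score(z_query, loc_query, memory, alpha=0.7):
    """Blend latent (appearance) similarity with spatial proximity.

    alpha weights appearance vs. location; both terms lie in [0, 1].
    """
    z = memory["z"]
    sim = float(z_query @ z / (np.linalg.norm(z_query) * np.linalg.norm(z) + 1e-8))
    dist = float(np.linalg.norm(np.asarray(loc_query) - np.asarray(memory["location"])))
    proximity = 1.0 / (1.0 + dist)  # 1 at the same spot, → 0 far away
    return alpha * sim + (1 - alpha) * proximity

# Memories tagged with where they were recorded
memories = [
    {"z": np.ones(32), "location": (0.0, 0.0), "note": "Pedestrians often cross at 5pm"},
    {"z": np.ones(32), "location": (90.0, 90.0), "note": "Turn left leads to narrow alley"},
]
best = max(memories, key=lambda m: spatial_score(np.ones(32), (0.5, 0.5), m))
print(best["note"])  # → Pedestrians often cross at 5pm
```

With two visually identical memories, the location term breaks the tie in favor of the one recorded near the robot's current position.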
Scenario: Car approaching an intersection
┌─────────────────────────────────────────────────────────────────────────┐
│ SELF-DRIVING CAR SCENARIO │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ [Car Camera View] │
│ ┌──────────────────────────────────┐ │
│ │ 🚶 ←pedestrian area │ │
│ │ 🚗 │ │
│ │ ═══════════════ │ │
│ │ crosswalk │ │
│ └──────────────────────────────────┘ │
│ │
│ SPATIAL-RAG PROCESS: │
│ │
│ 1. Current view → Encode → z_t = [0.12, -0.34, ...] │
│ │
│ 2. Query memory: "Similar intersection views?" │
│ Retrieved: "Last time, pedestrian crossed here unexpectedly" │
│ Retrieved: "This intersection has 40% pedestrian crossing rate" │
│ │
│ 3. Predict next frames: │
│ Frame +1: Pedestrian starts crossing │
│ Frame +2: Pedestrian in crosswalk │
│ Frame +3: Pedestrian clears │
│ │
│ 4. Action: SLOW DOWN, wait for pedestrian │
│ │
│ Result: ✅ Safer driving, context-aware decisions │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Benefits:
- Predicts pedestrian behavior based on past experience
- Location-aware risk assessment
- Learns from near-misses
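The multi-frame prediction in step 3 amounts to iterating the transition model on its own output. A toy sketch, where the linear `transition` model is a placeholder for the trained network:

```python
import numpy as np

def transition(z, action):
    """Toy linear transition model; the real system uses a trained network."""
    A = np.eye(32) * 0.95          # slight decay of the latent
    B = np.ones((32, 2)) * 0.01    # small action influence
    return A @ z + B @ np.asarray(action)

def rollout(z0, actions):
    """Predict frames +1, +2, +3, ... by feeding predictions back in."""
    zs, z = [], z0
    for a in actions:
        z = transition(z, a)
        zs.append(z)
    return zs

z_t = np.ones(32)
future = rollout(z_t, actions=[[0.0, 0.0]] * 3)  # e.g. "hold speed" for 3 frames
print(len(future))  # → 3
```

Prediction error compounds at each step, which is why practical systems keep rollouts short (a handful of frames) before re-observing.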
Scenario: Robot picking items in a warehouse
┌─────────────────────────────────────────────────────────────────────────┐
│ WAREHOUSE ROBOT SCENARIO │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ [Warehouse Layout] │
│ ┌────┬────┬────┬────┐ │
│ │ A1 │ A2 │ A3 │ A4 │ ← Shelves │
│ ├────┼────┼────┼────┤ │
│ │ B1 │ B2 │ B3 │ B4 │ │
│ ├────┼────┼────┼────┤ │
│ │ C1 │ C2 │🤖 │ C4 │ ← Robot at C3 │
│ └────┴────┴────┴────┘ │
│ │
│ Task: Find item #12345 │
│ │
│ SPATIAL-RAG PROCESS: │
│ │
│ 1. Encode current view of C3 shelf │
│ │
│ 2. Retrieve: "Where have I seen similar items?" │
│ Memory: "Item #12345 last seen at A2, top shelf" │
│ Memory: "Similar items grouped in row A" │
│ │
│ 3. Predict: "If I go to A2..." │
│ z_{t+1} decoded: Shows A2 shelf layout │
│ │
│ 4. Navigate directly to A2 │
│ │
│ Result: ✅ 3x faster item location │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Benefits:
- Learns warehouse layout over time
- Remembers where items are typically located
- Faster path planning
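Step 2's item lookup is essentially retrieval filtered on memory metadata. A minimal sketch, assuming memories carry an `items` payload (an illustrative schema, not the repo's actual one):

```python
# Memories carry payload metadata alongside the latent and location
memories = [
    {"location": "A2", "items": {"12345", "12346"}},
    {"location": "B1", "items": {"99001"}},
    {"location": "C3", "items": set()},
]

def last_seen(item_id):
    """Payload-filtered lookup: where was this item last observed?"""
    hits = [m for m in memories if item_id in m["items"]]
    return hits[-1]["location"] if hits else None

print(last_seen("12345"))  # → A2
```

Vector databases such as Qdrant support this pattern natively via payload filters combined with similarity search.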
Scenario: Home robot learning to navigate
┌─────────────────────────────────────────────────────────────────────────┐
│ HOME ROBOT SCENARIO │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ [Floor Plan] │
│ ┌─────────────┬───────────────┐ │
│ │ Kitchen │ Living Room │ │
│ │ 🍳 │ 🛋️ 📺 │ │
│ │ 🤖 ←──── Robot │ │
│ ├─────────────┼───────────────┤ │
│ │ Bedroom │ Bathroom │ │
│ │ 🛏️ │ 🚿 │ │
│ └─────────────┴───────────────┘ │
│ │
│ Command: "Get me a glass of water" │
│ │
│ SPATIAL-RAG PROCESS: │
│ │
│ 1. Current location: Kitchen doorway │
│ │
│ 2. Retrieve spatial memories: │
│ "Glasses are in cabinet above sink" │
│ "Water dispenser is on refrigerator" │
│ "This angle shows path to sink" │
│ │
│ 3. Predict: "If I go to sink area..." │
│ z_{t+1} decoded: Shows sink and cabinet │
│ │
│ 4. Execute: Navigate → Open cabinet → Get glass → Fill water │
│ │
│ Result: ✅ Learns house layout, remembers object locations │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Benefits:
- Learns home layout over time
- Remembers where things are kept
- Adapts to furniture changes
Scenario: AR glasses showing navigation
┌─────────────────────────────────────────────────────────────────────────┐
│ AR NAVIGATION SCENARIO │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ [AR View Through Glasses] │
│ ┌────────────────────────────────────────┐ │
│ │ │ │
│ │ 🏢 Building A ➡️ 50m │ │
│ │ │ │
│ │ ══════════════════════════ │ │
│ │ (street view) │ │
│ │ │ │
│ │ [Predicted next view overlay] │ │
│ │ "In 10 steps, you'll see..." │ │
│ │ │ │
│ └────────────────────────────────────────┘ │
│ │
│ SPATIAL-RAG PROCESS: │
│ │
│ 1. Encode current street view │
│ │
│ 2. Retrieve: "What do I know about this street?" │
│ "Coffee shop 30m ahead on right" │
│ "Subway entrance at next corner" │
│ │
│ 3. Predict future views → Display as AR overlay │
│ │
│ Result: ✅ Predictive navigation, contextual information │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Benefits:
- Predictive UI showing what's ahead
- Location-aware information overlays
- Better user experience
# Build base image
docker compose build base
# Start core services
docker compose --profile core up -d
# Verify
curl http://localhost:8080/health

# Generate synthetic data
docker compose run --rm generate-data python scripts/simulate_env.py \
--out data/trajectories --n 500
# Train models
docker compose run --rm train
# Restart API with trained model
docker compose restart api

# Start UI
docker compose --profile ui up -d
# Open browser
# http://localhost:3000

- Open http://localhost:3000
- Click "Generate Random Latent"
- Click "Start Rollout"
- Watch predicted frames stream in real-time
| Requirement | Description |
|---|---|
| Train Models | Run training script to learn meaningful representations |
| Add Real Data | Feed in real camera images from robots/drones |
| Seed Memory | Populate Qdrant with real trajectory data |
| Deploy | Use ROS2 node for real robot integration |
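Seeding the memory store amounts to converting logged trajectories into (vector, payload) points. A hedged sketch, assuming a JSONL trajectory format with `z`, `pose`, and `t` fields (the repo's actual schema may differ); with qdrant-client, dicts like these would map onto `PointStruct` objects at upsert time.

```python
import json
import tempfile

def trajectory_to_points(path):
    """Turn one JSONL trajectory file into (vector, payload) points.

    Assumed record shape: {"z": [...], "pose": [x, y], "t": step}.
    """
    points = []
    with open(path) as f:
        for i, line in enumerate(f):
            rec = json.loads(line)
            points.append({"id": i, "vector": rec["z"],
                           "payload": {"pose": rec["pose"], "t": rec["t"]}})
    return points

# Tiny two-step example trajectory written to a temp file
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write(json.dumps({"z": [0.0] * 32, "pose": [0.0, 0.0], "t": 0}) + "\n")
    f.write(json.dumps({"z": [0.1] * 32, "pose": [1.0, 0.0], "t": 1}) + "\n")
    path = f.name

points = trajectory_to_points(path)
print(len(points))  # → 2
```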
┌─────────────────────────────────────────────────────────────────────────┐
│ PRODUCTION DEPLOYMENT WORKFLOW │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. DATA COLLECTION │
│ ┌─────────────┐ │
│ │ Real Robot │──► collect_robot_data.py │
│ │ Camera Data │ → data/robot_trajectories/ │
│ └─────────────┘ │
│ │
│ 2. TRAINING │
│ ┌─────────────┐ │
│ │ Docker │──► train_synthetic.py │
│ │ Training │ → checkpoints/model.pt │
│ └─────────────┘ │
│ │
│ 3. DEPLOYMENT │
│ ┌─────────────┐ │
│ │ Export │──► export_torchscript.py │
│ │ Model │ → artifacts/encoder_transition.pt │
│ └─────────────┘ │
│ │
│ 4. INTEGRATION │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ ROS2 Node │◄───►│ Real Robot │ │
│ │ /latent │ │ Navigation │ │
│ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
# Start ROS2 node
docker compose --profile ros2 up -d
# Verify topics
docker compose exec ros2 bash -c "source /opt/ros/humble/setup.bash && ros2 topic list"

Topics:
- /camera/image_raw → Subscribe to camera
- /latent ← Published 32-dim latent @ 20Hz
- /actions → Subscribe to action commands
- /latent_next ← Predicted next latent
import requests
import numpy as np
API_URL = "http://localhost:8080"
# Encode an image
with open("image.jpg", "rb") as f:
response = requests.post(f"{API_URL}/encode", files={"file": f})
latent = response.json()["z"]
# Retrieve similar memories
response = requests.post(f"{API_URL}/retrieve", json={
"z": latent,
"k": 5
})
memories = response.json()
# Predict next state
response = requests.post(f"{API_URL}/predict", json={
"z": latent,
"action": [0.5, -0.3],
"use_retrieval": True
})
z_next = response.json()["z_next"]

# Export optimized model
python scripts/export_torchscript.py \
--out artifacts/model.pt \
--z-dim 32
# Run on edge device
python scripts/runner_inference.py \
--model artifacts/model.pt \
  --target-ms 10

| Concept | What It Means |
|---|---|
| Latent Space | Compressed representation of observations (32 numbers instead of millions of pixels) |
| Spatial Memory | Store experiences with location info ("I saw X at position Y") |
| Retrieval | Find relevant past experiences when in similar situations |
| Prediction | Forecast future states using current observation + retrieved memories |
| World Model | Internal simulation of how the world works |
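The "Latent Space" row can be made concrete with a toy projection, where a random linear map stands in for a trained encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 64x64 RGB "image" already has 12,288 values; a camera frame has millions
pixels = rng.random((64, 64, 3)).reshape(-1)

# Toy linear encoder: project the pixels down to a 32-dim latent.
# A trained encoder (CNN/VAE) learns this mapping from data instead.
W = rng.standard_normal((32, pixels.size)) / np.sqrt(pixels.size)
z = W @ pixels

print(pixels.size, "->", z.size)  # → 12288 -> 32
```

Similarity search, storage, and prediction all operate on the 32 numbers, which is what makes memory retrieval fast enough for real-time use.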
- Run the Demo: Start services and use the UI
- Train on Your Data: Collect real images and train
- Integrate with Robot: Use ROS2 node on real hardware
- Customize: Adjust `z_dim`, `topk`, and training parameters
For detailed instructions, see:
- README.md - Quick start
- DEPLOYMENT.md - Production deployment
- DESIGN.md - Architecture details
- DATA_COLLECTION.md - Collecting robot data