---
theme: seriph
highlighter: shiki
class: text-center
title: Diffusion Model for Control and Planning Tutorial
background: figs/diffuse_teaser.gif
layout: cover
---
- 🔄 Recap: What is a Diffusion Model?
- 🚀 Motivation: Why a Generative Model in Control and Planning?
- 🛠️ Practice: How to Use the Diffuser?
- 📚 Literature: Recent Research Progress in Diffusion for RL/Control
- 📝 Summary & Challenges in Diffusion Models
- Keynote: Generative model for distribution matching.
- Applications: Image and text generation, creative tasks.
- Core: Score function for sample generation and distribution description.
$$
\boldsymbol{x}_{i+1} \leftarrow \boldsymbol{x}_i+c \nabla \log p\left(\boldsymbol{x}_i\right)+\sqrt{2 c} \boldsymbol{\epsilon}, \quad i=0,1, \ldots, K
$$
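The Langevin update above can be sketched in a few lines. This is an illustrative sketch (not code from the tutorial), using the analytically known score of a standard Gaussian, $\nabla \log p(x) = -x$, so the iterates should approach samples from $\mathcal{N}(0, 1)$:

```python
import numpy as np

def langevin_sample(score, x0, c=0.01, K=2000, rng=None):
    """Unadjusted Langevin dynamics: x_{i+1} = x_i + c * score(x_i) + sqrt(2c) * eps."""
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    for _ in range(K):
        x = x + c * score(x) + np.sqrt(2 * c) * rng.standard_normal(x.shape)
    return x

# Score of a standard Gaussian: grad log p(x) = -x.
samples = langevin_sample(lambda x: -x, x0=np.zeros(5000), c=0.01, K=2000, rng=0)
print(samples.mean(), samples.std())  # both close to 0 and 1, respectively
```

Running many chains in parallel (one per array entry) makes the convergence to the target distribution easy to check empirically.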

- Advantages:
  - 🌟 Multimodal: Effective with multimodal distributions.
  - 📈 Scalable: Suits high-dimensional problems.
  - 🔒 Stable: Grounded in solid mathematics and training.
  - 🔄 Non-autoregressive: Predicts entire trajectories efficiently.
- Generative Models: application in imitation learning to match expert data.
- Examples: GANs, VAEs in imitation learning.
- GAN in GAIL: Discriminator learning and policy training.
- Idea: Train a discriminator to distinguish between expert and agent data.
- Limitation: Struggles with multimodal distributions, unstable training.
- VAE in ACT (ALOHA): Latent space learning for planning.
- Idea: learn a latent space for planning and control (generates actions in chunks).
- Limitation: hard to train.
---
layout: iframe
---
Scenario: Imitation Learning
- Challenge: Match high-dimensional, multimodal trajectory distributions.
- Solution: Diffusion models for expressive distribution matching.
- Common Method: GAIL with adversarial training.
- Limitation: Struggles with multimodal distributions, unstable training.
Scenario: Offline Reinforcement Learning
- Challenge: Outperform the demonstrations while keeping the action distribution close to the data.
- Solution: Diffusion models to match the action distribution effectively.
- Common Method: CQL, which penalizes out-of-distribution samples.
- Limitation: Can be over-conservative.
Scenario: Model-based Reinforcement Learning
- Challenge: Match dynamic model and policy's action distribution.
- Solution: Diffusion models for non-autoregressive, multimodal matching.
- Common method: planning with learned dynamics.
- Limitation: compounding error in long-horizon planning.
Key: using a powerful model to match a high-dimensional, multimodal distribution.
- Action/Value distribution matching: grounded in demonstrations -> offline RL.
- Trajectory distribution matching: dynamic feasibility and optimal trajectory distribution -> model-based RL.
- Transition distribution matching: dynamics matching in a non-autoregressive manner -> model-based RL.
- Most common: diffuse the trajectory (Diffuser).
- Diffused variable $x$: the state-action sequence $\tau = \{s_0, a_0, s_1, a_1, \ldots, s_T, a_T\}$.
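As a concrete sketch of what gets diffused, a trajectory $\tau$ can be stored as a $(T+1) \times (|s| + |a|)$ array, with the standard DDPM forward-noising step applied to the whole array at once. The dimensions below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical sizes: horizon T = 15, state dim 3, action dim 2.
T, s_dim, a_dim = 15, 3, 2
tau0 = np.zeros((T + 1, s_dim + a_dim))  # row t holds (s_t, a_t)

def q_sample(tau, alpha_bar, rng=None):
    """DDPM forward process: tau_t = sqrt(abar) * tau_0 + sqrt(1 - abar) * eps."""
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal(tau.shape)
    return np.sqrt(alpha_bar) * tau + np.sqrt(1 - alpha_bar) * eps

noisy = q_sample(tau0, alpha_bar=0.5, rng=0)
print(noisy.shape)  # (16, 5)
```

Diffusing the whole array at once is what makes the approach non-autoregressive: every timestep of the trajectory is denoised jointly.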
| Task | Things to Diffuse | How to Diffuse |
|---|---|---|
| Image Generation | ![]() | ![]() |
| Planning | ![]() | ![]() |
---
layout: iframe
---
- Objective: make the trained model generalize to new constraints and tasks.
- Common cases: goal-conditioned generation, safety constraints, new tasks, etc.
- Possible methods:
  - Guidance function (d): shift the distribution with an extra gradient.
  - Classifier-free method: learn a model that represents both the conditional and unconditional distributions.
  - Inpainting (a): fill in the missing part of the trajectory by fixing certain start and end states.
- Guidance function: shift the distribution with an extra gradient.
  - Predefined guidance function:
    - Method: shift the distribution with a manually defined function.
    - Limitation: might lead to OOD samples, which break the learned diffusion process.
  - Learned classifier:
    - Method: learn a classifier to distinguish between different constraints (similar to a GAN discriminator).
    - Limitation: hard to tune parameters.
- Classifier-free method: learn a model that represents both the conditional and unconditional distributions.
  - Method: drop out the condition term during training so a single model covers both cases.
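The two soft-conditioning options above reduce to one-line updates. In this sketch, `score`, `grad_J`, and the epsilon predictions are hypothetical stand-ins for learned models:

```python
import numpy as np

def guided_score(score, grad_J, x, lam=0.1):
    """Guidance function: shift the learned score by the gradient of an objective J."""
    return score(x) + lam * grad_J(x)

def classifier_free_eps(eps_cond, eps_uncond, w=1.5):
    """Classifier-free guidance: extrapolate from the unconditional toward the
    conditional noise prediction; w = 0 recovers the conditional model alone."""
    return (1 + w) * eps_cond - w * eps_uncond

# Toy scalar check of the blending formula:
print(classifier_free_eps(1.0, 0.0, w=1.5))  # 2.5
```

Increasing `w` strengthens the conditioning at the risk of leaving the data manifold, which mirrors the OOD limitation noted above for guidance.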
| Guidance Function Method | Classifier-Free Method |
|---|---|
| ![]() | ![]() |
- Inpainting: fill in the missing part of the trajectory by fixing certain start and end states.
  - Method: fix the start and end states, then generate the intermediate portion of the trajectory.
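Inpainting amounts to overwriting the constrained entries after every denoising step. A minimal sketch, where the mask and known values are hypothetical:

```python
import numpy as np

def apply_inpainting(tau, known, mask):
    """Hard constraint: overwrite masked entries (e.g. start/goal states) with known values."""
    out = tau.copy()
    out[mask] = known[mask]
    return out

# Toy trajectory: 16 steps x (3 state + 2 action) dims; pin the start and goal states.
rng = np.random.default_rng(0)
tau = rng.standard_normal((16, 5))
known = np.zeros((16, 5))
mask = np.zeros((16, 5), dtype=bool)
mask[0, :3] = True    # start state
mask[-1, :3] = True   # goal state
tau = apply_inpainting(tau, known, mask)
print(tau[0, :3], tau[-1, :3])  # both pinned to zeros
```

Because the constraint is re-imposed at every step, the denoiser fills in the free entries consistently with the pinned ones, which is why inpainting complements the two gradient-based methods.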
- Common thing to diffuse: trajectory.
- Common way to impose constraints/add objectives: guidance function, classifier-free method, inpainting.
A detailed summary of each method can be found here.
The key of diffusion: how to get the score function.
- How to get the score function: data-driven v.s. analytical.
  - Data-driven: learn the score function from data.
  - Hybrid: learn from intermediate optimization results.
  - Analytical: use the analytical score function.
- What to diffuse: sequential v.s. non-sequential.
  - Action/Value: learn a model to match the action/value distribution; serves as a regularizer and policy.
  - Transition: learn a model to match the transition distribution; serves as a world model ▶️ MPC.
  - Trajectory: learn a model to match the trajectory distribution; serves as a TO solver (planning over states v.s. state-actions v.s. actions).
- How to impose constraints/objectives: hard v.s. soft.
  - Guidance function: predefined or learned.
  - Classifier-free: use the unconditional and conditional scores together (most common).
  - Inpainting: fix known states and fill in the missing parts of the distribution (complementary to the other two).
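For the data-driven case, a learned noise predictor and the score are interchangeable. A sketch of the standard conversion and one DDPM reverse step (illustrative only; `eps_pred` would come from a trained network):

```python
import numpy as np

def eps_to_score(eps_pred, alpha_bar_t):
    """Noise prediction and score are related by: score = -eps / sqrt(1 - alpha_bar_t)."""
    return -eps_pred / np.sqrt(1.0 - alpha_bar_t)

def ddpm_reverse_step(x_t, eps_pred, alpha_t, alpha_bar_t, sigma_t, rng=None):
    """One DDPM denoising step: posterior mean computed from eps_pred, plus sigma_t noise."""
    rng = np.random.default_rng(rng)
    mean = (x_t - (1.0 - alpha_t) / np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_t)
    return mean + sigma_t * rng.standard_normal(np.shape(x_t))
```

Iterating `ddpm_reverse_step` from pure noise down to $t = 0$ is what turns the score function into samples, whether the diffused variable is an image, an action, or a whole trajectory.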
- Diffusion in robotics: matches the dataset distribution in control and planning.
- Use cases: imitation learning, offline RL, model-based RL.
- Role: learns a policy, planner, or model as a distribution-matching problem.
- Advantages: high-dimensional matching, stability, scalability.
- Challenges:
  - 🕒 Computational cost: longer training and inference time.
  - 🔀 Shifting distribution: difficulties in adapting to dynamic datasets.
  - 📊 High variance: inconsistent performance in precision tasks.
  - ⛔ Constraint satisfaction: limited adaptability to new constraints.