Diffusion Policy Policy Optimization

doi:10.48550/arxiv.2409.00588

Allen Z. Ren, +8 more

- 31 Aug 2024

6

TL;DR: DPPO is an algorithmic framework that fine-tunes diffusion-based policies with policy gradient methods in continuous control and robot learning tasks. It achieves stronger overall performance and efficiency than other RL methods for diffusion-based policies by exploiting synergies between RL fine-tuning and the diffusion parameterization.

Abstract: We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they had been conjectured to be less efficient for diffusion-based policies. Surprisingly, we show that DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks compared to other RL methods for diffusion-based policies and also compared to PG fine-tuning of other policy parameterizations. Through experimental investigation, we find that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization, leading to structured and on-manifold exploration, stable training, and strong policy robustness. We further demonstrate the strengths of DPPO in a range of realistic settings, including simulated robotic tasks with pixel observations, and via zero-shot deployment of simulation-trained policies on robot hardware in a long-horizon, multi-stage manipulation task. Website with code: diffusion-ppo.github.io
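
The abstract does not spell out the training objective. As a rough illustration only, and not the authors' implementation, the sketch below assumes that each denoising step samples from a diagonal Gaussian with a tractable log-likelihood and that fine-tuning maximizes a PPO-style clipped surrogate (see the Proximal Policy Optimization Algorithms entry under References). All function names, shapes, and numbers are hypothetical.

```python
import math


def gaussian_log_prob(x, mean, std):
    """Log-density of a diagonal Gaussian at x (equal-length lists of floats)."""
    return sum(
        -0.5 * (((xi - mi) / si) ** 2 + math.log(2.0 * math.pi * si * si))
        for xi, mi, si in zip(x, mean, std)
    )


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate (to be maximized), averaged over samples."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # likelihood ratio new / old
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


# Hypothetical toy batch: three samples, each a single denoising step
# that produces a 2-D action. The values are made up for illustration.
samples = [[0.1, -0.2], [0.4, 0.0], [-0.3, 0.5]]
mean_old = [[0.0, 0.0]] * 3    # denoising mean under the data-collecting policy
mean_new = [[0.05, 0.05]] * 3  # denoising mean after a small parameter update
std = [0.5, 0.5]               # assumed fixed denoising noise scale
advantages = [1.0, -0.5, 0.3]  # assumed advantage estimates

logp_old = [gaussian_log_prob(x, m, std) for x, m in zip(samples, mean_old)]
logp_new = [gaussian_log_prob(x, m, std) for x, m in zip(samples, mean_new)]
objective = clipped_surrogate(logp_new, logp_old, advantages)
```

In a real training loop the Gaussian means would come from the denoising network and `objective` would be maximized by gradient ascent on its parameters; the sketch only evaluates the surrogate for fixed numbers.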

Citations

Journal Article•10.48550/arxiv.2503.10434

Finetuning Generative Trajectory Model with Reinforcement Learning from Human Feedback

Derun Li, +11 more

- 13 Mar 2025

- arXiv.org

TL;DR: This study introduces TrajHF, a fine-tuning framework that uses human feedback and reinforcement learning to align generative trajectory models with diverse driving styles. It achieves performance comparable to the state of the art on the NavSim benchmark while ensuring safety and feasibility.

4

Journal Article•10.48550/arxiv.2409.19949

Task-agnostic Pre-training and Task-guided Fine-tuning for Versatile Diffusion Planner

Chenyou Fan, +5 more

- 30 Sep 2024

TL;DR: This paper proposes SODP, a two-stage framework that leverages sub-optimal data to learn a versatile diffusion planner, generalizable for various downstream tasks, via pre-training and task-guided fine-tuning with reinforcement learning.

1

Journal Article•10.48550/arxiv.2508.03645

DiWA: Diffusion Policy Adaptation with World Models

Akshay L Chandra, +5 more

- 05 Aug 2025

- arXiv.org

TL;DR: DiWA, a novel framework, leverages a world model to fine-tune diffusion-based robotic skills offline with reinforcement learning, achieving dramatically improved sample efficiency and requiring orders of magnitude fewer physical interactions than model-free baselines.

Journal Article•10.48550/arxiv.2506.05294

A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search

Arnav Kumar Jain, +7 more

- 05 Jun 2025

- arXiv.org

TL;DR: This paper proposes SAILOR, a robust imitation learning approach that teaches agents to search and recover from mistakes, outperforming state-of-the-art diffusion policies on visual manipulation tasks, and demonstrating improved robustness and nuance in failure identification.

Journal Article•10.48550/arxiv.2505.07261

CHD: Coupled Hierarchical Diffusion for Long-Horizon Tasks

Ce Hao, +3 more

- 12 May 2025

- arXiv.org

References

Journal Article•10.1198/TECH.2007.S518

Pattern Recognition and Machine Learning

Radford M. Neal

- 01 Aug 2007

- Technometrics

TL;DR: A review, published in Technometrics, of the textbook Pattern Recognition and Machine Learning by Christopher M. Bishop.

30.8K

Posted Content

Proximal Policy Optimization Algorithms

John Schulman, +4 more

- 20 Jul 2017

- arXiv: Learning

TL;DR: A new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent, are proposed.

18K

Journal Article•10.1007/BF00992696

Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning

Ronald J. Williams

- 01 May 1992

- Machine Learning

TL;DR: This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement, in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, without explicitly computing gradient estimates.

10.1K

Proceedings Article•10.48550/arXiv.2203.02155

Training language models to follow instructions with human feedback

Long Ouyang, +19 more

- 04 Mar 2022

TL;DR: The results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent, improving truthfulness and reducing toxic output generation with minimal performance regressions on public NLP datasets.

7.1K

Posted Content

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Tuomas Haarnoja, +3 more

- 04 Jan 2018

- arXiv: Learning

TL;DR: In this article, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework is proposed, where the actor aims to maximize expected reward while also maximizing entropy.

6.7K
