Journal Article
DOI: 10.48550/arxiv.2409.00588
Diffusion Policy Policy Optimization
Allen Z. Ren,Justin Lidard,Lars Lien Ankile,Anthony Simeonov,Pulkit Agrawal,Anirudha Majumdar,Benjamin Burchfiel,Hongkai Dai,Max Simchowitz +8 more
- 31 Aug 2024
TL;DR: DPPO is an algorithmic framework that fine-tunes diffusion-based policies in continuous control and robot learning tasks using policy gradient methods. It achieves stronger performance and efficiency than other RL methods for diffusion-based policies by exploiting synergies between RL fine-tuning and the diffusion parameterization.
Abstract: We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they had been conjectured to be less efficient for diffusion-based policies. Surprisingly, we show that DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks compared to other RL methods for diffusion-based policies and also compared to PG fine-tuning of other policy parameterizations. Through experimental investigation, we find that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization, leading to structured and on-manifold exploration, stable training, and strong policy robustness. We further demonstrate the strengths of DPPO in a range of realistic settings, including simulated robotic tasks with pixel observations, and via zero-shot deployment of simulation-trained policies on robot hardware in a long-horizon, multi-stage manipulation task. Website with code: diffusion-ppo.github.io
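As a rough illustration of the policy gradient (PG) method the abstract refers to, the sketch below estimates the gradient of expected reward for a one-dimensional Gaussian policy with the score-function estimator. It is a toy example, not DPPO: the policy is a plain Gaussian rather than a diffusion model, and the function name `score_function_gradient`, the reward `toy_reward`, and all parameter values are assumptions made for illustration.

```python
import random


def score_function_gradient(mu, sigma, reward_fn, num_samples=20000, seed=0):
    """Monte Carlo estimate of d/dmu E[reward(a)] for a ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        action = rng.gauss(mu, sigma)
        # Score of the Gaussian density with respect to its mean.
        score = (action - mu) / (sigma ** 2)
        # Weight the score by the reward obtained for this sampled action.
        total += reward_fn(action) * score
    return total / num_samples


def toy_reward(action):
    # Hypothetical reward, highest at action = 2.
    return -(action - 2.0) ** 2


grad = score_function_gradient(mu=0.0, sigma=1.0, reward_fn=toy_reward)
```

For this reward the exact gradient at `mu = 0` is 4, so the estimate should be positive and push the mean toward 2. Subtracting a baseline from the reward is a common way to reduce the variance of this estimator.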
Citations
Finetuning Generative Trajectory Model with Reinforcement Learning from Human Feedback
Derun Li,Jianwei Ren,Yue Wang,Xin Wen,Pengxiang Li,Leimeng Xu,Kun Zhan,Zhongpu Xia,Peng Jia,Xianpeng Lang,Ningyi Xu,Hang Zhao +11 more
TL;DR: This study introduces TrajHF, a fine-tuning framework that uses human feedback and reinforcement learning to align generative trajectory models with diverse driving styles. It achieves performance comparable to the state of the art on the NavSim benchmark while ensuring safety and feasibility.
Task-agnostic Pre-training and Task-guided Fine-tuning for Versatile Diffusion Planner
Chenyou Fan,Chenjia Bai,Shan Zhao,Haoran He,Yang Zhang,Zhen Wang +5 more
- 30 Sep 2024
TL;DR: This paper proposes SODP, a two-stage framework that leverages sub-optimal data to learn a versatile diffusion planner, generalizable for various downstream tasks, via pre-training and task-guided fine-tuning with reinforcement learning.
DiWA: Diffusion Policy Adaptation with World Models
Akshay L Chandra,Iman Nematollahi,Chen Huang,Tim Welschehold,Wolfram Burgard,A. Valada +5 more
TL;DR: DiWA, a novel framework, leverages a world model to fine-tune diffusion-based robotic skills offline with reinforcement learning, achieving dramatically improved sample efficiency and requiring orders of magnitude fewer physical interactions than model-free baselines.
A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search
Arnav Kumar Jain,Vibhakar Mohta,Subin Kim,Atiksh Bhardwaj,Juntao Ren,Yunhai Feng,Sanjiban Choudhury,Gokul Swamy +7 more
TL;DR: This paper proposes SAILOR, a robust imitation learning approach that teaches agents to search and recover from mistakes, outperforming state-of-the-art diffusion policies on visual manipulation tasks, and demonstrating improved robustness and nuance in failure identification.
References
Pattern Recognition and Machine Learning
TL;DR: This textbook introduces pattern recognition and machine learning from a Bayesian perspective, covering probability distributions, linear models for regression and classification, neural networks, kernel methods, graphical models, and approximate inference.
30.8K
•Posted Content
Proximal Policy Optimization Algorithms
TL;DR: A new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent, are proposed.
18K
Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning
TL;DR: This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. The algorithms are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement, in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do so without explicitly computing gradient estimates.
Training language models to follow instructions with human feedback
Long Ouyang,Jeffrey Wu,Xu Jiang,Diogo Almeida,Carroll L. Wainwright,Pamela Mishkin,Chong Zhang,Sandhini Agarwal,Katarina Slama,Alex Ray,John Schulman,Jacob Hilton,Fraser Kelton,Luke E. Miller,Maddie Simens,Amanda Askell,Peter Welinder,Paul F. Christiano,Jan Leike,Ryan Lowe +19 more
- 04 Mar 2022
TL;DR: The results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent: it improves truthfulness and reduces toxic output generation, with minimal performance regressions on public NLP datasets.
7.1K
•Posted Content
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
TL;DR: In this article, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework is proposed, where the actor aims to maximize expected reward while also maximizing entropy.
6.7K
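The Proximal Policy Optimization entry in the reference list above mentions optimizing a "surrogate" objective with stochastic gradient ascent. A minimal sketch of the clipped form of that objective, the variant most commonly used in practice, is given below. The function name `clipped_surrogate`, the default `clip_eps=0.2`, and the sample inputs are assumptions for illustration; this is not code from DPPO or from any particular library.

```python
def clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """Average clipped surrogate objective over a batch of samples.

    ratios:     new-policy probability divided by old-policy probability,
                one value per sampled action.
    advantages: advantage estimate for each sampled action.
    """
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        # Restrict the ratio to the interval [1 - clip_eps, 1 + clip_eps].
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)


# Hypothetical batch: one ratio above the clip range, one below, one inside.
objective = clipped_surrogate([1.5, 0.5, 1.0], [1.0, -1.0, 2.0])
```

Taking the minimum of the two terms removes the incentive to move the probability ratio far from 1 in a single update, which is the mechanism behind the stable updates this family of methods is known for.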