Provably Efficient Reinforcement Learning with Linear Function Approximation

Open Access • Posted Content

- 11 Jul 2019

360

TL;DR: This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of feature space, $H$ is the length of each episode, and $T$ is the total number of steps. This regret is independent of the number of states and actions.

Abstract: Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation tradeoff. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed. This paper presents the first provable RL algorithm with both polynomial runtime and polynomial sample complexity in this linear setting, without requiring a "simulator" or additional assumptions. Concretely, we prove that an optimistic modification of Least-Squares Value Iteration (LSVI)---a classical algorithm frequently studied in the linear setting---achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of feature space, $H$ is the length of each episode, and $T$ is the total number of steps. Importantly, such regret is independent of the number of states and actions.
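The abstract does not spell out the optimistic modification, so the following is only a minimal sketch of the general idea, under stated assumptions: fit a ridge regression of value targets on state-action features, then add an exploration bonus that is large in feature directions with little data. The function name `optimistic_q_values`, the parameters `beta` and `lam`, and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def optimistic_q_values(phi_hist, targets, phi_query, beta=1.0, lam=1.0, H=1.0):
    """Ridge regression plus an elliptical exploration bonus (illustrative only).

    phi_hist:  (n, d) features of previously visited state-action pairs.
    targets:   (n,) regression targets, e.g. reward plus a next-step value estimate.
    phi_query: (m, d) features at which to evaluate the optimistic Q-value.
    beta, lam: bonus scale and ridge parameter (assumed, not the paper's constants).
    H:         upper clip on the returned values (the episode length in the abstract).
    """
    d = phi_hist.shape[1]
    # Regularized Gram matrix of the observed features.
    gram = phi_hist.T @ phi_hist + lam * np.eye(d)
    # Least-squares weights for the linear value estimate.
    w = np.linalg.solve(gram, phi_hist.T @ targets)
    # Bonus beta * sqrt(phi^T gram^{-1} phi), computed per query row.
    gram_inv_phi = np.linalg.solve(gram, phi_query.T)  # shape (d, m)
    bonus = beta * np.sqrt(np.sum(phi_query.T * gram_inv_phi, axis=0))
    # Optimistic estimate, clipped so it never exceeds H.
    return np.minimum(phi_query @ w + bonus, H)

# Toy usage: one observation along the first feature direction only.
phi_hist = np.array([[1.0, 0.0]])
targets = np.array([1.0])
phi_query = np.array([[1.0, 0.0], [0.0, 1.0]])
q = optimistic_q_values(phi_hist, targets, phi_query, beta=1.0, lam=1.0, H=10.0)
print(q)
```

In this toy example the second query direction has never been observed, so its bonus is larger than the bonus for the first direction; that gap is what drives exploration in optimism-based methods.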

Citations

•Posted Content

On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

Alekh Agarwal, +3 more

- 01 Aug 2019

- arXiv: Learning

TL;DR: This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs), and shows an important interplay between estimation error, approximation error, and exploration.

367

•Posted Content

Provably Efficient Exploration in Policy Optimization

Qi Cai, +3 more

- 12 Dec 2019

- arXiv: Learning

TL;DR: This paper proves that, for episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves sublinear regret.

176

•Proceedings Article

Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning

Simon S. Du, +3 more

- 30 Apr 2020

TL;DR: This article shows that even if the agent has a highly accurate linear representation, it still needs to sample an exponential number of trajectories to find a near-optimal policy.

166

•Posted Content

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

Dongruo Zhou, +2 more

- 15 Dec 2020

- arXiv: Learning

TL;DR: This work proposes a new Bernstein-type concentration inequality for self-normalized martingales, applicable to linear bandit problems with bounded noise, and a new, computationally efficient algorithm with linear function approximation, named UCRL-VTR, for linear mixture MDPs in the episodic undiscounted setting.

141

•Posted Content

Naive Exploration is Optimal for Online LQR

Max Simchowitz, +1 more

- 27 Jan 2020

- arXiv: Learning

TL;DR: New upper and lower bounds are proved demonstrating that the optimal regret scales as $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state.

131

References

•Book

Reinforcement Learning: An Introduction

Richard S. Sutton, +1 more

- 01 Jan 1988

TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

39.7K

Advances in Neural Information Processing Systems 28

Peter A. Flach, +1 more

- 12 Dec 2015

13.6K

•Book

Markov Decision Processes: Discrete Stochastic Dynamic Programming

Martin L. Puterman

- 15 Apr 1994

TL;DR: This book provides a unified and rigorous treatment of the theoretical, computational, and applied research on Markov decision process models, focusing primarily on infinite-horizon discrete-time models and models with discrete state spaces, while also examining models with arbitrary state spaces, finite-horizon models, and continuous-time discrete-state models.

12.3K

Advances in Neural Information Processing Systems 29

Onur Teymur, +2 more

- 01 Jan 2016

11K

Related Papers (5)

Near-optimal Regret Bounds for Reinforcement Learning

Reinforcement Learning: An Introduction

Contextual decision processes with low Bellman rank are PAC-learnable

Is Q-learning Provably Efficient?

Minimax regret bounds for reinforcement learning