Open Access • Posted Content
Provably Efficient Reinforcement Learning with Linear Function Approximation
TL;DR: This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of the feature space, $H$ is the length of each episode, and $T$ is the total number of steps. The bound is independent of the number of states and actions.
Abstract: Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation tradeoff. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed.
This paper presents the first provable RL algorithm with both polynomial runtime and polynomial sample complexity in this linear setting, without requiring a "simulator" or additional assumptions. Concretely, we prove that an optimistic modification of Least-Squares Value Iteration (LSVI)---a classical algorithm frequently studied in the linear setting---achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of feature space, $H$ is the length of each episode, and $T$ is the total number of steps. Importantly, such regret is independent of the number of states and actions.
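The abstract names the algorithm but not its mechanics. The following is a minimal sketch of one common reading of optimistic LSVI, not the authors' reference implementation: at each step $h$, fit the weights by ridge regression on past transitions, then add an exploration bonus of the form `beta * sqrt(phi^T Lambda^{-1} phi)` and clip the value at $H$. The function names, the toy two-state environment, the one-hot feature map, and the choices `lam=1.0` and `beta=H` are all assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def sherman_morrison(M_inv, f):
    """Return (M + f f^T)^{-1} given M_inv = M^{-1}, for symmetric M."""
    u = matvec(M_inv, f)
    denom = 1.0 + dot(f, u)
    n = len(f)
    return [[M_inv[i][j] - u[i] * u[j] / denom for j in range(n)] for i in range(n)]


def lsvi_ucb(reset, step, phi, actions, d, H, K, lam=1.0, beta=1.0):
    """Run K episodes of a sketch of optimistic LSVI; return per-episode rewards."""
    # For every step h: inverse of the regularized Gram matrix (starts at I / lam)
    # and the list of past transitions (features, reward, next_state).
    gram_inv = [[[(1.0 / lam if i == j else 0.0) for j in range(d)]
                 for i in range(d)] for _ in range(H)]
    data = [[] for _ in range(H)]
    returns = []
    for _ in range(K):
        w = [[0.0] * d for _ in range(H)]

        def q(h, s, a):
            # Optimistic action value: linear estimate plus bonus, clipped at H.
            if h == H:
                return 0.0
            f = phi(s, a)
            bonus = beta * math.sqrt(max(dot(f, matvec(gram_inv[h], f)), 0.0))
            return min(dot(w[h], f) + bonus, float(H))

        # Backward pass: ridge regression onto optimistic one-step targets.
        for h in reversed(range(H)):
            b = [0.0] * d
            for f, r, s_next in data[h]:
                target = r + max(q(h + 1, s_next, a) for a in actions)
                b = [bi + fi * target for bi, fi in zip(b, f)]
            w[h] = matvec(gram_inv[h], b)

        # Forward pass: act greedily with respect to the optimistic values.
        s, total = reset(), 0.0
        for h in range(H):
            a = max(actions, key=lambda act: q(h, s, act))
            r, s_next = step(h, s, a)
            f = phi(s, a)
            data[h].append((f, r, s_next))
            gram_inv[h] = sherman_morrison(gram_inv[h], f)
            total += r
            s = s_next
        returns.append(total)
    return returns


# Hypothetical toy problem: 2 states, 2 actions, one-hot features (d = 4).
def toy_phi(s, a):
    f = [0.0] * 4
    f[2 * s + a] = 1.0
    return f


def toy_reset():
    return 0


def toy_step(h, s, a):
    # Action 1 pays reward 1; the next state equals the chosen action.
    return (1.0 if a == 1 else 0.0), a


H = 3
returns = lsvi_ucb(toy_reset, toy_step, toy_phi, actions=[0, 1],
                   d=4, H=H, K=20, beta=float(H))
```

The cost of each episode depends on the feature dimension `d` and the amount of stored data, not on the size of the state space, which mirrors the abstract's point that the guarantee is independent of the number of states and actions. The sketch recomputes the regression targets from all stored transitions every episode, because the optimistic values at step `h + 1` change as more data arrives.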
Citations
•Posted Content
On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift
TL;DR: This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs), and shows an important interplay between estimation error, approximation error, and exploration.
367
•Posted Content
Provably Efficient Exploration in Policy Optimization
TL;DR: This paper proves that, for episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret.
•Proceedings Article
Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning
Simon S. Du,Sham M. Kakade,Ruosong Wang,Lin F. Yang +3 more
- 30 Apr 2020
TL;DR: This article shows that even if the agent has a highly accurate linear representation, it still needs to sample an exponential number of trajectories to find a near-optimal policy.
•Posted Content
Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes
TL;DR: A new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise and a new, computationally efficient algorithm with linear function approximation named UCRL-VTR for the aforementioned linear mixture MDPs in the episodic undiscounted setting are proposed.
•Posted Content
Naive Exploration is Optimal for Online LQR
Max Simchowitz,Dylan J. Foster +1 more
TL;DR: New upper and lower bounds are proved demonstrating that the optimal regret scales as $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state.
131
References
•Book
Reinforcement Learning: An Introduction
Richard S. Sutton,Andrew G. Barto +1 more
- 01 Jan 1988
TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.
Mastering the game of Go with deep neural networks and tree search
David Silver,Aja Huang,Chris J. Maddison,Arthur Guez,Laurent Sifre,George van den Driessche,Julian Schrittwieser,Ioannis Antonoglou,Veda Panneershelvam,Marc Lanctot,Sander Dieleman,Dominik Grewe,John Nham,Nal Kalchbrenner,Ilya Sutskever,Timothy P. Lillicrap,Madeleine Leach,Koray Kavukcuoglu,Thore Graepel,Demis Hassabis +19 more
TL;DR: Using this search algorithm, the program AlphaGo achieved a 99.8% winning rate against other Go programs and defeated the human European Go champion by 5 games to 0, the first time that a computer program has defeated a human professional player in the full-sized game of Go.
Advances in Neural Information Processing Systems 28
Peter A. Flach,Meelis Kull +1 more
- 12 Dec 2015
13.6K
•Book
Markov Decision Processes: Discrete Stochastic Dynamic Programming
Martin L. Puterman
- 15 Apr 1994
TL;DR: Puterman provides an up-to-date, unified, and rigorous treatment of the theoretical, computational, and applied research on Markov decision process models, focusing primarily on infinite-horizon discrete-time models and models with discrete state spaces, while also examining models with arbitrary state spaces, finite-horizon models, and continuous-time discrete-state models.
12.3K
Advances in Neural Information Processing Systems 29
Onur Teymur,Kostas Zygalakis,Ben Calderhead +2 more
- 01 Jan 2016
11K