Continuous upper confidence trees

doi:10.1007/978-3-642-25566-3_32

Open Access, Book Chapter

Adrien Couëtoux, +4 more

- 17 Jan 2011

- pp 433-445

222

TL;DR: The authors conjecture that the double-progressive widening trick can be used for other algorithms as well, as a general tool for ensuring a good bias/variance compromise in search algorithms.

Abstract: Upper Confidence Trees are a very efficient tool for solving Markov Decision Processes; originating in difficult games like the game of Go, they are in particular surprisingly efficient in high-dimensional problems. It is known that they can be adapted to continuous domains in some cases (in particular continuous action spaces). We present an extension of Upper Confidence Trees to continuous stochastic problems. We (i) show a deceptive problem on which the classical Upper Confidence Tree approach does not work, even with arbitrarily large computational power and with progressive widening; (ii) propose an improvement, termed double-progressive widening, which takes care of the compromise between variance (we want infinitely many simulations for each action/state) and bias (we want sufficiently many nodes to avoid a bias by the first nodes) and which extends the classical progressive widening; (iii) discuss its consistency and show experimentally that it performs well on the deceptive problem and on experimental benchmarks. We conjecture that the double-progressive widening trick can be used for other algorithms as well, as a general tool for ensuring a good bias/variance compromise in search algorithms.
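The widening idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formula, so the test `k < c * n**exponent` (a commonly used form of progressive widening), the constants, and the names `may_widen`, `Node`, `widen_or_reuse`, and `count_children` are all assumptions made for illustration.

```python
import random


def may_widen(num_children, visits, c, exponent):
    """Assumed widening test: allow a new child while k < c * n**exponent."""
    return num_children < c * max(visits, 1) ** exponent


class Node:
    """A tree node holding a visit count and the children kept so far."""

    def __init__(self):
        self.visits = 0
        self.children = []


def widen_or_reuse(node, sample_new, rng, c=1.0, exponent=0.5):
    """Visit `node`; add a freshly sampled child if allowed, else reuse one."""
    node.visits += 1
    if may_widen(len(node.children), node.visits, c, exponent):
        node.children.append(sample_new(rng))
        return node.children[-1]
    # Simplification (assumption): reuse a stored child uniformly at random.
    # A real planner would choose actions by an upper confidence bound and
    # sampled next states by their visit counts.
    return rng.choice(node.children)


def count_children(total_visits, c=1.0, exponent=0.5, seed=0):
    """Count how many children a single node keeps after `total_visits`."""
    rng = random.Random(seed)
    node = Node()
    for _ in range(total_visits):
        widen_or_reuse(node, lambda r: r.random(), rng, c, exponent)
    return len(node.children)
```

Under these assumptions, applying the same helper twice gives the "double" widening: once at state nodes, where the children are candidate actions, and once at action nodes, where the children are sampled next states, possibly with a different exponent. The number of kept children grows sublinearly with the visit count, so new nodes keep appearing (limiting bias from the first nodes) while each stored child continues to accumulate simulations (limiting variance).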

Figures

Fig. 1. Shape of the reward function: Trap problem.

Fig. 2. Mean of the reward, for the trap problem with a = 70, h = 100, l = 1, w = 0.7, R = 0.01. The estimated standard deviations (STD) of the rewards are STD_DPW = [13.06, 12.88, 12.88, 12.06, 14.70, 0, 0] for Double PW and STD_SPW = [7.16, 7.16, 8.63, 9.05, 0, 0, 0] for Simple PW; the differences are clearly significant.

Fig. 3. Median of the reward, for the trap problem with a = 70, h = 100, l = 1, w = 0.7, R = 0.01.

Fig. 4. The power management problem. Median values of cumulated reward with 2 stocks and 5 time steps.

Fig. 5. The power management problem. Median values of cumulated reward. Experiments with 6 stocks and 21 time steps.

Citations

•Journal Article•10.1109/TCIAIG.2012.2186810

A Survey of Monte Carlo Tree Search Methods

Cameron Browne, +9 more

- 03 Feb 2012

- IEEE Transactions on Computational Intel...

TL;DR: A survey of the literature to date on Monte Carlo tree search, intended to provide a snapshot of the state of the art after the first five years of MCTS research; it outlines the core algorithm's derivation, imparts some structure on the many variations and enhancements that have been proposed, and summarizes the results from the key game and nongame domains.

3.5K

•Journal Article•10.1109/TIV.2019.2955905

Combining Planning and Deep Reinforcement Learning in Tactical Decision Making for Autonomous Driving

Carl-Johan Hoel, +4 more

- 01 Jun 2020

TL;DR: A general framework for tactical decision making is introduced, which combines the concepts of planning and learning, in the form of Monte Carlo tree search and deep reinforcement learning, based on the AlphaGo Zero algorithm, extended to a domain with a continuous state space where self-play cannot be used.

316

•Proceedings Article•10.1109/IVS.2018.8500400

Adaptive Stress Testing for Autonomous Vehicles

Mark Koren, +3 more

- 26 Jun 2018

TL;DR: It is shown that DRL can find more likely failure scenarios than MCTS with fewer calls to the simulator and can be easily applied to other scenarios given the appropriate models of the vehicle and the environment.

227

•Posted Content

Online algorithms for POMDPs with continuous state, action, and observation spaces

Zachary N. Sunberg, +1 more

- 18 Sep 2017

- arXiv: Artificial Intelligence

TL;DR: In this paper, double progressive widening (DPW) is used to solve partially observable Markov decision processes with continuous state, action, and observation spaces. This modification alone is not sufficient, however, because the belief representations in the search tree collapse to a single particle, causing the algorithm to converge to a policy that is suboptimal regardless of the computation time.

114

•Proceedings Article•10.1109/ITSC.2019.8916924

Cooperation-Aware Reinforcement Learning for Merging in Dense Traffic

Maxime Bouton, +3 more

- 01 Oct 2019

TL;DR: In this article, a reinforcement learning approach is presented that learns how to interact with drivers with different cooperation levels in a dense merging scenario, with fewer deadlocks than online planning methods.

99

References

•Book Chapter•10.1007/11871842_29

Bandit based monte-carlo planning

Levente Kocsis, +1 more

- 18 Sep 2006

TL;DR: In this article, a bandit-based Monte-Carlo planning algorithm is proposed for large state-space Markovian decision problems (MDPs), for which Monte-Carlo planning is one of the few viable approaches to finding near-optimal solutions.

3.3K

•Journal Article•10.1016/0196-8858(85)90002-8

Asymptotically efficient adaptive allocation rules

Tze Leung Lai, +1 more

- 01 Mar 1985

- Advances in Applied Mathematics

2.8K

•Journal Article

Using confidence bounds for exploitation-exploration trade-offs

Peter Auer

- 01 Mar 2003

- Journal of Machine Learning Research

TL;DR: It is shown how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation-exploration trade-off, and improves the regret from O(T^{3/4}) to O(T^{1/2}).

1.9K

•Book Chapter•10.1007/978-3-540-75538-8_7

Efficient selectivity and backup operators in Monte-Carlo tree search

Rémi Coulom

- 29 May 2006

TL;DR: A new framework that combines tree search with Monte-Carlo evaluation, and does not separate a min-max phase from a Monte-Carlo phase, is presented; it provides fine-grained control of the tree growth, at the level of individual simulations, and allows efficient selectivity.

1.5K

•Proceedings Article•10.1145/1273496.1273531

Combining online and offline knowledge in UCT

Sylvain Gelly, +1 more

- 20 Jun 2007

TL;DR: This work considers three approaches for combining offline and online value functions in the UCT algorithm, and combines these algorithms in MoGo, the world's strongest 9 x 9 Go program, where each technique significantly improves MoGo's playing strength.

584

Related Papers (5)

Bandit based monte-carlo planning

A Survey of Monte Carlo Tree Search Methods

Efficient selectivity and backup operators in Monte-Carlo tree search

Finite-time Analysis of the Multiarmed Bandit Problem

Reinforcement Learning: An Introduction