Continuous upper confidence trees
Adrien Couëtoux,Jean-Baptiste Hoock,Nataliya Sokolovska,Olivier Teytaud,Nicolas Bonnard +4 more
- 17 Jan 2011
- pp 433-445
TL;DR: The authors conjecture that the double-progressive widening trick can be used for other algorithms as well, as a general tool for ensuring a good bias/variance compromise in search algorithms.
Abstract: Upper Confidence Trees are a very efficient tool for solving Markov Decision Processes. Originating in difficult games such as Go, the approach is surprisingly efficient in high-dimensional problems in particular. It is known that it can be adapted to continuous domains in some cases, in particular to continuous action spaces. We present an extension of Upper Confidence Trees to continuous stochastic problems. We (i) exhibit a deceptive problem on which the classical Upper Confidence Tree approach does not work, even with arbitrarily large computational power and with progressive widening; (ii) propose an improvement, termed double-progressive widening, which extends classical progressive widening and manages the compromise between variance (we want infinitely many simulations for each action/state) and bias (we want sufficiently many nodes to avoid a bias by the first nodes); and (iii) discuss its consistency and show experimentally that it performs well on the deceptive problem and on experimental benchmarks. We conjecture that the double-progressive widening trick can be used for other algorithms as well, as a general tool for ensuring a good bias/variance compromise in search algorithms.
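The abstract does not spell out the widening rule, so the following is only a minimal sketch of the idea under stated assumptions. It assumes the commonly used criterion that a node visited n times may hold at most ceil(c * n^alpha) children, and it applies that same test twice: once to the actions tried in a state, and once to the next states sampled for an action. The names `should_widen`, `choose`, and `run_demo`, the constants `c` and `alpha`, the toy transition, and the uniform reuse of existing children are all illustrative assumptions, not the authors' implementation.

```python
import math
import random


def should_widen(num_children: int, num_visits: int,
                 c: float = 1.0, alpha: float = 0.5) -> bool:
    """Assumed criterion: allow at most ceil(c * visits**alpha) children."""
    return num_children < math.ceil(c * max(num_visits, 1) ** alpha)


def choose(children: dict, visits: int, sample_new, rng: random.Random,
           c: float = 1.0, alpha: float = 0.5):
    """Add a freshly sampled child if widening is allowed, else reuse one.

    `children` maps a key (an action or a next state) to its visit count.
    Reusing a child uniformly at random is a placeholder; a real tree search
    would use a bandit rule for actions.
    """
    if should_widen(len(children), visits, c, alpha):
        key = sample_new()
        children.setdefault(key, 0)
    else:
        key = rng.choice(list(children))
    children[key] += 1
    return key


def run_demo(steps: int = 99, seed: int = 0):
    """Visit one state `steps` times with widening at both levels."""
    rng = random.Random(seed)
    actions = {}      # action -> number of times it was tried
    next_states = {}  # action -> {sampled next state -> visit count}
    for t in range(1, steps + 1):
        # First widening: limit the number of distinct continuous actions.
        a = choose(actions, t, lambda: rng.uniform(-1.0, 1.0), rng)
        # Second widening: limit the number of distinct sampled next states
        # per action (hypothetical noisy transition around the action value).
        choose(next_states.setdefault(a, {}), actions[a],
               lambda: a + rng.gauss(0.0, 0.1), rng)
    return actions, next_states


if __name__ == "__main__":
    actions, next_states = run_demo()
    print(len(actions), "actions after", sum(actions.values()), "visits")
```

With `alpha` below 1, both the number of actions and the number of next states per action grow sublinearly in the visit count, so each retained child keeps receiving more simulations (variance) while new children are still added without bound (bias).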
Figures

Fig. 1. Shape of the reward function: Trap problem. 
![Fig. 2](/figures/fig-2-mean-of-the-reward-for-the-trap-problem-with-a-70-h-2f1qmbu6.png)
Fig. 2. Mean of the reward, for the trap problem with a = 70, h = 100, l = 1, w = 0.7, R = 0.01. The estimated standard deviations (STD) of the rewards are STD_DPW = [13.06, 12.88, 12.88, 12.06, 14.70, 0, 0] for Double PW and STD_SPW = [7.16, 7.16, 8.63, 9.05, 0, 0, 0] for Simple PW; the differences are clearly significant.
Fig. 3. Median of the reward, for the trap problem with a = 70, h = 100, l = 1, w = 0.7, R = 0.01.
Fig. 4. The power management problem. Median values of cumulated reward with 2 stocks and 5 time steps.
Fig. 5. The power management problem. Median values of cumulated reward. Experiments with 6 stocks and 21 time steps.
Citations
A Survey of Monte Carlo Tree Search Methods
Cameron Browne,Edward J. Powley,Daniel Whitehouse,Simon M. Lucas,Peter I. Cowling,Philipp Rohlfshagen,S. Tavener,Diego Perez,Spyridon Samothrakis,Simon Colton +9 more
TL;DR: A survey of the Monte Carlo tree search literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research; it outlines the core algorithm's derivation, imparts some structure on the many variations and enhancements that have been proposed, and summarizes the results from the key game and nongame domains.
Combining Planning and Deep Reinforcement Learning in Tactical Decision Making for Autonomous Driving
Carl-Johan Hoel,Katherine Driggs-Campbell,Krister Wolff,Leo Laine,Mykel J. Kochenderfer +4 more
- 01 Jun 2020
TL;DR: A general framework for tactical decision making is introduced, which combines the concepts of planning and learning, in the form of Monte Carlo tree search and deep reinforcement learning, based on the AlphaGo Zero algorithm, extended to a domain with a continuous state space where self-play cannot be used.
316
Adaptive Stress Testing for Autonomous Vehicles
Mark Koren,Saud Al-Saif,Ritchie Lee,Mykel J. Kochenderfer +3 more
- 26 Jun 2018
TL;DR: It is shown that DRL can find more likely failure scenarios than MCTS with fewer calls to the simulator and can be easily applied to other scenarios given the appropriate models of the vehicle and the environment.
227
•Posted Content
Online algorithms for POMDPs with continuous state, action, and observation spaces
TL;DR: In this paper, double progressive widening (DPW) is used to solve partially observable Markov decision processes with continuous state, action, and observation spaces, but this modification alone is not sufficient because the belief representations in the search tree collapse to a single particle causing the algorithm to converge to a policy that is suboptimal regardless of the computation time.
114
Cooperation-Aware Reinforcement Learning for Merging in Dense Traffic
Maxime Bouton,Alireza Nakhaei,Kikuo Fujimura,Mykel J. Kochenderfer +3 more
- 01 Oct 2019
TL;DR: In this article, a reinforcement learning approach is presented to learn how to interact with drivers with different cooperation levels in a dense merging scenario, with fewer deadlocks than with online planning methods.
99
References
Bandit based monte-carlo planning
Levente Kocsis,Csaba Szepesvári +1 more
- 18 Sep 2006
TL;DR: In this article, a bandit-based Monte-Carlo planning algorithm is proposed for large state-space Markovian decision problems (MDPs), which is one of the few viable approaches to find near-optimal solutions.
•Journal Article
Using confidence bounds for exploitation-exploration trade-offs
TL;DR: It is shown how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation-exploration trade-off, improving the regret from O(T^{3/4}) to O(T^{1/2}).
Efficient selectivity and backup operators in Monte-Carlo tree search
Rémi Coulom
- 29 May 2006
TL;DR: A new framework is presented that combines tree search with Monte-Carlo evaluation without separating a min-max phase from a Monte-Carlo phase; it provides fine-grained control of the tree growth, at the level of individual simulations, and allows efficient selectivity.
Combining online and offline knowledge in UCT
Sylvain Gelly,David Silver +1 more
- 20 Jun 2007
TL;DR: This work considers three approaches for combining offline and online value functions in the UCT algorithm, and combines these algorithms in MoGo, the world's strongest 9 x 9 Go program, where each technique significantly improves MoGo's playing strength.