What is the mapping M_F associated with any averaging method?

The mapping M_F associated with any averaging method F is a nonexpansion in max norm, and is therefore compatible with the parallel value backup operator for any discounted MDP.
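
A short derivation sketch of why this holds. It assumes the fitted value at state i has the form (M_F v)_i = β_i0 c_i + Σ_j β_ij v_j, with nonnegative weights that sum to 1 and a predetermined constant c_i. The symbols β, c, T (the parallel backup operator) and γ (the discount) are notation chosen for this sketch, not taken from the page.

```latex
\begin{align*}
\bigl|(M_F u)_i - (M_F v)_i\bigr|
  &= \Bigl|\sum_j \beta_{ij}\,(u_j - v_j)\Bigr|
   \le \sum_j \beta_{ij}\,|u_j - v_j|
   \le \|u - v\|_\infty , \\
\|M_F T u - M_F T v\|_\infty
  &\le \|T u - T v\|_\infty
   \le \gamma\,\|u - v\|_\infty .
\end{align*}
```

The constant terms cancel in the first line, and the second line uses the fact that T is a max-norm contraction with factor γ, so the composed operator remains a contraction.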

What is the power of the approximate value iteration method?

The power of the approximate value iteration method comes from the fact that the authors can pay attention only to transitions from a certain small set of states.

How can the authors prove convergence for a wide class of approximate temporal difference methods?

The authors have proved convergence for a wide class of approximate temporal difference methods, and shown experimentally that these methods can solve Markov decision processes more efficiently than grids of comparable accuracy.

What is the way to generalize the value iteration backup operator?

The authors can generalize the above single-state version of the value iteration backup operator to allow parallel updating: instead of merely changing their estimate for one state at a time, the authors compute the new value for every state before altering any of the estimates.
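
A minimal sketch of a parallel backup, assuming a small tabular, cost-minimizing MDP. The function name `parallel_backup`, the data layout, and the two-state example are hypothetical illustrations, not the paper's code.

```python
def parallel_backup(values, costs, transitions, discount):
    """One parallel (synchronous) value iteration backup.

    costs[s][a]       -> immediate cost of action a in state s
    transitions[s][a] -> list of (next_state, probability) pairs
    Every new value is computed from the OLD estimates in `values`.
    """
    new_values = []
    for s in range(len(values)):
        best = min(
            costs[s][a]
            + discount * sum(p * values[t] for t, p in transitions[s][a])
            for a in range(len(costs[s]))
        )
        new_values.append(best)
    return new_values


# Hypothetical example: state 0 is an absorbing zero-cost goal.
# In state 1 the agent can move to the goal (cost 1.0) or stay put (cost 0.2).
costs = [[0.0], [1.0, 0.2]]
transitions = [
    [[(0, 1.0)]],
    [[(0, 1.0)], [(1, 1.0)]],
]

values = [0.0, 0.0]
first = parallel_backup(values, costs, transitions, 0.9)
for _ in range(200):
    values = parallel_backup(values, costs, transitions, 0.9)
```

Because the new list is built before any old estimate is overwritten, the update matches the "compute every new value first" description; an in-place loop over states would instead be a sequence of single-state backups.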

What is the way to use the approximate Q-learning algorithm?

Q-learning is designed to work for online problems, where the authors don't know the cost or transition functions and can only sample transitions from their current state.

What are the real numbers in complete vector spaces?

Examples of complete vector spaces include the real numbers under absolute value and the n-vectors of real numbers under Manhattan (L1), Euclidean (L2), and max (L∞) norms.

What is the definition of a real-valued function approximation scheme?

Definition: A real-valued function approximation scheme is an averager if every fitted value is the weighted average of zero or more target values and possibly some predetermined constants.
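
As an illustration of the definition, here is a sketch of one averager, a k-nearest-neighbor average over fixed sample points. The helper name `knn_averager` and the tie-breaking rule (lower index first) are assumptions made for this example.

```python
def knn_averager(sample_xs, k):
    """Build fit(targets): each fitted value is the plain average of the
    targets at the k sample points nearest to that point."""
    neighbors = []
    for x in sample_xs:
        # Sort sample indices by distance to x; ties go to the lower index.
        order = sorted(
            range(len(sample_xs)),
            key=lambda j: (abs(sample_xs[j] - x), j),
        )
        neighbors.append(order[:k])

    def fit(targets):
        return [sum(targets[j] for j in idx) / k for idx in neighbors]

    return fit


fit = knn_averager([0, 1, 2], k=2)
fitted_a = fit([0, 0, 0])
fitted_b = fit([0, 1, 1])

# Largest gap between the two target functions and between the two fits.
target_gap = max(abs(a - b) for a, b in zip([0, 0, 0], [0, 1, 1]))
fitted_gap = max(abs(a - b) for a, b in zip(fitted_a, fitted_b))
```

Every fitted value is a weighted average of target values with nonnegative weights summing to 1, so the gap between two fits cannot exceed the gap between their targets.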

Open Access•Book Chapter•10.1016/B978-1-55860-377-6.50040-2

Stable function approximation in dynamic programming

Geoffrey J. Gordon

- 01 Jan 1995

- pp 261-268

615

TL;DR: A proof of convergence is provided for a wide class of temporal difference methods involving function approximators such as k-nearest-neighbor, and it is shown experimentally that these methods can be useful.

Abstract: The success of reinforcement learning in practical problems depends on the ability to combine function approximation with temporal difference methods such as value iteration. Experiments in this area have produced mixed results; there have been both notable successes and notable disappointments. Theory has been scarce, mostly due to the difficulty of reasoning about function approximators that generalize beyond the observed data. We provide a proof of convergence for a wide class of temporal difference methods involving function approximators such as k-nearest-neighbor, and show experimentally that these methods can be useful. The proof is based on a view of function approximators as expansion or contraction mappings. In addition, we present a novel view of approximate value iteration: an approximate algorithm for one environment turns out to be an exact algorithm for a different environment.

Most frequently asked questions

1. What are the contributions mentioned in the paper "Stable function approximation in dynamic programming" ?

The authors provide a proof of convergence for a wide class of temporal difference methods involving function approximators such as k-nearest-neighbor, and show experimentally that these methods can be useful. In addition, the authors present a novel view of approximate value iteration: an approximate algorithm for one environment turns out to be an exact algorithm for a different environment. The author's current e-mail address is ggordon @ cs. This material is based on work supported under a National Science Foundation Graduate Research Fellowship, and by NSF grant number BES-9402439. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the National Science Foundation or the United States Government.

2. What is the parallel value iteration operator for a discounted Markov decision process?

The parallel value iteration operator for a discounted Markov decision process is a contraction in max norm, with contraction factor equal to the discount.
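
The contraction property can be checked numerically on a toy problem. The sketch below assumes a hypothetical three-state chain (not from the paper) and compares the max-norm distance between two arbitrary value estimates before and after one parallel backup.

```python
def max_norm(u, v):
    """Max-norm distance between two value vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


def backup(values, discount):
    """Parallel backup for a hypothetical 3-state chain.

    State 0 is absorbing with zero cost. From states 1 and 2 the agent
    either steps one state to the left (cost 1) or stays put (cost 2).
    """
    new_values = [discount * values[0]]
    for s in (1, 2):
        step_left = 1.0 + discount * values[s - 1]
        stay = 2.0 + discount * values[s]
        new_values.append(min(step_left, stay))
    return new_values


discount = 0.9
u = [0.0, 5.0, -3.0]
v = [1.0, -2.0, 4.0]

before = max_norm(u, v)
after = max_norm(backup(u, discount), backup(v, discount))
```

For these inputs `after` is no larger than `discount * before`, which is what the contraction factor predicts; a single check like this illustrates the property but does not prove it.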

3. What is the main reason for divergence?

The chief reason for divergence is exaggeration: the more a method can exaggerate small changes in its target function, the more often it diverges under temporal differencing.

4. What is the optimal value function for a nondiscounted Markov decision process?

If all policies in a nondiscounted Markov decision process are proper, then the parallel value iteration operator for that process is a contraction in some weighted max norm.

Figure 2: The mapping associated with linear regression when samples are taken at the points x = 0, 1, 2. In (a) we see a target value function (solid line) and its corresponding fitted value function (dotted line). In (b) we see another target function and another fitted function. The first target function has values y = 0, 0, 0 at the sample points; the second has values y = 0, 1, 1. Regression exaggerates the difference between the two functions: the largest difference between the two target functions at a sample point is 1 (at x = 1 and x = 2), but the largest difference between the two fitted functions at a sample point is 7/6 (at x = 2).
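
The arithmetic in this caption can be reproduced with an ordinary least-squares line fit. The sketch below uses exact fractions; the function name `fit_line` is a hypothetical helper, and the data are the two target functions described in the caption.

```python
from fractions import Fraction


def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns fitted values at xs."""
    n = len(xs)
    mean_x = Fraction(sum(xs), n)
    mean_y = Fraction(sum(ys), n)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x for x in xs]


xs = [0, 1, 2]
fitted_first = fit_line(xs, [0, 0, 0])
fitted_second = fit_line(xs, [0, 1, 1])

# Largest difference between the two fitted functions at a sample point.
largest_gap = max(abs(a - b) for a, b in zip(fitted_first, fitted_second))
```

The second fit has slope 1/2 and intercept 1/6, so its value at x = 2 is 7/6, larger than the maximum target difference of 1.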

Figure 1: Function approximation methods as mappings. In (a) we see the value function for a simple random walk on the positive real line. (On each transition, the agent has an equal probability of moving left or right by one step. State 0 is absorbing; transitions from other states have cost 1.) Applying a function approximator (in this case, fitting a spline with knots at the first five natural numbers) maps the value function in (a) to the value function in (b). Since the function approximator discards some information, its mapping can't be 1-to-1: in (c) we see a different value function which the approximator also maps to (b).

Figure 4: A nondiscounted deterministic Markov process, and an averager with which it is incompatible. The process is shown in (a); the goal is state 1, and all arc costs except at the goal are 1. In (b) we see an averager, represented as a Markov process: states 1 and 3 are unchanged, while V(2) is replaced by V(3). The derived Markov process is shown in (c); state 3 has been disconnected, so its value estimate will diverge.

Figure 5: An example of the construction of the derived Markov process. Part (a) shows a deterministic Markov process: its state space is the unit triangle, and on every step the agent moves a constant distance towards the origin. The value of each state is simply its distance from the origin, so the value function is nonlinear. For our function approximator, we will use linear interpolation on the three corners of the triangle. Part (b) shows a representative transition from the derived process: as before, the agent moves towards the goal, but then the averager moves the agent randomly to one of the three corners. On average, this scattering moves the agent back away from the goal, so steps in the derived process don't move the agent as far on average as they did in the original process. Part (c) shows the expected progress the agent makes on each step. The value function for the derived process is V(x, y) = x + y.

Figure 6: The puddle world. From top left: the cost of moving up, the optimal value function as seen by a 100 × 100 grid, the optimal value function as seen by bilinear interpolation on the corners of a 7 × 7 grid, and the difference between the two value functions.

Figure 3: A Markov process and a CMAC which are incompatible. Part (a) shows the process. Its goal is state 1. On each step, with probability 95%, the process follows a solid arrow, and with probability 5%, it follows a dashed arrow. All arc costs are zero. Part (b) shows the CMAC, which has 4 receptive fields each covering 3 nodes. If the CMAC starts out with all predictions equal to 1, approximate value iteration produces the series of target values 1, 77/60, (77/60)^2, ... for state 2.

Citations

•Book

Reinforcement Learning: An Introduction

Richard S. Sutton, +1 more

- 01 Jan 1998

TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

39.7K

Journal Article•10.1145/242224.242229

Machine learning

Thomas G. Dietterich

- 01 Dec 1996

- ACM Computing Surveys

TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.

14K

•Book

Dynamic Programming and Optimal Control

Dimitri P. Bertsekas

- 01 May 1995

TL;DR: The leading and most up-to-date textbook on the far-ranging algorithmic methodology of Dynamic Programming, which can be used for optimal control, Markovian decision problems, planning and sequential decision making under uncertainty, and discrete/combinatorial optimization.

12.9K

•Journal Article•10.1613/JAIR.301

Reinforcement learning: a survey

Leslie Pack Kaelbling, +2 more

- 01 Jan 1996

- Journal of Artificial Intelligence Resea...

TL;DR: Central issues of reinforcement learning are discussed, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state.

•Proceedings Article

Policy Gradient Methods for Reinforcement Learning with Function Approximation

Richard S. Sutton, +3 more

- 29 Nov 1999

TL;DR: This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.

7.1K

References

Journal Article•10.1145/242224.242229

Machine learning

Thomas G. Dietterich

- 01 Dec 1996

- ACM Computing Surveys

TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.

14K

•Journal Article•10.1007/BF00992698

Technical Note: Q-Learning

Chris Watkins, +1 more

- 01 May 1992

- Machine Learning

TL;DR: This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989), showing that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely.

12K

•Book

Parallel and Distributed Computation: Numerical Methods

Dimitri P. Bertsekas, +1 more

- 01 Jan 1989

TL;DR: This work discusses parallel and distributed architectures, complexity measures, and communication and synchronization issues, and it presents both Jacobi and Gauss-Seidel iterations, which serve as algorithms of reference for many of the computational approaches addressed later.

Learning from delayed rewards

Chris Watkins

- 01 Jan 1989

5.9K

•Journal Article•10.1023/A:1022633531479

Learning to Predict by the Methods of Temporal Differences

Richard S. Sutton

- 01 Aug 1988

- Machine Learning

TL;DR: This article introduces a class of incremental learning procedures specialized for prediction – that is, for using past experience with an incompletely known system to predict its future behavior – and proves their convergence and optimality for special cases and relation to supervised-learning methods.

5.2K

Related Papers (5)

Reinforcement Learning: An Introduction

Least-squares policy iteration

Learning to Predict by the Methods of Temporal Differences

Learning from delayed rewards

Markov Decision Processes: Discrete Stochastic Dynamic Programming