TL;DR: This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
Abstract: Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
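Illustration: the gradient form described in this abstract can be checked numerically in the simplest possible case. The sketch below is not the paper's algorithm. It assumes a one-state problem (a bandit) with a softmax policy, so the state distribution is trivial and the action values are a fixed vector `q`. All function names are hypothetical. It computes the gradient as the sum over actions of grad pi(a) * q(a), compares it with a finite-difference estimate, and then follows the gradient to a better policy.

```python
import math

def softmax(theta):
    """Softmax policy over actions, parameterised by one preference per action."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_reward(theta, q):
    """Expected reward of the softmax policy when action a pays q[a]."""
    return sum(p * r for p, r in zip(softmax(theta), q))

def policy_gradient(theta, q):
    """Gradient of expected reward as sum_a grad pi(a) * q[a].

    For a softmax policy, d pi(a) / d theta_k = pi(a) * (1[a == k] - pi(k)).
    """
    pi = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for k in range(n):
        for a in range(n):
            dpi = pi[a] * ((1.0 if a == k else 0.0) - pi[k])
            grad[k] += dpi * q[a]
    return grad

def finite_difference(theta, q, eps=1e-6):
    """Central-difference estimate of the same gradient, used as a check."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((expected_reward(up, q) - expected_reward(down, q)) / (2 * eps))
    return grad

def gradient_ascent(theta, q, step=0.5, iters=2000):
    """Repeatedly move the policy parameters along the policy gradient."""
    theta = list(theta)
    for _ in range(iters):
        g = policy_gradient(theta, q)
        theta = [t + step * gk for t, gk in zip(theta, g)]
    return theta

if __name__ == "__main__":
    theta0 = [0.1, -0.3, 0.5]
    q = [1.0, 0.0, 2.0]
    print(policy_gradient(theta0, q))
    print(finite_difference(theta0, q))
    print(softmax(gradient_ascent(theta0, q)))
```

Because the gradients of the action probabilities sum to zero, subtracting the same constant from every entry of `q` leaves `policy_gradient` unchanged, which is the bandit version of using an action-independent baseline.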
TL;DR: This text covers differential forms in Rn, line integrals, differentiable manifolds, integration on manifolds (Stokes' theorem and Poincare's lemma), the differential geometry of surfaces in R3, and the theorems of Gauss-Bonnet and Morse.
Abstract: 1. Differential Forms in Rn.- 2. Line Integrals.- 3. Differentiable Manifolds.- 4. Integration on Manifolds; Stokes Theorem and Poincare's Lemma.- 1. Integration of Differential Forms.- 2. Stokes Theorem.- 3. Poincare's Lemma.- 5. Differential Geometry of Surfaces.- 1. The Structure Equations of Rn.- 2. Surfaces in R3.- 3. Intrinsic Geometry of Surfaces.- 6. The Theorem of Gauss-Bonnet and the Theorem of Morse.- 1. The Theorem of Gauss-Bonnet.- 2. The Theorem of Morse.- References.
TL;DR: This paper shows how an action-dependent baseline can be used in the policy gradient theorem with function approximation, which Sutton et al. (2000) originally presented with action-independent baselines.
Abstract: We show how an action-dependent baseline can be used by the policy gradient theorem using function approximation, originally presented with action-independent baselines by (Sutton et al. 2000).
TL;DR: This text covers the algebra of differential forms, exterior differentiation, oriented manifolds (with boundary), push-forwards and pull-backs, integration of differential forms over oriented manifolds via pull-backs, and the generalized Stokes' theorem, including its relation to the fundamental theorem of calculus and its analog for line integrals.
Abstract: Differential Forms; The Algebra of Differential Forms; Exterior Differentiation; The Fundamental Correspondence; Oriented Manifolds; The Notion of a Manifold (With Boundary); Orientation; Differential Forms Revisited; 1-Forms; k-Forms; Push-Forwards and Pull-Backs; Integration of Differential Forms over Oriented Manifolds; The Integral of a 0-Form over a Point (Evaluation); The Integral of a 1-Form over a Curve (Line Integrals); The Integral of a 2-Form over a Surface (Flux Integrals); The Integral of a 3-Form over a Solid Body (Volume Integrals); Integration via Pull-Backs; The Generalized Stokes' Theorem; Statement of the Theorem; The Fundamental Theorem of Calculus and Its Analog for Line Integrals; Green's and Stokes' Theorems; Gauss's Theorem; Proof of the GST; For the Advanced Reader; Differential Forms in R^N and Poincare's Lemma; Manifolds, Tangent Vectors, and Orientations; The Basics of De Rham Cohomology; Appendix: Answers to Exercises; Subject Index
TL;DR: This paper presents the first off-policy policy gradient theorem, derived using emphatic weightings, and proposes the Actor Critic with Emphatic weightings (ACE) algorithm, which approximates the simplified gradient the theorem provides. In a simple counterexample, OffPAC and DPG converge to the wrong solution whereas ACE finds the optimal one.
Abstract: Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy setting, due to the existence of the policy gradient theorem which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings. We develop a new actor-critic algorithm—called Actor Critic with Emphatic weightings (ACE)—that approximates the simplified gradients provided by the theorem. We demonstrate in a simple counterexample that previous off-policy policy gradient methods—particularly OffPAC and DPG—converge to the wrong solution whereas ACE finds the optimal solution.