Scispace (Formerly Typeset)
Showing papers on "Reinforcement learning published in 2000"
Journal Article•10.1016/S0925-2312(00)00324-6•
Reinforcement Learning: An Introduction: R.S. Sutton, A.G. Barto, MIT Press, Cambridge, MA 1998, 322 pp. ISBN 0-262-19398-1

Jeffrey D. Johnson1, Jinghong Li1, Zengshi Chen1•
University of Toledo1
01 Nov 2000-Neurocomputing

8,140 citations

Journal Article•10.1613/JAIR.639•
Hierarchical reinforcement learning with the MAXQ value function decomposition

Thomas G. Dietterich1•
Oregon State University1
01 Aug 2000-Journal of Artificial Intelligence Research
TL;DR: The paper presents an online model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally-optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction.
Abstract: This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The decomposition, known as the MAXQ decomposition, has both a procedural semantics--as a subroutine hierarchy--and a declarative semantics--as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. It is based on the assumption that the programmer can identify useful subgoals and define subtasks that achieve these subgoals. By defining such subgoals, the programmer constrains the set of policies that need to be considered during reinforcement learning. The MAXQ value function decomposition can represent the value function of any policy that is consistent with the given hierarchy. The decomposition also creates opportunities to exploit state abstractions, so that individual MDPs within the hierarchy can ignore large parts of the state space. This is important for the practical application of the method. This paper defines the MAXQ hierarchy, proves formal results on its representational power, and establishes five conditions for the safe use of state abstractions. The paper presents an online model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally-optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction. The paper evaluates the MAXQ representation and MAXQ-Q through a series of experiments in three domains and shows experimentally that MAXQ-Q (with state abstractions) converges to a recursively optimal policy much faster than flat Q learning. 
The fact that MAXQ learns a representation of the value function has an important benefit: it makes it possible to compute and execute an improved, non-hierarchical policy via a procedure similar to the policy improvement step of policy iteration. The paper demonstrates the effectiveness of this nonhierarchical execution experimentally. Finally, the paper concludes with a comparison to related work and a discussion of the design tradeoffs in hierarchical reinforcement learning.
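Illustrative sketch (not part of the listed abstract): the additive decomposition described above is commonly written as Q(i, s, a) = V(a, s) + C(i, s, a), where C is a completion term for finishing parent task i after child a ends, and V of a composite task is the maximum of its Q values. The Python below evaluates that recursion on a made-up hierarchy. The task names, tables, and numbers are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical hierarchy: "root" can invoke the subtask "sub" or the primitive "b";
# "sub" can invoke the primitive "a". Tasks absent from `children` are primitive.
children: Dict[str, List[str]] = {"root": ["sub", "b"], "sub": ["a"]}

# Made-up tables for a single state 0 (assumed values, not learned).
v_primitive: Dict[Tuple[str, int], float] = {("a", 0): 1.0, ("b", 0): 2.0}
completion: Dict[Tuple[str, int, str], float] = {
    ("sub", 0, "a"): 0.5,
    ("root", 0, "sub"): 3.0,
    ("root", 0, "b"): 1.0,
}


def value(task: str, state: int) -> float:
    """V(task, state): table lookup for primitives, max over children otherwise."""
    if task not in children:
        return v_primitive[(task, state)]
    return max(q_value(task, state, child) for child in children[task])


def q_value(parent: str, state: int, child: str) -> float:
    """Q(parent, state, child) = V(child, state) + C(parent, state, child)."""
    return value(child, state) + completion[(parent, state, child)]


root_value = value("root", 0)
```

The recursion bottoms out at primitive actions, so each subtask only needs its own completion table; this is the property that lets subtasks ignore parts of the state in the abstract's description.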

1,837 citations

Journal Article•10.1162/089976600300015961•
Reinforcement Learning in Continuous Time and Space

Kenji Doya
01 Jan 2000-Neural Computation
TL;DR: A reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems is presented and algorithms for estimating value functions and improving policies with the use of function approximators are derived.
Abstract: This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(lambda) algorithms are shown. For policy improvement, two methods, a continuous actor-critic method and a value-gradient-based greedy policy, are formulated. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived. The advantage updating, a model-free algorithm derived previously, is also formulated in the HJB-based framework. The performance of the proposed algorithms is first tested in a nonlinear control task of swinging a pendulum up with limited torque. It is shown in the simulations that (1) the task is accomplished by the continuous actor-critic method in a number of trials several times fewer than by the conventional discrete actor-critic method; (2) among the continuous policy update methods, the value-gradient-based policy with a known or learned dynamic model performs several times better than the actor-critic method; and (3) a value function update using exponential eligibility traces is more efficient and stable than that based on Euler approximation. The algorithms are then tested in a higher-dimensional task: cartpole swing-up. This task is accomplished in several hundred trials using the value-gradient-based policy with a learned dynamic model.
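Illustrative sketch (not part of the listed abstract): assuming the value function is the integral of future reward discounted with time constant tau, the continuous-time TD error takes the form delta(t) = r(t) - V(t)/tau + dV/dt. The Python below approximates dV/dt with a backward Euler difference. The function and parameter names are hypothetical.

```python
def continuous_td_error(reward: float, v_now: float, v_prev: float,
                        dt: float, tau: float) -> float:
    """Backward Euler estimate of delta(t) = r(t) - V(t)/tau + dV/dt.

    v_now is V at time t and v_prev is V at time t - dt (assumed inputs).
    """
    dv_dt = (v_now - v_prev) / dt  # backward difference for the time derivative
    return reward - v_now / tau + dv_dt


# Sanity check under the stated assumption: with a constant reward r the
# discounted integral is V = tau * r, so the error should vanish.
constant_reward = 2.0
tau = 0.5
steady_value = tau * constant_reward
steady_error = continuous_td_error(constant_reward, steady_value, steady_value,
                                   dt=0.01, tau=tau)
```

A positive error means the current value estimate was too low along the observed trajectory; a negative error means it was too high.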

1,147 citations

Proceedings Article•
Eligibility Traces for Off-Policy Policy Evaluation

Doina Precup1, Richard S. Sutton2, Satinder Singh2•
University of Massachusetts Amherst1, AT&T2
29 Jun 2000
TL;DR: This paper considers the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method, and analyzes and compares this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling.
Abstract: Eligibility traces have been shown to speed reinforcement learning, to make it more robust to hidden states, and to provide a link between Monte Carlo and temporal-difference methods. Here we generalize eligibility traces to off-policy learning, in which one learns about a policy different from the policy that generates the data. Off-policy methods can greatly multiply learning, as many policies can be learned about from the same data stream, and have been identified as particularly useful for learning about subgoals and temporally extended macro-actions. In this paper we consider the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method. We analyze and compare this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling. Our main results are 1) to establish the consistency and bias properties of the new methods and 2) to empirically rank the new methods, showing improvement over one-step and Monte Carlo methods. Our results are restricted to model-free, table-lookup methods and to offline updating (at the end of each episode) although several of the algorithms could be applied more generally.
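Illustrative sketch (not part of the listed abstract): the Python below shows plain episode-level importance sampling, the classical technique the abstract relates its algorithms to. It is not one of the paper's eligibility trace algorithms. Each episode is assumed to be a list of (target probability, behavior probability) pairs for the actions taken, plus the observed return; all names and data are hypothetical.

```python
from typing import List, Sequence, Tuple

Step = Tuple[float, float]            # (target-policy prob, behavior-policy prob)
Episode = Tuple[Sequence[Step], float]  # (steps, observed return)


def importance_ratio(steps: Sequence[Step]) -> float:
    """Product over the episode of target probability / behavior probability."""
    ratio = 1.0
    for target_p, behavior_p in steps:
        ratio *= target_p / behavior_p
    return ratio


def ordinary_is(episodes: List[Episode]) -> float:
    """Average of ratio-weighted returns; unbiased but can have high variance."""
    return sum(importance_ratio(s) * g for s, g in episodes) / len(episodes)


def weighted_is(episodes: List[Episode]) -> float:
    """Ratio-weighted returns normalized by the sum of ratios; biased, lower variance."""
    ratios = [importance_ratio(s) for s, _ in episodes]
    total = sum(ratios)
    if total == 0.0:
        return 0.0
    return sum(r * g for r, (_, g) in zip(ratios, episodes)) / total


# Two made-up one-step episodes collected under the behavior policy.
episodes: List[Episode] = [([(1.0, 0.5)], 1.0), ([(0.5, 1.0)], 3.0)]
ordinary_estimate = ordinary_is(episodes)
weighted_estimate = weighted_is(episodes)
```

The two estimators differ only in the normalizer, which is where the consistency and bias trade-off mentioned in the abstract shows up in this simple setting.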

815 citations

Journal Article•10.1023/A:1007678930559•
Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms

Satinder Singh1, Tommi S. Jaakkola2, Michael L. Littman3, Csaba Szepesvári•
AT&T Labs1, Massachusetts Institute of Technology2, Duke University3
01 Mar 2000-Machine Learning
TL;DR: This paper examines the convergence of single-step on-policy RL algorithms for control with both decaying exploration and persistent exploration and provides examples of exploration strategies that result in convergence to both optimal values and optimal policies.
Abstract: An important application of reinforcement learning (RL) is to finite-state control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.
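Illustrative sketch (not part of the listed abstract): a single-step on-policy update of the Sarsa kind, paired with an epsilon-greedy rule whose exploration rate decays with the visit count. The schedule epsilon = c / visits is an assumed example of decaying exploration; the paper's exact conditions are not reproduced here, and all names are hypothetical.

```python
import random
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def decaying_epsilon(visits: int, c: float = 0.5) -> float:
    """Assumed schedule: exploration probability shrinks as a state is revisited."""
    return c / max(1, visits)


def epsilon_greedy(q: QTable, state: Hashable, actions: Sequence[Hashable],
                   epsilon: float, rng: random.Random) -> Hashable:
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    return max(actions, key=lambda a: q[(state, a)])


def sarsa_update(q: QTable, s, a, reward: float, s_next, a_next,
                 alpha: float, gamma: float) -> None:
    """On-policy update: the target uses the action actually chosen in s_next."""
    target = reward + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])


q: QTable = defaultdict(float)
sarsa_update(q, 0, "x", 1.0, 1, "y", alpha=0.5, gamma=0.9)
```

Because the target depends on the action the exploring policy actually takes next, exploration and learning cannot be separated, which is the point the abstract makes about on-policy methods.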

799 citations

Journal Article•10.1109/89.817450•
A stochastic model of human-machine interaction for learning dialog strategies

Esther Levin1, Roberto Pieraccini2, Wieland Eckert2•
Bell Labs1, AT&T2
01 Jan 2000-IEEE Transactions on Speech and Audio Processing
TL;DR: The experimental results show that it is indeed possible to find a simple criterion, a state space representation, and a simulated user parameterization in order to automatically learn a relatively complex dialog behavior, similar to one that was heuristically designed by several research groups.
Abstract: We propose a quantitative model for dialog systems that can be used for learning the dialog strategy. We claim that the problem of dialog design can be formalized as an optimization problem with an objective function reflecting different dialog dimensions relevant for a given application. We also show that any dialog system can be formally described as a sequential decision process in terms of its state space, action set, and strategy. With additional assumptions about the state transition probabilities and cost assignment, a dialog system can be mapped to a stochastic model known as Markov decision process (MDP). A variety of data driven algorithms for finding the optimal strategy (i.e., the one that optimizes the criterion) is available within the MDP framework, based on reinforcement learning. For an effective use of the available training data we propose a combination of supervised and reinforcement learning: the supervised learning is used to estimate a model of the user, i.e., the MDP parameters that quantify the user's behavior. Then a reinforcement learning algorithm is used to estimate the optimal strategy while the system interacts with the simulated user. This approach is tested for learning the strategy in an air travel information system (ATIS) task. The experimental results we present in this paper show that it is indeed possible to find a simple criterion, a state space representation, and a simulated user parameterization in order to automatically learn a relatively complex dialog behavior, similar to one that was heuristically designed by several research groups.

698 citations

Journal Article•10.1137/S0363012997331639•
The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning

Vivek S. Borkar, Sean P. Meyn1•
University of Illinois at Urbana–Champaign1
01 Jan 2000-SIAM Journal on Control and Optimization
TL;DR: It is shown here that stability of the stochastic approximation algorithm is implied by the asymptotic stability of the origin for an associated ODE, which in turn implies convergence of the algorithm.
Abstract: It is shown here that stability of the stochastic approximation algorithm is implied by the asymptotic stability of the origin for an associated ODE. This in turn implies convergence of the algorithm. Several specific classes of algorithms are considered as applications. It is found that the results provide (i) a simpler derivation of known results for reinforcement learning algorithms; (ii) a proof for the first time that a class of asynchronous stochastic approximation algorithms are convergent without using any a priori assumption of stability; (iii) a proof for the first time that asynchronous adaptive critic and Q-learning algorithms are convergent for the average cost optimal control problem.
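Illustrative sketch (not part of the listed abstract): the general idea of relating a noisy iteration x_{n+1} = x_n + a_n (h(x_n) + noise) to the ODE dx/dt = h(x) can be seen on a toy problem. The Python below uses the assumed mean field h(x) = 1 - x, step sizes a_n = 1/n, and Gaussian noise; none of these choices come from the paper, and the names are hypothetical.

```python
import random
from typing import Callable


def stochastic_approximation(h: Callable[[float], float], x0: float,
                             steps: int, noise_std: float, seed: int = 0) -> float:
    """Noisy iteration with diminishing step sizes a_n = 1/n (an assumed schedule)."""
    rng = random.Random(seed)
    x = x0
    for n in range(1, steps + 1):
        step_size = 1.0 / n
        x += step_size * (h(x) + rng.gauss(0.0, noise_std))
    return x


def euler_ode(h: Callable[[float], float], x0: float,
              t_end: float, dt: float = 0.01) -> float:
    """Forward Euler integration of the associated ODE dx/dt = h(x)."""
    x = x0
    for _ in range(round(t_end / dt)):
        x += dt * h(x)
    return x


def mean_field(x: float) -> float:
    """Toy drift with a single stable equilibrium at x = 1."""
    return 1.0 - x


sa_limit = stochastic_approximation(mean_field, x0=5.0, steps=20000, noise_std=1.0)
ode_limit = euler_ode(mean_field, x0=5.0, t_end=10.0)
```

In this toy case both the noisy iterates and the ODE trajectory settle near the same equilibrium, which is the kind of correspondence the abstract's stability argument relies on.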

627 citations

Proceedings Article•
An Algorithm for Distributed Reinforcement Learning in Cooperative Multi-Agent Systems

Martin Lauer, Martin Riedmiller
29 Jun 2000

523 citations

Book Chapter•10.1007/3-540-44640-0_15•
An Algorithmic Description of ACS2

Martin V. Butz1, Wolfgang Stolzmann2•
University of Würzburg1, Daimler AG2
15 Sep 2000
TL;DR: This chapter provides an overview over the ACS system including all parameters as well as framework, structure, and environmental interaction, and a precise description of all algorithms in ACS2 is provided.
Abstract: The various modifications and extensions of the anticipatory classifier system (ACS) recently led to the introduction of ACS2, an enhanced and modified version of ACS. This chapter provides an overview over the system including all parameters as well as framework, structure, and environmental interaction. Moreover, a precise description of all algorithms in ACS2 is provided.

462 citations

Journal Article•10.1162/0899766053011528•
Robust Reinforcement Learning

Jun Morimoto, Kenji Doya1•
Okinawa Institute of Science and Technology1
1 Jan 2000
TL;DR: A new reinforcement learning paradigm that explicitly takes into account input disturbance as well as modeling errors is proposed, which is called robust reinforcement learning (RRL) and tested on the control task of an inverted pendulum.
Abstract: This letter proposes a new reinforcement learning (RL) paradigm that explicitly takes into account input disturbance as well as modeling errors. The use of environmental models in RL is quite popular for both offline learning using simulations and for online action planning. However, the difference between the model and the real environment can lead to unpredictable, and often unwanted, results. Based on the theory of H∞ control, we consider a differential game in which a "disturbing" agent tries to make the worst possible disturbance while a "control" agent tries to make the best control input. The problem is formulated as finding a min-max solution of a value function that takes into account the amount of the reward and the norm of the disturbance. We derive online learning algorithms for estimating the value function and for calculating the worst disturbance and the best control in reference to the value function. We tested the paradigm, which we call robust reinforcement learning (RRL), on the control task of an inverted pendulum. In the linear domain, the policy and the value function learned by online algorithms coincided with those derived analytically by the linear H∞ control theory. For a fully nonlinear swing-up task, RRL achieved robust performance with changes in the pendulum weight and friction, while a standard reinforcement learning algorithm could not deal with these changes. We also applied RRL to the cart-pole swing-up task, and a robust swing-up policy was acquired.

400 citations

Temporal abstraction in reinforcement learning

Doina Precup, Richard S. Sutton
1 Jan 2000
TL;DR: A general framework for prediction, control and learning at multiple temporal scales, and the way in which multi-time models can be used to produce plans of behavior very quickly, using classical dynamic programming or reinforcement learning techniques is developed.
Abstract: Decision making usually involves choosing among different courses of action over a broad range of time scales. For instance, a person planning a trip to a distant location makes high-level decisions regarding what means of transportation to use, but also chooses low-level actions, such as the movements for getting into a car. The problem of picking an appropriate time scale for reasoning and learning has been explored in artificial intelligence, control theory and robotics. In this dissertation we develop a framework that allows novel solutions to this problem, in the context of Markov Decision Processes (MDPs) and reinforcement learning. In this dissertation, we present a general framework for prediction, control and learning at multiple temporal scales. In this framework, temporally extended actions are represented by a way of behaving (a policy) together with a termination condition. An action represented in this way is called an option. Options can be easily incorporated in MDPs, allowing an agent to use existing controllers, heuristics for picking actions, or learned courses of action. The effects of behaving according to an option can be predicted using multi-time models, learned by interacting with the environment. In this dissertation we develop multi-time models, and we illustrate the way in which they can be used to produce plans of behavior very quickly, using classical dynamic programming or reinforcement learning techniques. The most interesting feature of our framework is that it allows an agent to work simultaneously with high-level and low-level temporal representations. The interplay of these levels can be exploited in order to learn and plan more efficiently and more accurately. We develop new algorithms that take advantage of this structure to improve the quality of plans, and to learn in parallel about the effects of many different options.
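Illustrative sketch (not part of the listed abstract): an option as described above pairs a way of behaving (a policy) with a termination condition. The Python below represents that pair as a small data class and runs it in a made-up one-dimensional corridor. The environment, the deterministic termination test, and all names are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Option:
    policy: Callable[[int], int]       # maps a state to a primitive action
    terminates: Callable[[int], bool]  # True when the option should stop


def corridor_step(state: int, action: int) -> Tuple[int, float]:
    """Hypothetical environment: move along a line, reward 1 on reaching cell 3."""
    next_state = state + action
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward


def run_option(option: Option, state: int, gamma: float = 0.9,
               max_steps: int = 100) -> Tuple[int, float, int]:
    """Follow the option until it terminates; return the final state,
    the discounted reward accumulated on the way, and the duration in steps."""
    total, discount, steps = 0.0, 1.0, 0
    while steps < max_steps and not option.terminates(state):
        state, reward = corridor_step(state, option.policy(state))
        total += discount * reward
        discount *= gamma
        steps += 1
    return state, total, steps


go_right = Option(policy=lambda s: 1, terminates=lambda s: s >= 3)
final_state, discounted_reward, duration = run_option(go_right, 0)
```

The triple returned by `run_option` (where the option ends, how much discounted reward it collects, how long it takes) is the kind of quantity a multi-time model would aim to predict without executing the option step by step.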
Journal Article•10.1613/JAIR.713•
An application of reinforcement learning to dialogue strategy selection in a spoken dialogue system for email

Marilyn A. Walker1•
AT&T1
01 Feb 2000-Journal of Artificial Intelligence Research
TL;DR: A novel method by which a spoken dialogue system can learn to choose an optimal dialogue strategy from its experience interacting with human users, based on a combination of reinforcement learning and performance modeling of spoken dialogue systems.
Abstract: This paper describes a novel method by which a spoken dialogue system can learn to choose an optimal dialogue strategy from its experience interacting with human users. The method is based on a combination of reinforcement learning and performance modeling of spoken dialogue systems. The reinforcement learning component applies Q-learning (Watkins, 1989), while the performance modeling component applies the PARADISE evaluation framework (Walker et al., 1997) to learn the performance function (reward) used in reinforcement learning. We illustrate the method with a spoken dialogue system named elvis (EmaiL Voice Interactive System), that supports access to email over the phone. We conduct a set of experiments for training an optimal dialogue strategy on a corpus of 219 dialogues in which human users interact with elvis over the phone. We then test that strategy on a corpus of 18 dialogues. We show that elvis can learn to optimize its strategy selection for agent initiative, for reading messages, and for summarizing email folders.
Proceedings Article•
Practical Reinforcement Learning in Continuous Spaces

William D. Smart1, Leslie Pack Kaelbling2•
Brown University1, Massachusetts Institute of Technology2
29 Jun 2000
TL;DR: This paper introduces an algorithm that safely approximates the value function for continuous state control tasks, and that learns quickly from a small amount of data, and gives experimental results using this algorithm to learn policies for both a simulated task and also for a real robot, operating in an unaltered environment.
Abstract: Dynamic control tasks are good candidates for the application of reinforcement learning techniques. However, many of these tasks inherently have continuous state or action variables. This can cause problems for traditional reinforcement learning algorithms which assume discrete states and actions. In this paper, we introduce an algorithm that safely approximates the value function for continuous state control tasks, and that learns quickly from a small amount of data. We give experimental results using this algorithm to learn policies for both a simulated task and also for a real robot, operating in an unaltered environment. The algorithm works well in a traditional learning setting, and demonstrates extremely good learning when bootstrapped with a small amount of human-provided data.
Proceedings Article•
Acquisition of Stand-up Behavior by a Real Robot using Hierarchical Reinforcement Learning

Jun Morimoto1, Kenji Doya1•
Nara Institute of Science and Technology1
29 Jun 2000
TL;DR: In this paper, a hierarchical reinforcement learning architecture is proposed to learn a discrete sequence of sub-goals in a low-dimensional state space for achieving the main goal of the task.
Abstract: In this paper, we propose a hierarchical reinforcement learning architecture that realizes practical learning speed in real hardware control tasks. In order to enable learning in a practical number of trials, we introduce a low-dimensional representation of the state of the robot for higher-level planning. The upper level learns a discrete sequence of sub-goals in a low-dimensional state space for achieving the main goal of the task. The lower-level modules learn local trajectories in the original high-dimensional state space to achieve the sub-goal specified by the upper level. We applied the hierarchical architecture to a three-link, two-joint robot for the task of learning to stand up by trial and error. The upper-level learning was implemented by Q-learning, while the lower-level learning was implemented by a continuous actor–critic method. The robot successfully learned to stand up within 750 trials in simulation and then in an additional 170 trials using real hardware. The effects of the setting of the search steps in the upper level and the use of a supplementary reward for achieving sub-goals are also tested in simulation. © 2001 Elsevier Science B.V. All rights reserved.
Journal Article•10.1023/A:1007653114902•
Adaptive Retrieval Agents: Internalizing Local Context and Scaling up to the Web

Filippo Menczer1, Richard K. Belew2•
University of Iowa1, University of California, San Diego2
01 May 2000-Machine Learning
TL;DR: This paper discusses a novel distributed adaptive algorithm and representation used to construct populations of adaptive Web agents that browse networked information environments on-line in search of pages relevant to the user, by traversing hyperlinks in an autonomous and intelligent fashion.
Abstract: This paper discusses a novel distributed adaptive algorithm and representation used to construct populations of adaptive Web agents. These InfoSpiders browse networked information environments on-line in search of pages relevant to the user, by traversing hyperlinks in an autonomous and intelligent fashion. Each agent adapts to the spatial and temporal regularities of its local context thanks to a combination of machine learning techniques inspired by ecological models: evolutionary adaptation with local selection, reinforcement learning and selective query expansion by internalization of environmental signals, and optional relevance feedback. We evaluate the feasibility and performance of these methods in three domains: a general class of artificial graph environments, a controlled subset of the Web, and (preliminarily) the full Web. Our results suggest that InfoSpiders could take advantage of the starting points provided by search engines, based on global word statistics, and then use linkage topology to guide their search on-line. We show how this approach can complement the current state of the art, especially with respect to the scalability challenge.
An Analysis of Stochastic Game Theory for Multiagent Reinforcement Learning

Michael Bowling, Manuela Veloso
1 Oct 2000
TL;DR: This paper contributes a comprehensive presentation of the relevant techniques for solving stochastic games from both the game theory community and reinforcement learning communities, and examines the assumptions and limitations of these algorithms.
Abstract: Learning behaviors in a multiagent environment are crucial for developing and adapting multiagent systems. Reinforcement learning techniques have addressed this problem for a single agent acting in a stationary environment, which is modeled as a Markov decision process (MDP). But, multiagent environments are inherently non-stationary since the other agents are free to change their behavior as they also learn and adapt. Stochastic games, first studied in the game theory community, are a natural extension of MDPs to include multiple agents. In this paper we contribute a comprehensive presentation of the relevant techniques for solving stochastic games from both the game theory community and reinforcement learning communities. We examine the assumptions and limitations of these algorithms, and identify similarities between these algorithms, single agent reinforcement learners, and basic game theory techniques.
Proceedings Article•
Algorithm Selection using Reinforcement Learning

Michail G. Lagoudakis1, Michael L. Littman2•
Duke University1, AT&T2
29 Jun 2000
TL;DR: A kind of MDP that models the algorithm selection problem by allowing multiple state transitions is introduced, and the well known Q-learning algorithm is adapted for this case in a way that combines both Monte-Carlo and Temporal Difference methods.
Abstract: Many computational problems can be solved by multiple algorithms, with different algorithms fastest for different problem sizes, input distributions, and hardware characteristics. We consider the problem of algorithm selection: dynamically choose an algorithm to attack an instance of a problem with the goal of minimizing the overall execution time. We formulate the problem as a kind of Markov decision process (MDP), and use ideas from reinforcement learning to solve it. This paper introduces a kind of MDP that models the algorithm selection problem by allowing multiple state transitions. The well known Q-learning algorithm is adapted for this case in a way that combines both Monte-Carlo and Temporal Difference methods. Also, this work uses, and extends in a way to control problems, the Least-Squares Temporal Difference algorithm (LSTD(0)) of Boyan. The experimental study focuses on the classic problems of order statistic selection and sorting. The encouraging results reveal the potential of applying learning methods to traditional computational problems.
Journal Article•10.1098/RSTA.2000.0593•
Probabilistic methods in spoken–dialogue systems

Steve Young1•
University of Cambridge1
15 Apr 2000-Philosophical transactions - Royal Society. Mathematical, physical and engineering sciences
TL;DR: A probabilistic framework for modelling spoken–dialogue systems on the assumption that the overall system behaviour can be represented as a Markov decision process is presented and the optimization of dialogue–management strategy using reinforcement learning is reviewed.
Abstract: This paper presents a probabilistic framework for modelling spoken–dialogue systems. On the assumption that the overall system behaviour can be represented as a Markov decision process, the optimization of dialogue–management strategy using reinforcement learning is reviewed. Examples of learning behaviour are presented for both dynamic programming and sampling methods, but the latter are preferred. The paper concludes by noting the importance of user simulation models for the practical application of these techniques and the need for developing methods of mapping system features in order to achieve sufficiently compact state spaces.
Proceedings Article•10.1109/ISCAS.2000.856049•
Direct gradient-based reinforcement learning

Jonathan Baxter1, Peter L. Bartlett1•
Australian National University1
28 May 2000
TL;DR: An algorithm for computing approximations to the gradient of the average reward from a single sample path of a controlled partially observable Markov decision process is presented and it is proved that the algorithm converges with probability 1.
Abstract: Many control, scheduling, planning and game-playing tasks can be formulated as reinforcement learning problems, in which an agent chooses actions to take in some environment, aiming to maximize a reward function. We present an algorithm for computing approximations to the gradient of the average reward from a single sample path of a controlled partially observable Markov decision process. We show that the accuracy of these approximations depends on the relationship between a time constant used by the algorithm and the mixing time of the Markov chain, and that the error can be made arbitrarily small by setting the time constant suitably large. We prove that the algorithm converges with probability 1.
Journal Article•10.1016/S0165-1889(99)00015-9•
Learning by doing and the value of optimal experimentation

Volker Wieland1•
Federal Reserve System1
01 Apr 2000-Journal of Economic Dynamics and Control
TL;DR: In this paper, the optimal policy for a learning by doing problem is characterized using numerical methods and shown to have a substantial degree of experimentation under a wide range of initial beliefs about the unknown parameters.
Journal Article•10.1109/49.824797•
Call admission control and routing in integrated services networks using neuro-dynamic programming

Peter Marbach1, O. Mihatsch, John N. Tsitsiklis•
Massachusetts Institute of Technology1
01 Feb 2000-IEEE Journal on Selected Areas in Communications
TL;DR: Methods of neuro-dynamic programming [reinforcement learning (RL)], together with a decomposition approach, are used to construct dynamic (state-dependent) call admission control and routing policies based on state-dependent link costs.
Abstract: We consider the problem of call admission control (CAC) and routing in an integrated services network that handles several classes of calls of different value and with different resource requirements. The problem of maximizing the average value of admitted calls per unit time (or of revenue maximization) is naturally formulated as a dynamic programming problem, but is too complex to allow for an exact solution. We use methods of neuro-dynamic programming (NDP) [reinforcement learning (RL)], together with a decomposition approach, to construct dynamic (state-dependent) call admission control and routing policies. These policies are based on state-dependent link costs, and a simulation-based learning method is employed to tune the parameters that define these link costs. A broad set of experiments shows the robustness of our policy and compares its performance with a commonly used heuristic.
Proceedings Article•
Programmable Reinforcement Learning Agents

David Andre1, Stuart Russell1•
University of California, Berkeley1
1 Jan 2000
TL;DR: Together, the methods presented in this work comprise a system for agent design that allows the programmer to specify what they know, hint at what they suspect using soft shaping, and leave unspecified that which they don't know; the system then optimally completes the program through experience and takes advantage of the hierarchical structure of the specified program to speed learning.
Abstract: This dissertation examines the use of partial programming as a means of designing agents for large Markov Decision Problems. In this approach, a programmer specifies only that which they know to be correct and the system then learns the rest from experience using reinforcement learning. In contrast to previous low-level languages for partial programming, this dissertation presents ALisp, a Lisp-based high-level partial programming language. ALisp allows the programmer to constrain the policies considered by a learning process and to express his or her prior knowledge in a concise manner. Optimally completing a partial ALisp program is shown to be equivalent to solving a Semi-Markov Decision Problem (SMDP). Under a finite memory-use condition, online learning algorithms for ALisp are proved to converge to an optimal solution of the SMDP and thus to an optimal completion of the partial program. This dissertation then presents methods for exploiting the modularity allows an agent to ignore aspects of its current state that are irrelevant to its current decision, and therefore speeds up reinforcement learning. By decomposing representations of the value of actions along subroutine boundaries, optimality, i.e., optimality among all policies consistent with the partial program. These methods are demonstrated on two simulated taxi tasks. Function approximation, a method for representing the value of actions, allows reinforcement learning to be applied to problems where exact methods are intractable. Soft shaping is a method for guiding an agent toward a solution without constraining the search space. Both can be integrated with ALisp. ALisp with function approximation and reward shaping is successfully applied on a difficult continuous variant of the simulated taxi task. 
Together, the methods presented in this work comprise a system for agent design that allows the programmer to specify what they know, hint at what they suspect using soft shaping, and leave unspecified that which they don't know; the system then optimally completes the program through experience and takes advantage of the hierarchical structure of the specified program to speed learning.
Proceedings Article•
Advisor: a machine-learning architecture for intelligent tutor construction

[...]

Joseph E. Beck, Beverly Park Woolf, Carole R. Beal
30 Jul 2000
TL;DR: ADVISOR, a two-agent machine learning architecture for intelligent tutoring systems (ITS), is constructed to centralize the reasoning of an ITS into a single component to allow customization of teaching goals and to simplify improving the ITS.
Abstract: We have constructed ADVISOR, a two-agent machine learning architecture for intelligent tutoring systems (ITS). The purpose of this architecture is to centralize the reasoning of an ITS into a single component to allow customization of teaching goals and to simplify improving the ITS. The first agent is responsible for learning a model of how students perform using the tutor in a variety of contexts. The second agent is provided this model of student behavior and a goal specifying the desired educational objective. Reinforcement learning is used by this agent to derive a teaching policy that meets the specified educational goal. Component evaluation studies show each agent performs adequately in isolation. We have also conducted an evaluation with actual students of the complete architecture. Results show ADVISOR was successful in learning a teaching policy that met the educational objective provided. Although this set of machine learning agents has been integrated with a specific intelligent tutor, the general technique could be applied to a broad class of ITS.
An Analysis of Actor-Critic Algorithms Using Eligibility Traces : Reinforcement Learning with Imperfect Value Functions

[...]

Hajime Kimura, Shigenobu Kobayashi
1 Mar 2000
TL;DR: The results show that the algorithm is an extension of Williams' REINFORCE algorithms for infinite horizon reinforcement tasks, and that the critic provides an appropriate reinforcement baseline for the actor.
Abstract: We present an analysis of actor/critic algorithms, in which the actor updates its policy using eligibility traces of the policy parameters. Most of the theoretical results for eligibility traces have been for only critic's value iteration algorithms. This paper investigates what the actor's eligibility trace does. The results show that the algorithm is an extension of Williams' REINFORCE algorithms for infinite horizon reinforcement tasks, and that the critic provides an appropriate reinforcement baseline for the actor. Thanks to the actor's eligibility trace, the actor improves its policy by using a gradient of actual return, not by using a gradient of the estimated return in the critic. It enables the agent to learn a fairly good policy under the condition that the approximated value function in the critic is hopelessly inaccurate for conventional actor/critic algorithms. Also, if an accurate value function is estimated by the critic, the actor's learning is dramatically accelerated in our test cases. The behavior of the algorithm is demonstrated through simulations of a linear quadratic control problem and a pole balancing problem.
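The abstract does not give the update rules, so the following is only a generic sketch of an actor/critic step in which the actor keeps an eligibility trace over its policy parameters. The tabular setting, the two-action logistic policy, and every name and default value (`actor_critic_step`, `alpha`, `alpha_v`, `gamma`, `lam`) are assumptions made for illustration, not the authors' algorithm.

```python
import math


def actor_critic_step(theta, trace, v, state, action, reward, next_state,
                      alpha=0.1, alpha_v=0.1, gamma=0.9, lam=0.8):
    """One in-place update of a tabular actor/critic with an actor trace.

    theta[s] is the logit of choosing action 1 in state s (hypothetical policy),
    trace[s] is the actor's eligibility trace for theta[s],
    v[s] is the critic's value estimate for state s.
    Returns the TD error used for both updates.
    """
    # Critic's temporal-difference error, used as the reinforcement signal.
    td_error = reward + gamma * v[next_state] - v[state]

    # Decay every trace, then add the score function of the chosen action.
    p1 = 1.0 / (1.0 + math.exp(-theta[state]))
    for i in range(len(trace)):
        trace[i] *= lam
    trace[state] += (1.0 - p1) if action == 1 else -p1

    # Actor: move each parameter along its trace, scaled by the TD error.
    for i in range(len(theta)):
        theta[i] += alpha * td_error * trace[i]

    # Critic: plain TD(0) update of the visited state's value.
    v[state] += alpha_v * td_error
    return td_error
```

Because the trace accumulates past score functions, a reward arriving later still credits earlier actions, which is the mechanism the abstract describes as using a gradient of the actual return.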
Proceedings Article•
Reinforcement Learning in POMDP's via Direct Gradient Ascent

[...]

Jonathan Baxter1, Peter L. Bartlett1•
Australian National University1
29 Jun 2000
TL;DR: GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy, is introduced, and its convergence is proved.
Abstract: This paper discusses theoretical and experimental aspects of gradient-based approaches to the direct optimization of policy performance in controlled POMDPs. We introduce GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy. The algorithm’s chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of bias-variance trade-off, and it requires no knowledge of the underlying state. We prove convergence of GPOMDP and show how the gradient estimates produced by GPOMDP can be used in a conjugate-gradient procedure to find local optima of the average reward.
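A single-sample-path gradient estimator of the kind described can be sketched as follows. The toy environment (one observation, two actions, reward 1 for action 1), the logistic policy, and the names `gpomdp_estimate`, `theta`, and `seed` are assumptions for illustration; only the role of the trace discount `beta` in [0, 1) comes from the abstract.

```python
import math
import random


def gpomdp_estimate(theta, beta, steps, seed=0):
    """Estimate d(average reward)/d(theta) from one sample path.

    Hypothetical toy problem: a single observation and two actions,
    with P(action = 1) = sigmoid(theta) and reward equal to the action.
    """
    rng = random.Random(seed)
    z = 0.0      # eligibility trace of discounted score functions
    delta = 0.0  # running average that serves as the gradient estimate
    for t in range(steps):
        p1 = 1.0 / (1.0 + math.exp(-theta))
        action = 1 if rng.random() < p1 else 0
        # Score function d/dtheta log P(action) for the logistic policy.
        score = (1.0 - p1) if action == 1 else -p1
        reward = float(action)
        # beta trades bias for variance: larger beta remembers more history.
        z = beta * z + score
        delta += (reward * z - delta) / (t + 1)
    return delta
```

In this toy problem the true gradient is sigmoid(theta) * (1 - sigmoid(theta)), which equals 0.25 at `theta = 0`, so `gpomdp_estimate(0.0, 0.0, 20000)` should land close to that value. Note that the estimator never looks at any underlying state, only at actions, rewards, and the policy's score function.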
Proceedings Article•10.1145/336595.337546•
Learning user's preferences by analyzing Web-browsing behaviors

[...]

Young-Woo Seo1, Byoung-Tak Zhang1•
Seoul National University1
1 Jun 2000
TL;DR: The proposed method observes users' reactions to the filtered documents and learns individual user profiles from them; reinforcement learning is used to adapt the most significant terms that best represent each user's interests.
Abstract: This paper describes a method for an information filtering agent to learn user's preferences. The proposed method observes user's reactions to the filtered documents and learns from them the profiles for the individual users. Reinforcement learning is used to adapt the most significant terms that best represent user's interests. In contrast to conventional relevance feedback methods which require explicit user feedbacks, our approach learns user preferences implicitly from direct observations of browsing behaviors during interaction. Field tests have been made which involved 10 users reading a total of 18,750 HTML documents during 45 days. The proposed method showed superior performance in personalized information filtering compared to the existing relevance feedback methods.
Journal Article•10.1109/49.824799•
Adaptive call admission control under quality of service constraints: a reinforcement learning solution

[...]

Hui Tong1, Timothy X. Brown•
University of Colorado Boulder1
01 Feb 2000-IEEE Journal on Selected Areas in Communications
TL;DR: This work solves the adaptive call admission control (CAC) problem in multimedia networks via reinforcement learning (RL) and is able to earn significantly higher revenues than alternative heuristics.
Abstract: We solve the adaptive call admission control (CAC) problem in multimedia networks via reinforcement learning (RL). The problem requires that network revenue be maximized while simultaneously meeting quality of service (QoS) constraints that forbid entry into certain states and use of certain actions. We show that RL provides a solution to this constrained semi-Markov decision problem and is able to earn significantly higher revenues than alternative heuristics. Unlike other model-based algorithms, RL does not require the explicit state transition models to solve the decision problems. This feature is very important if one considers large integrated service networks supporting a number of different service types, where the number of states is so large that model-based optimization algorithms are infeasible. Both packet-level and call-level QoS constraints are addressed, and both conservative and aggressive approaches to the QoS constraints are considered. Results are demonstrated on a single link and extended to routing on a multilink network.
Next Generation Intrusion Detection: Autonomous Reinforcement Learning of Network Attacks

[...]

Cannady1, D James•
Georgia Institute of Technology1
1 Jan 2000
TL;DR: A new approach to applying adaptive neural networks to intrusion detection that is capable of autonomously learning new attacks rapidly through the use of a modified reinforcement learning method that uses feedback from the protected system is presented.
Abstract: The timely and accurate detection of computer and network system intrusions has always been an elusive goal for system administrators and information security researchers. Existing intrusion detection approaches require either manual coding of new attacks in expert systems or the complete retraining of a neural network to improve analysis or learn new attacks. This paper presents a new approach to applying adaptive neural networks to intrusion detection that is capable of autonomously learning new attacks rapidly through the use of a modified reinforcement learning method that uses feedback from the protected system. The approach has been demonstrated to be extremely effective in learning new attacks, detecting previously learned attacks in a network data stream, and in autonomously improving its analysis over time using feedback from the protected system.
Book Chapter•10.1007/3-540-45027-0_4•
An Introduction to Learning Fuzzy Classifier Systems

[...]

Andrea Bonarini1•
Polytechnic University of Milan1
01 Jan 2000-Lecture Notes in Computer Science
TL;DR: A class of Learning Classifier Systems that learn fuzzy rule-based models, instead of interval-based or Boolean models is presented, and a general methodology to extend reinforcement distribution algorithms usually not designed to learn fuzzy models is introduced.
Abstract: We present a class of Learning Classifier Systems that learn fuzzy rule-based models, instead of interval-based or Boolean models. We discuss some motivations to consider Learning Fuzzy Classifier Systems (LFCS) as a promising approach to learn mappings from real-valued input to real-valued output, basing on data interpretation implemented by fuzzy sets. We describe some of the approaches explicitly or implicitly referring to this research area, presented in literature since the beginning of the last decade. We also show how the general LFCS model can be considered as a framework for a wide range of systems, each implementing in a different way the modules composing the basic architecture. We also mention some of the applications of LFCS presented in literature, which show the potentialities of this type of systems. Finally, we introduce a general methodology to extend reinforcement distribution algorithms usually not designed to learn fuzzy models. This opens new application possibilities.
Journal Article•10.1023/A:1007686309208•
A Study of Reinforcement Learning in the Continuous Case by the Means of Viscosity Solutions

[...]

Rémi Munos1•
Carnegie Mellon University1
01 Sep 2000-Machine Learning
TL;DR: A general convergence theorem is derived for RL algorithms when one uses only “approximations” of the initial data, which can be used for model-based or model-free RL algorithms, with off-line or on-line updating methods, for deterministic or stochastic state dynamics, and based on FE or FD discretization methods.
Abstract: This paper proposes a study of Reinforcement Learning (RL) for continuous state-space and time control problems, based on the theoretical framework of viscosity solutions (VSs). We use the method of dynamic programming (DP) which introduces the value function (VF), expectation of the best future cumulative reinforcement. In the continuous case, the value function satisfies a non-linear first (or second) order (depending on the deterministic or stochastic aspect of the process) differential equation called the Hamilton-Jacobi-Bellman (HJB) equation. It is well known that there exists an infinity of generalized solutions (differentiable almost everywhere) to this equation, other than the VF. We show that gradient-descent methods may converge to one of these generalized solutions, thus failing to find the optimal control. In order to solve the HJB equation, we use the powerful framework of viscosity solutions and state that there exists a unique viscosity solution to the HJB equation, which is the value function. Then, we use another main result of VSs (their stability when passing to the limit) to prove the convergence of numerical approximation schemes based on finite difference (FD) and finite element (FE) methods. These methods discretize, at some resolution, the HJB equation into a DP equation of a Markov Decision Process (MDP), which can be solved by DP methods (thanks to a “strong” contraction property) if all the initial data (the state dynamics and the reinforcement function) were perfectly known. However, in the RL approach, as we consider a system in interaction with some a priori (at least partially) unknown environment, which learns “from experience”, the initial data are not perfectly known but have to be approximated during learning. The main contribution of this work is to derive a general convergence theorem for RL algorithms when one uses only “approximations” (in a sense of satisfying some “weak” contraction property) of the initial data. 
This result can be used for model-based or model-free RL algorithms, with off-line or on-line updating methods, for deterministic or stochastic state dynamics (though this latter case is not described here), and based on FE or FD discretization methods. It is illustrated with several RL algorithms and one numerical simulation for the “Car on the Hill” problem.
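For orientation, one common textbook form of the deterministic, discounted HJB equation is written out below, together with a standard one-dimensional illustration of non-unique generalized solutions. The notation (dynamics f, reinforcement r, discount rate ρ, control set U) is assumed here and may differ from the paper's; the eikonal example is a classical illustration, not one taken from the paper.

```latex
% Assumed setting: \dot{x}(t) = f(x(t), u(t)), discount rate \rho > 0.
\rho\, V(x) \;=\; \max_{u \in U} \bigl[\, r(x,u) + \nabla V(x) \cdot f(x,u) \,\bigr]

% Classical illustration of non-uniqueness (eikonal equation on (-1, 1)):
|V'(x)| - 1 = 0 \quad \text{for } x \in (-1, 1), \qquad V(-1) = V(1) = 0 .
% Every sawtooth function with slopes \pm 1 and zero boundary values satisfies
% the equation almost everywhere, but the unique viscosity solution is
V(x) = 1 - |x| .
```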