TL;DR: This book attempts to give an overview of recent efforts to deal with dataset shift, a challenging situation where the joint distribution of inputs and outputs differs between the training and test stages, and with covariate shift, the simpler special case where only the input distribution changes.
Abstract: Dataset shift is a challenging situation where the joint distribution of inputs and outputs differs between the training and test stages. Covariate shift is a simpler particular case of dataset shift where only the input distribution changes (covariate denotes input), while the conditional distribution of the outputs given the inputs p(y|x) remains unchanged. Dataset shift is present in most practical applications for reasons ranging from the bias introduced by experimental design to the mere irreproducibility of the testing conditions at training time. For example, in an image classification task, training data might have been recorded under controlled laboratory conditions, whereas the test data may show different lighting conditions. In other applications, the process that generates data is in itself adaptive. Some of our authors consider the problem of spam email filtering: successful "spammers" will try to build spam in a form that differs from the spam the automatic filter has been built on. Dataset shift seems to have raised relatively little interest in the machine learning community until very recently. Indeed, many machine learning algorithms are based on the assumption that the training data is drawn from exactly the same distribution as the test data on which the model will later be evaluated. Semi-supervised learning and active learning, two problems that seem very similar to covariate shift, have received much more attention. How do they differ from covariate shift?
Semi-supervised learning (SSL) is designed to take advantage of unlabeled data present at training time, but is not conceived to be robust against changes in the input distribution. In fact, one can easily construct examples of covariate shift for which common SSL strategies such as the "cluster assumption" will lead to disaster. In active learning the algorithm is asked to select from the available unlabeled inputs those for which obtaining the label will be most beneficial for learning. This is very relevant in contexts where labeling data is very costly, but active learning strategies are not specifically designed for dealing with covariate shift. This book attempts to give an overview of the different recent efforts that are being …
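The two definitions given in this abstract can be written side by side. The subscripts "train" and "test" are notation assumed here for illustration; the abstract itself only names p(y|x).

```latex
% Dataset shift: the joint distribution differs between stages.
p_{\mathrm{train}}(x, y) \neq p_{\mathrm{test}}(x, y)

% Covariate shift: only the input marginal changes; p(y | x) is shared.
p_{\mathrm{train}}(x, y) = p(y \mid x)\, p_{\mathrm{train}}(x), \qquad
p_{\mathrm{test}}(x, y)  = p(y \mid x)\, p_{\mathrm{test}}(x), \qquad
p_{\mathrm{train}}(x) \neq p_{\mathrm{test}}(x)
```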
TL;DR: This work proposes a representation for collaborative filtering tasks that allows the application of virtually any machine learning algorithm, and identifies the shortcomings of current collaborative filtering techniques and proposes the use of learning algorithms paired with feature extraction techniques that specifically address the limitations of previous approaches.
Abstract: Predicting items a user would like on the basis of other users’ ratings for these items has become a well-established strategy adopted by many recommendation services on the Internet. Although this can be seen as a classification problem, algorithms proposed thus far do not draw on results from the machine learning literature. We propose a representation for collaborative filtering tasks that allows the application of virtually any machine learning algorithm. We identify the shortcomings of current collaborative filtering techniques and propose the use of learning algorithms paired with feature extraction techniques that specifically address the limitations of previous approaches. Our best-performing algorithm is based on the singular value decomposition of an initial matrix of user ratings, exploiting latent structure that essentially eliminates the need for users to rate common items in order to become predictors for one another's preferences. We evaluate the proposed algorithm on a large database of user ratings for motion pictures and find that our approach significantly outperforms current collaborative filtering algorithms.
TL;DR: The deterministic annealing approach to clustering and its extensions has demonstrated substantial performance improvement over standard supervised and unsupervised learning methods in a variety of important applications including compression, estimation, pattern recognition and classification, and statistical regression.
Abstract: The deterministic annealing approach to clustering and its extensions has demonstrated substantial performance improvement over standard supervised and unsupervised learning methods in a variety of important applications including compression, estimation, pattern recognition and classification, and statistical regression. The application-specific cost is minimized subject to a constraint on the randomness of the solution, which is gradually lowered. We emphasize the intuition gained from analogy to statistical physics. Alternatively the method is derived within rate-distortion theory, where the annealing process is equivalent to computation of Shannon's rate-distortion function, and the annealing temperature is inversely proportional to the slope of the curve. The basic algorithm is extended by incorporating structural constraints to allow optimization of numerous popular structures including vector quantizers, decision trees, multilayer perceptrons, radial basis functions, and mixtures of experts.
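The idea of minimizing a cost under a gradually lowered randomness constraint can be illustrated with a small sketch. The code below is a simplified one-dimensional illustration, not the authors' algorithm: the squared-distance cost, the geometric cooling schedule, the jitter used to let coincident centers separate, and all parameter names are assumptions made for this example.

```python
import math


def deterministic_annealing_1d(points, n_centers=2, t_start=100.0, t_end=0.01,
                               cooling=0.9, inner_iters=20, jitter=1e-3):
    """Cluster 1-D points with soft assignments whose temperature is lowered.

    All defaults are illustrative assumptions, not values from the paper.
    """
    mean = sum(points) / len(points)
    centers = [mean] * n_centers  # start with every center at the data mean
    t = t_start
    while t > t_end:
        # Re-perturb so that coincident centers can separate once t is low enough.
        centers = [c + jitter * (j - (n_centers - 1) / 2.0)
                   for j, c in enumerate(centers)]
        for _ in range(inner_iters):
            sums = [0.0] * n_centers
            masses = [0.0] * n_centers
            for x in points:
                dists = [(x - c) ** 2 for c in centers]
                d_min = min(dists)  # subtract the minimum for numerical stability
                weights = [math.exp(-(d - d_min) / t) for d in dists]
                z = sum(weights)
                for j in range(n_centers):
                    p = weights[j] / z  # Gibbs (soft) assignment probability
                    sums[j] += p * x
                    masses[j] += p
            # Each center moves to the probability-weighted mean of the points.
            centers = [sums[j] / masses[j] if masses[j] > 0.0 else centers[j]
                       for j in range(n_centers)]
        t *= cooling  # lower the temperature, i.e. the allowed randomness
    return sorted(centers)
```

On two well-separated groups of points, the centers stay close to the overall data mean while the temperature is high and move to the group means once it has been lowered far enough.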
TL;DR: It is shown that very simple templates are sufficient and that performance improves with more user interaction; the Diverse Density algorithm, a method of learning from ambiguous examples, is also discussed.
Abstract: We investigate a method called Multiple-Instance learning for learning simple templates that capture the color and spatial properties of classes of natural scene images from a small set of examples. These templates encode a scene class as image patches with color and spatial relations and can be used to classify a variety of natural scenes like fields, waterfalls and mountains, among others. Example images are ambiguous since there are many possible templates that can describe an individual image. Multiple-Instance learning makes the ambiguity explicit, and we discuss the Diverse Density algorithm, which is a method of learning from ambiguous examples. The system uses very low resolution images to extract the templates from a set of examples. Once a template is learned, we use the COREL photo library to test its retrieval rates and accuracy. We show that very simple templates are sufficient, and that performance improves with more user interaction.
TL;DR: The paper defines a hierarchical Q learning algorithm, proves its convergence, and shows experimentally that it can learn much faster than ordinary “flat” Q learning.
Abstract: This paper presents a new approach to hierarchical reinforcement learning based on the MAXQ decomposition of the value function. The MAXQ decomposition has both a procedural semantics—as a subroutine hierarchy—and a declarative semantics—as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. Conditions under which the MAXQ decomposition can represent the optimal value function are derived. The paper defines a hierarchical Q learning algorithm, proves its convergence, and shows experimentally that it can learn much faster than ordinary “flat” Q learning. Finally, the paper discusses some interesting issues that arise in hierarchical reinforcement learning including the hierarchical credit assignment problem and non-hierarchical execution of the MAXQ hierarchy.
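For orientation, the decomposition named in this abstract is commonly written as below. The symbols (subtask i, child action a, completion function C) follow the usual presentation of MAXQ and are assumed here, since the abstract does not state the equations.

```latex
% Value of invoking child a inside parent subtask i, in state s:
% the value of a itself plus the value of completing i afterwards.
Q^{\pi}(i, s, a) = V^{\pi}(a, s) + C^{\pi}(i, s, a)

% Value of a subtask: recursive for composite subtasks,
% expected immediate reward for primitive actions.
V^{\pi}(i, s) =
\begin{cases}
Q^{\pi}\bigl(i, s, \pi_i(s)\bigr) & \text{if } i \text{ is composite} \\
\sum_{s'} P(s' \mid s, i)\, R(s' \mid s, i) & \text{if } i \text{ is primitive}
\end{cases}
```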
TL;DR: A very simple “baseline architecture” for a learning agent that can handle stochastic, partially observable environments is proposed; it uses reinforcement learning together with a method for representing temporal processes as graphical models. Whether reinforcement learning can provide a good model of animal and human learning is also discussed.
Abstract: This talk proposes a very simple “baseline architecture” for a learning agent that can handle stochastic, partially observable environments. The architecture uses reinforcement learning together with a method for representing temporal processes as graphical models. I will discuss methods for learning the parameters and structure of such representations from sensory inputs, and for computing posterior probabilities. Some open problems remain before we can try out the complete agent; more arise when we consider scaling up. A second theme of the talk will be whether reinforcement learning can provide a good model of animal and human learning. To answer this question, we must do inverse reinforcement learning: given the observed behaviour, what reward signal, if any, is being optimized? This seems to be a very interesting problem for the COLT, UAI, and ML communities, and has been addressed in econometrics under the heading of structural estimation of Markov decision processes.
1 Learning in uncertain environments
AI is about the construction of intelligent agents, i.e., systems that perceive and act effectively (according to some performance measure) in an environment. I have argued elsewhere (Russell and Norvig, 1995) that most AI research has focused on environments that are static, deterministic, discrete, and fully observable. What is to be done when, as in the real world, the environment is dynamic, stochastic, continuous, and partially observable?
This paper draws on a variety of research efforts supported by NSF (IRI-9634215), ONR (N00014-97-1-0941), and ARO (DAAH04-96-1-0341). COLT 98, Madison, WI, USA. Copyright ACM 1998.
In recent years, reinforcement learning (also called neurodynamic programming) has made rapid progress as an approach for building agents automatically (Sutton, 1988; Kaelbling et al., 1996; Bertsekas & Tsitsiklis, 1996). The basic idea is that the performance measure is made available to the agent in the form of a reward function specifying the reward for each state that the agent passes through. The performance measure is then the sum of the rewards obtained. For example, when a bumble bee forages, the reward function at each time step might be some combination of the distance flown (weighted negatively) and the nectar ingested. Reinforcement learning (RL) methods are essentially online algorithms for solving Markov decision processes (MDPs). An MDP is defined by the reward function and a model, that is, the state transition probabilities conditioned on each possible action. RL algorithms can be model-based, where the agent learns a model, or model-free, e.g., Q-learning (Watkins, 1989), which learns just a function Q(s, a) specifying the long-term value of taking action a in state s and acting optimally thereafter. Despite their successes, RL methods have been restricted largely to fully observable MDPs, in which the sensory input at each state is sufficient to identify the state. Obviously, in the real world, we must often deal with partially observable MDPs (POMDPs). Astrom (1965) proved that optimal decisions in POMDPs depend on the belief state b at each point in time, i.e., the posterior probability distribution over all possible actual states, given all evidence to date. The functions V and Q then become functions of b instead of s.
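The belief state b described above can be maintained with a standard discrete Bayes filter. The sketch below is a generic textbook update over an explicit probability vector, not the DBN-based method of the paper; the function name and the table layouts are assumptions made for illustration.

```python
def update_belief(belief, transition, observation_likelihood):
    """One belief-state update for a fixed action and a received observation.

    belief[s]                  -- prior probability of state s
    transition[s][s2]          -- P(s2 | s, a) for the action that was taken
    observation_likelihood[s2] -- P(o | s2) for the observation that arrived
    (All three layouts are illustrative assumptions.)
    """
    n = len(belief)
    # Prediction step: push the prior belief through the transition model.
    predicted = [sum(belief[s] * transition[s][s2] for s in range(n))
                 for s2 in range(n)]
    # Correction step: weight each state by how well it explains the observation.
    unnormalized = [observation_likelihood[s2] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalized]
```

With an explicit vector the cost of each update grows with the square of the number of states, which illustrates why a compact model representation is needed when there are many state variables.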
Parr and Russell (1995) describe a very simple POMDP RL algorithm using an explicit representation of b as a vector of probabilities, and McCallum (1993) shows a way to approximate the belief state using recent percept sequences. Neither approach is likely to scale up to situations with large numbers of state variables and long-term temporal dependencies. What is needed is a way of representing the model compactly and updating the belief state efficiently given the model and each new observation. Dynamic Bayesian networks (Dean & Kanazawa, 1989) seem to have some of the required properties; in particular, they have significant advantages over other approaches such as Kalman filters and hidden Markov models. Our baseline architecture, shown in Figure 1, uses DBNs to represent and update the belief state as new sensor information arrives. Given a representation for b, the reward signal is used to learn a Q-function represented by some “black-box” function approximator such as a neural network. Provided we can handle hybrid (dis-
TL;DR: A simple learning rule is derived, the VAPS algorithm, which can be instantiated to generate a wide range of new reinforcement-learning algorithms, and allows policy-search and value-based algorithms to be combined, thus unifying two very different approaches to reinforcement learning into a single Value and Policy Search algorithm.
Abstract: A simple learning rule is derived, the VAPS algorithm, which can be instantiated to generate a wide range of new reinforcement-learning algorithms. These algorithms solve a number of open problems, define several new approaches to reinforcement learning, and unify different approaches to reinforcement learning under a single theory. These algorithms all have guaranteed convergence, and include modifications of several existing algorithms that were known to fail to converge on simple MDPs. These include Q-learning, SARSA, and advantage learning. In addition to these value-based algorithms it also generates pure policy-search reinforcement-learning algorithms, which learn optimal policies without learning a value function. In addition, it allows policy-search and value-based algorithms to be combined, thus unifying two very different approaches to reinforcement learning into a single Value and Policy Search (VAPS) algorithm. These algorithms also converge for POMDPs without requiring a proper belief state. Simulation results are given, and several areas for future research are discussed.
TL;DR: Unsupervised self-organizing maps (SOM), as well as supervised learning by Learning Vector Quantization (LVQ), can also be defined for string variables when the SOM and the LVQ algorithms are expressed as batch versions.
TL;DR: This risk-sensitive reinforcement learning algorithm is based on a very different philosophy and reflects important properties of the classical exponential utility framework, but avoids its serious drawbacks for learning.
Abstract: Most reinforcement learning algorithms optimize the expected return of a Markov Decision Problem. Practice has taught us the lesson that this criterion is not always the most suitable because many applications require robust control strategies which also take into account the variance of the return. Classical control literature provides several techniques to deal with risk-sensitive optimization goals like the so-called worst-case optimality criterion exclusively focusing on risk-avoiding policies or classical risk-sensitive control, which transforms the returns by exponential utility functions. While the first approach is typically too restrictive, the latter suffers from the absence of an obvious way to design a corresponding model-free reinforcement learning algorithm.
Our risk-sensitive reinforcement learning algorithm is based on a very different philosophy. Instead of transforming the return of the process, we transform the temporal differences during learning. While our approach reflects important properties of the classical exponential utility framework, we avoid its serious drawbacks for learning. Based on an extended set of optimality equations we are able to formulate risk-sensitive versions of various well-known reinforcement learning algorithms which converge with probability one under the usual conditions.
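The idea of transforming temporal differences rather than returns can be sketched for a tabular TD(0) value update. The abstract does not give the transformation, so the asymmetric weighting below (positive differences scaled by 1 - kappa, negative ones by 1 + kappa) is an assumption chosen for illustration; the names `transform_td`, `risk_sensitive_td0_update`, and `kappa` are likewise hypothetical.

```python
def transform_td(delta, kappa):
    """Asymmetrically weight a temporal difference (illustrative assumption).

    kappa in (-1, 1): kappa > 0 emphasizes negative surprises (risk-averse),
    kappa < 0 emphasizes positive ones, kappa == 0 recovers the plain TD error.
    """
    if not -1.0 < kappa < 1.0:
        raise ValueError("kappa must lie strictly between -1 and 1")
    if delta > 0:
        return (1.0 - kappa) * delta
    return (1.0 + kappa) * delta


def risk_sensitive_td0_update(values, state, reward, next_state,
                              alpha, gamma, kappa):
    """Apply one TD(0) update in place, using the transformed difference."""
    delta = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * transform_td(delta, kappa)
    return delta
```

Setting `kappa` to zero gives the ordinary TD(0) update, so the sketch can be compared directly against a risk-neutral learner.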
TL;DR: The authors presented an algorithm combining variants of Winnow and weighted-majority voting, and applied it to a problem in the aforementioned class: context-sensitive spelling correction, which is the task of fixing spelling errors that happen to result in valid words, such as substituting "to" for "too", "casual" for "causal", etc.
Abstract: A large class of machine-learning problems in natural language require the characterization of linguistic context. Two characteristic properties of such problems are that their feature space is of very high dimensionality, and their target concepts refer to only a small subset of the features in the space. Under such conditions, multiplicative weight-update algorithms such as Winnow have been shown to have exceptionally good theoretical properties. We present an algorithm combining variants of Winnow and weighted-majority voting, and apply it to a problem in the aforementioned class: context-sensitive spelling correction. This is the task of fixing spelling errors that happen to result in valid words, such as substituting "to" for "too", "casual" for "causal", etc. We evaluate our algorithm, WinSpell, by comparing it against BaySpell, a statistics-based method representing the state of the art for this task. We find: (1) When run with a full (unpruned) set of features, WinSpell achieves accuracies significantly higher than BaySpell was able to achieve in either the pruned or unpruned condition; (2) When compared with other systems in the literature, WinSpell exhibits the highest performance; (3) The primary reason that WinSpell outperforms BaySpell is that WinSpell learns a better linear separator; (4) When run on a test set drawn from a different corpus than the training set was drawn from, WinSpell is better able than BaySpell to adapt, using a strategy we will present that combines supervised learning on the training set with unsupervised learning on the (noisy) test set.
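The multiplicative weight-update scheme mentioned above can be sketched with a basic Winnow learner over sparse binary features. This is a textbook variant, not WinSpell itself: the promotion factor, the threshold choice, and the function names are assumptions made for illustration, and the weighted-majority combination is not shown.

```python
def winnow_train(examples, n_features, alpha=2.0, threshold=None, epochs=10):
    """Train a basic Winnow classifier.

    examples -- list of (active_feature_indices, label) pairs, label in {0, 1}
    alpha, threshold, epochs -- illustrative defaults, not values from the paper
    """
    theta = float(n_features) if threshold is None else threshold
    weights = [1.0] * n_features
    for _ in range(epochs):
        mistakes = 0
        for active, label in examples:
            score = sum(weights[i] for i in active)
            predicted = 1 if score >= theta else 0
            if predicted == label:
                continue
            mistakes += 1
            # Promote active weights after a missed positive,
            # demote them after a false positive.
            factor = alpha if label == 1 else 1.0 / alpha
            for i in active:
                weights[i] *= factor
        if mistakes == 0:
            break  # the training set is classified without error
    return weights, theta


def winnow_predict(weights, theta, active):
    """Predict 1 if the summed weights of the active features reach theta."""
    return 1 if sum(weights[i] for i in active) >= theta else 0
```

Only the weights of features that are active in a misclassified example are touched, which is what makes the update cheap when the feature space is very large and each example activates few features.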
TL;DR: This paper presents the general strategy for designing learning machines as well as a number of particular designs based on two main principles: simple adaptive local models; and adaptive model distribution.
Abstract: This paper presents our general strategy for designing learning machines as well as a number of particular designs. The search for methods allowing a sufficient level of adaptivity is based on two main principles: 1) simple adaptive local models; and 2) adaptive model distribution. Particularly important concepts in our work are mutual information and canonical correlation. Examples are given on learning feature descriptors, modeling disparity, synthesis of a global 3-mode model and a setup for reinforcement learning of online video coder parameter control.
TL;DR: It is shown that k-DNF and k-decision lists are learnable in two PAC models that use only positive and unlabeled data (examples or statistical queries), i.e. with far less information than is assumed in previously used algorithms.
Abstract: Learning from positive examples occurs very frequently in natural learning. The PAC learning model of Valiant takes many features of natural learning into account, but in most cases it fails to describe this kind of learning. We show that in order to make learning from positive data possible, extra information about the underlying distribution must be provided to the learner. We define a PAC learning model from positive and unlabeled examples. We also define a PAC learning model from positive and unlabeled statistical queries. Relations with the PAC model ([Val84]), the statistical query model ([Kea93]) and the constant-partition classification noise model ([Dec97]) are studied. We show that k-DNF and k-decision lists are learnable in both models, i.e. with far less information than is assumed in previously used algorithms.
TL;DR: The final chapter modeled the development of viewpoint invariant responses to faces from visual experience in a biological system by encoding spatio-temporal dependencies.
Abstract: In a task such as face recognition, much of the important information may be contained in the high-order relationships among the image pixels. Representations such as "Eigenfaces" (197) and "Holons" (48) are based on Principal component analysis (PCA), which encodes the correlational structure of the input, but does not address high-order statistical dependencies such as relationships among three or more pixels. Independent component analysis (ICA) is a generalization of PCA which encodes the high-order dependencies in the input in addition to the correlations. Representations for face recognition were developed from the independent components of face images. The ICA representations were superior to PCA for recognizing faces across sessions and changes in expression.
ICA was compared to more than eight other image analysis methods on a task of recognizing facial expressions in a project to automate the Facial Action Coding System (62). These methods included estimation of optical flow; representations based on the second-order statistics of the full face images, such as Eigenfaces (47, 197), local feature analysis (156), and linear discriminant analysis (23); and representations based on the outputs of local filters, such as Gabor wavelet representations (50, 113) and local PCA (153). The ICA and Gabor wavelet representations achieved the best performance of 96% for classifying 12 facial actions. Relationships between the independent component representation and the Gabor representation are discussed.
Temporal redundancy contains information for learning invariances. Different views of a face tend to appear in close temporal proximity as the person changes expression, pose, or moves through the environment. The final chapter modeled the development of viewpoint invariant responses to faces from visual experience in a biological system by encoding spatio-temporal dependencies. The simulations combined temporal smoothing of activity signals with Hebbian learning (72) in a network with both feed-forward connections and a recurrent layer that was a generalization of a Hopfield attractor network. Following training on sequences of graylevel images of faces as they changed pose, multiple views of a given face fell into the same basin of attraction, and the system acquired representations of faces that were approximately viewpoint invariant.
TL;DR: To increase robustness of machine learning approaches, methods are desirable that can reason about the relatedness of individual learning tasks, in order to avoid the danger arising from tasks that are unrelated and thus potentially misleading.
Abstract: Recently, there has been an increased interest in machine learning methods that transfer knowledge across multiple learning tasks and “learn to learn.” Such methods have repeatedly been found to outperform conventional, single-task learning algorithms when the learning tasks are appropriately related. To increase robustness of such approaches, methods are desirable that can reason about the relatedness of individual learning tasks, in order to avoid the danger arising from tasks that are unrelated and thus potentially misleading.
TL;DR: A model-fitting technique is proposed in which the complete data log-likelihood functional is modeled as an exponential function in the number of classes acting, and this new validation technique is shown to be robust and outperform the ICs in the authors' experiments.
Abstract: Image segmentation is an important and early processing stage in many image analysis problems. Often, this must be done in an unsupervised fashion in that training data is not available and the class-conditioned feature vectors must be estimated directly from the data. A major problem in such applications is the determination of the number of classes actually present in an image. This problem, called the cluster validation problem, remains essentially unsolved. We investigate the cluster validation problem associated with the use of a previously developed unsupervised segmentation algorithm based upon the expectation-maximization (EM) algorithm. More specifically, we consider several well-known information-theoretic criteria (ICs) as candidate solutions to the validation problem when used in conjunction with this EM-based segmentation scheme. We show that these criteria generally provide inappropriate solutions due to the domination of the penalty term by the associated log-likelihood function. As an alternative we propose a model-fitting technique in which the complete data log-likelihood functional is modeled as an exponential function in the number of classes acting. The estimated number of classes is then determined in a manner similar to finding the rise time of the exponential function. This new validation technique is shown to be robust and to outperform the ICs in our experiments. Experimental results for both synthetic and real world imagery are detailed.
TL;DR: Bounds for the gradient and Hessian of the error considered as a function of centers for networks of fixed size are calculated, and with them it is possible to know by how much one can reduce the error by changing the centers.
Abstract: In radial basis function networks, placement of centers has been one of the problems addressed and has a significant effect on the performance of the network. Supervised learning of center locations in some applications shows that the resulting networks are superior to networks whose centers are located using unsupervised methods. Supervised learning of centers seems to offset the advantages achieved by the two-stage learning of RBF networks. One way to overcome this may be to train the network with a set of centers selected by unsupervised methods and then to fine-tune the centers. This can be done by evaluating whether moving the centers would decrease the error. In this paper we have calculated bounds for the gradient and Hessian of the error considered as a function of centers for networks of fixed size. Using these bounds it is possible to know by how much one can reduce the error by changing the centers. Furthermore, step size can be specified to achieve a guaranteed amount of reduction in error.
TL;DR: A framework to predict the full probability distribution is presented as a mixture model: the dynamics of the individual states is modeled with so-called "experts" (potentially nonlinear neural networks), and the dynamics between the states is modeled using a hidden Markov approach.
Abstract: Most approaches in forecasting merely try to predict the next value of the time series. In contrast, this paper presents a framework to predict the full probability distribution. It is expressed as a mixture model: the dynamics of the individual states is modeled with so-called "experts" (potentially nonlinear neural networks), and the dynamics between the states is modeled using a hidden Markov approach. The full density predictions are obtained by a weighted superposition of the individual densities of each expert. This model class is called "hidden Markov experts". Results are presented for daily S&P500 data. While the predictive accuracy of the mean does not improve over simpler models, evaluating the prediction of the full density shows a clear out-of-sample improvement both over a simple GARCH(1,1) model (which assumes Gaussian distributed returns) and over a "gated experts" model (which expresses the weighting for each state non-recursively as a function of external inputs). Several interpretations are given: the blending of supervised and unsupervised learning, the discovery of hidden states, the combination of forecasts, the specialization of experts, the removal of outliers, and the persistence of volatility.
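The weighted superposition of expert densities can be sketched as follows. The code assumes, for illustration only, that each expert outputs a Gaussian density described by a mean and a standard deviation and that the state probabilities are already available; how the paper's experts and hidden Markov weights are actually computed is not shown, and the function names are hypothetical.

```python
import math


def gaussian_pdf(y, mean, std):
    """Density of a normal distribution with the given mean and std at y."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_density(y, state_probs, expert_params):
    """Predictive density at y as a weighted sum of per-expert densities.

    state_probs   -- probability of each hidden state (must sum to 1)
    expert_params -- (mean, std) predicted by the expert tied to each state
    (Gaussian experts are an assumption made for this sketch.)
    """
    if len(state_probs) != len(expert_params):
        raise ValueError("need exactly one expert per state")
    if abs(sum(state_probs) - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    return sum(p * gaussian_pdf(y, mean, std)
               for p, (mean, std) in zip(state_probs, expert_params))
```

Mixing a narrow and a wide Gaussian in this way already gives a heavier-tailed predictive density than any single Gaussian, which is one reason a full-density forecast can differ from a model that assumes Gaussian returns.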
TL;DR: The results show that a topology preserving behavior quite similar to that of Kohonen's self-organizing map can be achieved using temporal coding, and this model is a further step toward a more realistic description of unsupervised learning in biological neural systems.
Abstract: We propose a mechanism for unsupervised learning in networks of spiking neurons which is based on the timing of single firing events. Our results show that a topology preserving behavior quite similar to that of Kohonen's self-organizing map can be achieved using temporal coding. In contrast to previous approaches, which use rate coding, the winner among competing neurons can be determined fast and locally. Our model is a further step toward a more realistic description of unsupervised learning in biological neural systems. Furthermore, it may provide a basis for fast implementations in pulsed VLSI.
TL;DR: This paper proposes and discusses three experiments on knowledge acquisition using unsupervised and supervised learning techniques and results are promising with a prediction rate higher than 80% having been obtained.
Abstract: The nature of map generalization may be non-uniform along the length of an individual line, requiring the application of methods that adapt to the local geometry and the geographical context. Geographical databases need to be enriched in terms of shape description structures (geometrical knowledge), knowledge of appropriate order of operations and of appropriate algorithms (procedural knowledge). Stored knowledge should take account of semantic and morphological characteristics, and of cartographic constraints.
This paper proposes and discusses three experiments on knowledge acquisition using unsupervised and supervised learning techniques. In order to exploit geometrical shape knowledge, classifications were computed according to a set of morphological measures using unsupervised learning. Choice of appropriate operations was determined by the results of a test with IGN cartographers considering line characteristics. These results were given to a supervised learning algorithm, along with corresponding computed measures in order to discover rules. The approach and the resulting rules are presented and discussed. Tests have also been conducted on the tuning of parameter values, applying a Gaussian smoothing tolerance value to a set of lines using the supervised learning algorithm. The values obtained by means of the learning algorithm have been compared with interactive choices of an expert. Results are promising with a prediction rate higher than 80% having been obtained.
TL;DR: Different search methods for the CNeT are introduced, which are utilized for training as well as for recall, and compared with existing classifiers on a variety of pattern classification problems.
Abstract: Presents competitive neural trees (CNeTs) for pattern classification. The CNeT contains m-ary nodes and grows during learning by using inheritance to initialize new nodes. At the node level, the CNeT employs unsupervised competitive learning. The CNeT performs hierarchical clustering of the feature vectors presented to it as examples, while its growth is controlled by forward pruning. Because of the tree structure, the prototype in the CNeT close to any example can be determined by searching only a fraction of the tree. The paper introduces different search methods for the CNeT, which are utilized for training as well as for recall. The CNeT is evaluated and compared with existing classifiers on a variety of pattern classification problems.
TL;DR: This paper derives an unsupervised learning algorithm that enables the identification and visualization of latent structure within ensembles of high-dimensional data and is considered a promising method for analysis of real-world data consisting of sub- and super-Gaussian components, such as biomedical signals.
Abstract: This paper presents the derivation of an unsupervised learning algorithm, which enables the identification and visualization of latent structure within ensembles of high-dimensional data. This provides a linear projection of the data onto a lower dimensional subspace to identify the characteristic structure of the observations' independent latent causes. The algorithm is shown to be a very promising tool for unsupervised exploratory data analysis and data visualization. Experimental results confirm the attractiveness of this technique for exploratory data analysis and an empirical comparison is made with the recently proposed generative topographic mapping (GTM) and standard principal component analysis (PCA). Based on standard probability density models a generic nonlinearity is developed which allows both (1) identification and visualization of dichotomised clusters inherent in the observed data and (2) separation of sources with arbitrary distributions from mixtures, whose dimensionality may be greater than the number of sources. The resulting algorithm is therefore also a generalized neural approach to independent component analysis (ICA) and it is considered to be a promising method for analysis of real-world data that will consist of sub- and super-Gaussian components such as biomedical signals.
TL;DR: The performance of vehicle movement under the proposed evolutionary approach is superior compared with that of other approaches based on reinforcement learning that have been applied previously to the same benchmark problem.
TL;DR: A hybrid short-term electrical load forecasting model for a power system control centre that incorporates a Kohonen self-organising feature map with unsupervised learning for classification of daily load patterns, a supervised backpropagation neural network for mapping the temperature/load relationship, and a fuzzy expert system for postprocessing of neural network outputs is presented.
Abstract: The paper presents the development and practical implementation of a hybrid short-term electrical load forecasting model for a power system control centre. This hybrid architecture incorporates a Kohonen self-organising feature map with unsupervised learning for classification of daily load patterns, a supervised backpropagation neural network for mapping the temperature/load relationship, and a fuzzy expert system for postprocessing of neural network outputs. This load forecaster requires minimum operator intervention and can be trained adaptively on-line. The developed model has been tested extensively in the actual operating environment and has been shown to outperform the existing regression-based model.
TL;DR: SUSTAIN (Supervised and Unsupervised STratified Adaptive Incremental Network) is a network model of human category learning. It is a three-layer model in which learning between the first two layers is unsupervised, while learning between the top two layers is supervised.
Abstract: SUSTAIN (Supervised and Unsupervised STratified Adaptive Incremental Network) is a network model of human category learning. SUSTAIN is a three-layer model where learning between the first two layers is unsupervised, while learning between the top two layers is supervised. SUSTAIN clusters inputs in an unsupervised fashion until it groups input patterns inappropriately (as signaled by the supervised portion of the network). When such an error occurs, SUSTAIN alters its architecture, recruiting a new unit that is tuned to correctly classify the exception. Units recruited to capture exceptions can evolve into prototypes/attractors/rules in their own right. SUSTAIN's adaptive architecture allows it to master simple classification problems quickly, while still retaining the capacity to learn difficult mappings. SUSTAIN also adjusts its sensitivity to input dimensions during the course of learning, paying more attention to dimensions relevant to the classification task. Shepard, Hovland, and Jenkins's (1961) challenging category learning data are fit successfully by SUSTAIN. Other applications of SUSTAIN are discussed. SUSTAIN is compared to other classification models.
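The recruit-on-error idea described in the abstract can be sketched as follows. This is a deliberately simplified toy, not the published SUSTAIN equations: the class name `RecruitingClusterer`, the Euclidean winner rule, the learning rate, and the XOR-style data are all assumptions made for illustration, and the attention-tuning mechanism is omitted.

```python
import math


class RecruitingClusterer:
    """Toy recruit-on-error clusterer (hypothetical; not the SUSTAIN model itself)."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.centers = []  # cluster positions in input space
        self.labels = []   # category label attached to each cluster

    def _nearest(self, x):
        # Index of the cluster whose center is closest to x (Euclidean distance).
        dists = [math.dist(c, x) for c in self.centers]
        return dists.index(min(dists))

    def predict(self, x):
        if not self.centers:
            return None
        return self.labels[self._nearest(x)]

    def train_one(self, x, y):
        # Supervised error signal: if no cluster exists or the winning cluster
        # predicts the wrong label, recruit a new cluster centered on the example.
        if self.predict(x) != y:
            self.centers.append(list(x))
            self.labels.append(y)
            return
        # Otherwise move the winning cluster a small step toward the example.
        w = self._nearest(x)
        self.centers[w] = [
            c + self.learning_rate * (xi - c) for c, xi in zip(self.centers[w], x)
        ]


# XOR-style toy problem: each example is an "exception" to its nearest neighbor.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
model = RecruitingClusterer()
for _ in range(3):
    for x, y in data:
        model.train_one(x, y)
predictions = [model.predict(x) for x, _ in data]
```

On this toy problem every example triggers a recruitment in the first pass, so the model ends with one cluster per example; on easier problems a single cluster per category would suffice, which mirrors the abstract's point about mastering simple problems quickly while retaining capacity for difficult mappings.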
TL;DR: An unsupervised neural-based approach to financial forecasting is presented; its performance is compared with that of a statistical technique and two other standard neural network techniques.
Abstract: An unsupervised neural-based approach to financial forecasting is presented; its performance is compared with that of a statistical technique and two other standard neural network techniques. The authors show that the unsupervised network outperforms multilayer perceptrons, a radial basis function network and a standard ARIMA model.
TL;DR: Cost formulae are derived for three different parallelisation techniques for training supervised networks that are parameterised by properties of the target computer architecture and take into account optimal batch learning as the overall training approach.
Abstract: We derive cost formulae for three different parallelisation techniques for training supervised networks. These formulae are parameterised by properties of the target computer architecture. It is therefore possible to decide the best match between parallel computer and training technique. One technique, exemplar parallelism, is far superior for almost all parallel computer architectures. Formulae also take into account optimal batch learning as the overall training approach.