TL;DR: This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents, and presents two extensions to the algorithm that improve classification accuracy under these conditions.
Abstract: This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available.
We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
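The loop described above (train on labeled documents, probabilistically label the unlabeled ones, retrain, repeat) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `train_nb`, `posterior`, and `em_nb`, the Laplace smoothing, the fixed iteration count in place of a convergence test, and the parameter name `lam` for the weighting factor are all assumptions made for the example. Multiple mixture components per class are not shown.

```python
import math
from collections import defaultdict

def train_nb(docs, resp, classes, vocab):
    """M-step: fit a multinomial naive Bayes model from fractional class weights."""
    prior = {c: 1.0 for c in classes}                  # Laplace-smoothed class mass
    counts = {c: defaultdict(float) for c in classes}  # weighted word counts
    for doc, r in zip(docs, resp):
        for c in classes:
            prior[c] += r[c]
            for w in doc:
                counts[c][w] += r[c]
    total = sum(prior.values())
    log_prior = {c: math.log(prior[c] / total) for c in classes}
    log_lik = {}
    for c in classes:
        denom = sum(counts[c].values()) + len(vocab)
        log_lik[c] = {w: math.log((counts[c][w] + 1.0) / denom) for w in vocab}
    return log_prior, log_lik

def posterior(doc, model, classes):
    """E-step for one document: class posteriors under the current model."""
    log_prior, log_lik = model
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in doc if w in log_lik[c])
              for c in classes}
    top = max(scores.values())                         # subtract max for stability
    expd = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(expd.values())
    return {c: v / z for c, v in expd.items()}

def em_nb(labeled, unlabeled, classes, lam=1.0, iters=10):
    """EM over labeled (doc, label) pairs and unlabeled docs.

    lam (assumed name) scales the contribution of unlabeled documents;
    lam=0 reduces to training on the labeled documents only.
    """
    vocab = {w for d, _ in labeled for w in d} | {w for d in unlabeled for w in d}
    lab_docs = [d for d, _ in labeled]
    lab_resp = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    model = train_nb(lab_docs, lab_resp, classes, vocab)   # initial classifier
    for _ in range(iters):                                 # fixed count, not a convergence test
        unl_resp = [{c: lam * p for c, p in posterior(d, model, classes).items()}
                    for d in unlabeled]
        model = train_nb(lab_docs + unlabeled, lab_resp + unl_resp, classes, vocab)
    return model
```

With `lam=0.0` the unlabeled documents contribute nothing, which gives a convenient labeled-only baseline for comparison.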
TL;DR: It is demonstrated that when learning from labeled and unlabeled data, algorithms explicitly leveraging a natural independent split of the features outperform algorithms that do not, and that when no natural split exists, co-training algorithms that manufacture a feature split may outperform algorithms not using a split.
Abstract: Recently there has been significant interest in supervised learning algorithms that combine labeled and unlabeled data for text learning tasks. The co-training setting [1] applies to datasets that have a natural separation of their features into two disjoint sets. We demonstrate that when learning from labeled and unlabeled data, algorithms explicitly leveraging a natural independent split of the features outperform algorithms that do not. When a natural split does not exist, co-training algorithms that manufacture a feature split may outperform algorithms not using a split. These results help explain why co-training algorithms are both discriminative in nature and robust to the assumptions of their embedded classifiers.
TL;DR: A method to learn object class models from unlabeled and unsegmented cluttered scenes for the purpose of visual object recognition that achieves very good classification results on human faces and rear views of cars.
Abstract: We present a method to learn object class models from unlabeled and unsegmented cluttered scenes for the purpose of visual object recognition. We focus on a particular type of model where objects are represented as flexible constellations of rigid parts (features). The variability within a class is represented by a joint probability density function (pdf) on the shape of the constellation and the output of part detectors. In a first stage, the method automatically identifies distinctive parts in the training set by applying a clustering algorithm to patterns selected by an interest operator. It then learns the statistical shape model using expectation maximization. The method achieves very good classification results on human faces and rear views of cars.
TL;DR: This paper investigates the use of lazy learning and Hausdorff distance to approach the multiple-instance problem, and presents two variants of the K-nearest neighbor algorithm, called Bayesian-KNN and Citation-KNN, solving the multiple-instance problem.
Abstract: As opposed to traditional supervised learning, multiple-instance learning concerns the problem of classifying a bag of instances, given bags that are labeled by a teacher as being overall positive or negative. Current research mainly concentrates on adapting traditional concept learning to solve this problem. In this paper we investigate the use of lazy learning and Hausdorff distance to approach the multiple-instance problem. We present two variants of the K-nearest neighbor algorithm, called Bayesian-KNN and Citation-KNN, solving the multiple-instance problem. Experiments on the Drug discovery benchmark data show that both algorithms are competitive with the best ones conceived in the concept learning framework. Further work includes exploring a combination of lazy and eager multiple-instance problem classifiers.
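To make the bag-level distance concrete, here is a small sketch of a Hausdorff distance between two bags of instances, together with a plain majority-vote nearest-neighbor rule over bags. It is illustrative only: the "minimal" variant and the simple voting rule are assumptions for the example and are not the Bayesian-KNN or Citation-KNN procedures themselves. All function names are hypothetical.

```python
import math

def euclid(a, b):
    """Euclidean distance between two instances given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def max_hausdorff(bag_a, bag_b):
    """Standard (maximal) Hausdorff distance between two bags."""
    h_ab = max(min(euclid(a, b) for b in bag_b) for a in bag_a)
    h_ba = max(min(euclid(a, b) for a in bag_a) for b in bag_b)
    return max(h_ab, h_ba)

def min_hausdorff(bag_a, bag_b):
    """'Minimal' variant (assumed here): distance between the closest pair."""
    return min(euclid(a, b) for a in bag_a for b in bag_b)

def knn_bag_label(query_bag, train_bags, k=3, dist=min_hausdorff):
    """Majority vote among the k nearest training bags.

    train_bags is a list of (bag, is_positive) pairs; ties count as negative.
    """
    nearest = sorted(train_bags, key=lambda item: dist(query_bag, item[0]))[:k]
    votes = sum(1 if label else -1 for _, label in nearest)
    return votes > 0
```

The maximal distance is sensitive to a single outlying instance in either bag, which is one reason a less strict variant can be attractive when only some instances in a positive bag are actually responsible for its label.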
TL;DR: A rigorous definition of the problem is given and the crucial role of prior knowledge is put forward, and the important notion of input-dependent regularization is discussed.
Abstract: In this paper, on the one hand, we aim to give a review on literature dealing with the problem of supervised learning aided by additional unlabeled data. On the other hand, being a part of the author's first year PhD report, the paper serves as a frame to bundle related work by the author as well as numerous suggestions for potential future work. Therefore, this work contains more speculative and partly subjective material than the reader might expect from a literature review. We give a rigorous definition of the problem and relate it to supervised and unsupervised learning. The crucial role of prior knowledge is put forward, and we discuss the important notion of input-dependent regularization. We postulate a number of baseline methods, being algorithms or algorithmic schemes which can more or less straightforwardly be applied to the problem, without the need for genuinely new concepts. However, some of them might serve as basis for a genuine method. In the literature review, we try to cover the wide variety of (recent) work and to classify this work into meaningful categories. We also mention work done on related problems and suggest some ideas towards synthesis. Finally, we discuss some caveats and tradeoffs of central importance to the problem.
TL;DR: An experimental application to network intrusion detection shows that SmartSifter was able to identify data with high scores that corresponded to attacks, with low computational costs.
Abstract: Outlier detection is a fundamental issue in data mining, specifically in fraud detection, network intrusion detection, network monitoring, etc. SmartSifter is an outlier detection engine addressing this problem from the viewpoint of statistical learning theory. This paper provides a theoretical basis for SmartSifter and empirically demonstrates its effectiveness. SmartSifter detects outliers in an on-line process through the on-line unsupervised learning of a probabilistic model (using a finite mixture model) of the information source. Each time a datum is input SmartSifter employs an on-line discounting learning algorithm to learn the probabilistic model. A score is given to the datum based on the learned model with a high score indicating a high possibility of being a statistical outlier. The novel features of SmartSifter are: (1) it is adaptive to non-stationary sources of data; (2) a score has a clear statistical/information-theoretic meaning; (3) it is computationally inexpensive; and (4) it can handle both categorical and continuous variables. An experimental application to network intrusion detection shows that SmartSifter was able to identify data with high scores that corresponded to attacks, with low computational costs. Further experimental application has identified a number of meaningful rare cases in actual health insurance pathology data from Australia's Health Insurance Commission.
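The idea of scoring each datum against a model that is updated on-line with discounting can be illustrated with a deliberately simplified sketch. It assumes a single one-dimensional Gaussian instead of a finite mixture, uses the negative log-likelihood under the model before the update as the score, and uses a hypothetical discount rate `r`; none of these choices is taken from SmartSifter itself.

```python
import math

class DiscountedGaussianScorer:
    """Toy on-line outlier scorer: one Gaussian with exponentially discounted statistics."""

    def __init__(self, r=0.05, mean=0.0, var=1.0):
        self.r = r          # discount rate (assumed name): larger forgets faster
        self.mean = mean
        self.var = var

    def score_and_update(self, x):
        # Score first: negative log-likelihood of x under the current model,
        # so a high score means x was unlikely given what has been seen so far.
        score = 0.5 * math.log(2 * math.pi * self.var) \
            + (x - self.mean) ** 2 / (2 * self.var)
        # Then update with discounting, so older data gradually lose influence
        # and the model can follow a non-stationary source.
        self.mean = (1 - self.r) * self.mean + self.r * x
        self.var = max((1 - self.r) * self.var + self.r * (x - self.mean) ** 2, 1e-6)
        return score
```

Scoring before updating matters: if the model were updated first, an outlier would already have pulled the estimate toward itself and would look less surprising than it was.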
TL;DR: The model is applied in outline fashion to some of the basic phenomena of simple conditioning and, in greater detail, to the phenomena of latent inhibition and perceptual learning.
Abstract: This paper presents a brief, informal outline followed by a formal statement of an elemental associative learning model first described by McLaren, Kaye, and Mackintosh (1989). The model assumes representation of stimuli by sets of elements (i.e., microfeatures) and a set of associative algorithms that incorporate the following: real-time simulation of learning; an error-correcting learning rule; weight decay that distinguishes between transient and permanent associations; and modulation of associative learning that gives high salience to and, hence, promotes rapid learning with novel, unpredicted stimuli and reduces the salience for a stimulus as its error term declines. The model is applied in outline fashion to some of the basic phenomena of simple conditioning and, in greater detail, to the phenomena of latent inhibition and perceptual learning. A detailed account of generalization and discrimination will be provided in a later paper.
TL;DR: Modifications of the Rprop algorithm are introduced that improve its learning speed and the resulting speedup is experimentally shown for a set of neural network learning tasks as well as for artificial error surfaces.
Abstract: The Rprop algorithm proposed by Riedmiller and Braun is one of the best performing first-order learning methods for neural networks. We introduce modifications of the algorithm that improve its learning speed. The resulting speedup is experimentally shown for a set of neural network learning tasks as well as for artificial error surfaces.
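For readers unfamiliar with the baseline, the following is a minimal sketch of a basic sign-based Rprop update without weight-backtracking. It is not the modified algorithm the paper introduces. The function name `rprop_minimize` is hypothetical, and the constants (increase factor 1.2, decrease factor 0.5, step bounds) are commonly quoted defaults assumed here.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def rprop_minimize(grad, w, steps=100, eta_plus=1.2, eta_minus=0.5,
                   delta0=0.1, delta_min=1e-6, delta_max=50.0):
    """Minimize a function given its gradient, using per-weight adaptive step sizes.

    Only the sign of each partial derivative is used; the magnitude is ignored.
    """
    w = list(w)
    delta = [delta0] * len(w)      # one step size per weight
    prev_g = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        for i in range(len(w)):
            agreement = prev_g[i] * g[i]
            if agreement > 0:      # same sign as last time: accelerate
                delta[i] = min(delta[i] * eta_plus, delta_max)
            elif agreement < 0:    # sign flipped: a minimum was overshot, slow down
                delta[i] = max(delta[i] * eta_minus, delta_min)
            w[i] -= sign(g[i]) * delta[i]
            prev_g[i] = g[i]
    return w
```

Because the update ignores gradient magnitude, the badly scaled quadratic `(w0 - 3)^2 + 10 * (w1 + 1)^2` is handled without tuning a global learning rate: `rprop_minimize(lambda w: [2 * (w[0] - 3), 20 * (w[1] + 1)], [0.0, 0.0])` ends close to `[3, -1]`.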
TL;DR: Statistical Learning Theory (SLT) provides a new framework for the general learning problem and a novel, powerful learning method called the Support Vector Machine (SVM), which can solve small-sample learning problems better.
Abstract: Data based machine learning covers a wide range of topics from pattern recognition to function regression and density estimation. Most of the existing methods are based on traditional statistics, which provides conclusions only for the situation where the sample size tends to infinity, so they may not work in practical cases of limited samples. Statistical Learning Theory (SLT) is a small-sample statistics developed by Vapnik et al., which concerns mainly the statistical principles when samples are limited, especially the properties of the learning procedure in such cases. SLT provides a new framework for the general learning problem, and a novel, powerful learning method called the Support Vector Machine (SVM), which can solve small-sample learning problems better. It is believed that the study of SLT and SVM is becoming a new hot area in the field of machine learning. This review introduces the basic ideas of SLT and SVM, their major characteristics and some current research trends.
TL;DR: This work defines principal curves as continuous curves of a given length which minimize the expected squared distance between the curve and points of the space randomly chosen according to a given distribution, making it possible to theoretically analyze principal curve learning from training data and it also leads to a new practical construction.
Abstract: Principal curves have been defined as "self-consistent" smooth curves which pass through the "middle" of a d-dimensional probability distribution or data cloud. They give a summary of the data and also serve as an efficient feature extraction tool. We take a new approach by defining principal curves as continuous curves of a given length which minimize the expected squared distance between the curve and points of the space randomly chosen according to a given distribution. The new definition makes it possible to theoretically analyze principal curve learning from training data and it also leads to a new practical construction. Our theoretical learning scheme chooses a curve from a class of polygonal lines with k segments and with a given total length to minimize the average squared distance over n training points drawn independently. Convergence properties of this learning scheme are analyzed and a practical version of this theoretical algorithm is implemented. In each iteration of the algorithm, a new vertex is added to the polygonal line and the positions of the vertices are updated so that they minimize a penalized squared distance criterion. Simulation results demonstrate that the new algorithm compares favorably with previous methods, both in terms of performance and computational complexity, and is more robust to varying data models.
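The quantity being minimized, the average squared distance from training points to a polygonal line, can be computed as in the sketch below. It covers only the evaluation of the criterion for 2-D points. The vertex-insertion and penalized optimization steps of the algorithm are not shown, and the function names are hypothetical.

```python
import math

def sq_dist_point_segment(p, a, b):
    """Squared distance from 2-D point p to the segment with endpoints a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:
        t = 0.0                                   # degenerate segment: a single point
    else:
        # Project p onto the line through a and b, then clamp to the segment.
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    cx, cy = ax + t * dx, ay + t * dy
    return (px - cx) ** 2 + (py - cy) ** 2

def avg_sq_dist(points, vertices):
    """Average over points of the squared distance to the nearest segment."""
    segments = list(zip(vertices[:-1], vertices[1:]))
    return sum(min(sq_dist_point_segment(p, a, b) for a, b in segments)
               for p in points) / len(points)

def polyline_length(vertices):
    """Total length of the polygonal line, the quantity the definition constrains."""
    return sum(math.dist(a, b) for a, b in zip(vertices[:-1], vertices[1:]))
```

Without the length constraint the criterion could be driven toward zero by a curve that wiggles through every training point, which is why the definition fixes the length in advance.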
TL;DR: Experiments show that landmarking selects, with moderate but reasonable level of success, the best performing of a set of learning algorithms.
Abstract: Landmarking is a novel approach to describing tasks in meta-learning. Previous approaches to meta-learning mostly considered only statistics-inspired measures of the data as a source for the definition of meta-attributes. Contrary to such approaches, landmarking tries to determine the location of a specific learning problem in the space of all learning problems by directly measuring the performance of some simple and efficient learning algorithms themselves. In the experiments reported we show how such a use of landmark values can help to distinguish between areas of the learning space favouring different learners. Experiments, both with artificial and real-world databases, show that landmarking selects, with a moderate but reasonable level of success, the best performing of a set of learning algorithms.
TL;DR: ELSA, an evolutionary local selection algorithm that maintains a diverse population of solutions approximating the Pareto front in a multidimensional objective space, is used for feature selection in unsupervised learning and shows promise in identifying the right features and the correct number of clusters.
Abstract: Feature subset selection is an important problem in knowledge discovery, not only for the insight gained from determining relevant modeling variables but also for the improved understandability, scalability, and possibly, accuracy of the resulting models. In this paper we consider the problem of feature selection for unsupervised learning. A number of heuristic criteria can be used to estimate the quality of clusters built from a given feature subset. Rather than combining such criteria, we use ELSA, an evolutionary local selection algorithm that maintains a diverse population of solutions that approximate the Pareto front in a multidimensional objective space. Each evolved solution represents a feature subset and a number of clusters; a standard K-means algorithm is applied to form the given number of clusters based on the selected features. Preliminary results on both real and synthetic data show promise in finding Pareto-optimal solutions through which we can identify the significant features and the correct number of clusters.
TL;DR: A method to learn heterogeneous models of object classes for visual recognition that automatically identifies distinctive features in the training set and learns the set of model parameters using expectation maximization.
Abstract: We propose a method to learn heterogeneous models of object classes for visual recognition. The training images contain a preponderance of clutter and learning is unsupervised. Our models represent objects as probabilistic constellations of rigid parts (features). The variability within a class is represented by a joint probability density function on the shape of the constellation and the appearance of the parts. Our method automatically identifies distinctive features in the training set. The set of model parameters is then learned using expectation maximization. When trained on different, unlabeled and unsegmented views of a class of objects, each component of the mixture model can adapt to represent a subset of the views. Similarly, different component models can also "specialize" on sub-classes of an object class. Experiments on images of human heads, leaves from different species of trees, and motor-cars demonstrate that the method works well over a wide variety of objects.
TL;DR: A new method of segmentation, called the scale causal multigrid (SCM) algorithm, has been successfully applied to real sonar images and seems to be well suited to the segmentation of very noisy images.
Abstract: This paper is concerned with hierarchical Markov random field (MRF) models and their application to sonar image segmentation. We present an original hierarchical segmentation procedure devoted to images given by a high-resolution sonar. The sonar image is segmented into two kinds of regions: shadow (corresponding to a lack of acoustic reverberation behind each object lying on the sea-bed) and sea-bottom reverberation. The proposed unsupervised scheme takes into account the variety of the laws in the distribution mixture of a sonar image, and it estimates both the parameters of noise distributions and the parameters of the Markovian prior. For the estimation step, we use an iterative technique which combines a maximum likelihood approach (for noise model parameters) with a least-squares method (for MRF-based prior). In order to model more precisely the local and global characteristics of image content at different scales, we introduce a hierarchical model involving a pyramidal label field. It combines coarse-to-fine causal interactions with a spatial neighborhood structure. This new method of segmentation, called the scale causal multigrid (SCM) algorithm, has been successfully applied to real sonar images and seems to be well suited to the segmentation of very noisy images. The experiments reported in this paper demonstrate that the discussed method performs better than other hierarchical schemes for sonar image segmentation.
TL;DR: In this paper, a hierarchical reinforcement learning architecture is proposed to learn a discrete sequence of sub-goals in a low-dimensional state space for achieving the main goal of the task.
TL;DR: A kind of MDP that models the algorithm selection problem by allowing multiple state transitions is introduced, and the well known Q-learning algorithm is adapted for this case in a way that combines both Monte-Carlo and Temporal Difference methods.
Abstract: Many computational problems can be solved by multiple algorithms, with different algorithms fastest for different problem sizes, input distributions, and hardware characteristics. We consider the problem of algorithm selection: dynamically choose an algorithm to attack an instance of a problem with the goal of minimizing the overall execution time. We formulate the problem as a kind of Markov decision process (MDP), and use ideas from reinforcement learning to solve it. This paper introduces a kind of MDP that models the algorithm selection problem by allowing multiple state transitions. The well known Q-learning algorithm is adapted for this case in a way that combines both Monte-Carlo and Temporal Difference methods. Also, this work uses, and extends in a way to control problems, the Least-Squares Temporal Difference algorithm (LSTD(0)) of Boyan. The experimental study focuses on the classic problems of order statistic selection and sorting. The encouraging results reveal the potential of applying learning methods to traditional computational problems.
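As background for the adaptation mentioned above, here is a sketch of the standard tabular Q-learning update. It is the textbook rule, not the paper's multi-transition variant or its LSTD(0) extension. For algorithm selection, one might take states to be problem-instance descriptions, actions to be candidate algorithms, and rewards to be negative execution times; those modelling choices, like the function name `q_update`, are assumptions for the example.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=1.0, terminal=False):
    """Apply one temporal-difference update to Q[(s, a)] and return the new value.

    Q maps (state, action) pairs to value estimates; missing entries default to 0.
    """
    if terminal:
        target = r                         # no future value after a terminal step
    else:
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

# Hypothetical usage: choosing between two sorting routines for a size class.
Q = defaultdict(float)
actions = ["insertion_sort", "quicksort"]
q_update(Q, "n<16", "insertion_sort", r=-0.2, s_next=None,
         actions=actions, terminal=True)
```

A pure Monte-Carlo method would instead wait for the total execution time of a whole run before updating, so the bootstrapped target above is the Temporal Difference half of the combination the abstract refers to.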
TL;DR: An active learning method is presented that uses adaptive resampling in a natural way to significantly reduce the size of the required labeled set and generates a classification model that achieves the high accuracies possible with current adaptive resampling methods.
Abstract: Classification modeling (a.k.a. supervised learning) is an extremely useful analytical technique for developing predictive and forecasting applications. The explosive growth in data warehousing and internet usage has made large amounts of data potentially available for developing classification models. For example, natural language text is widely available in many forms (e.g., electronic mail, news articles, reports, and web page contents). Categorization of data is a common activity which can be automated to a large extent using supervised learning methods. Examples of this include routing of electronic mail, satellite image classification, and character recognition. However, these tasks require labeled data sets of sufficiently high quality with adequate instances for training the predictive models. Much of the on-line data, particularly the unstructured variety (e.g., text), is unlabeled. Labeling is usually an expensive manual process done by domain experts. Active learning is an approach to solving this problem and works by identifying a subset of the data that needs to be labeled and uses this subset to generate classification models. We present an active learning method that uses adaptive resampling in a natural way to significantly reduce the size of the required labeled set and generates a classification model that achieves the high accuracies possible with current adaptive resampling methods.
TL;DR: A nonlinear extension to independent component analysis is developed that avoids problems with overlearning which would otherwise be severe in unsupervised learning with flexible nonlinear models.
Abstract: In this chapter, a non-linear extension to independent component analysis is developed. The non-linear mapping from source signals to observations is modelled by a multi-layer perceptron network and the distributions of source signals are modelled by mixture-of-Gaussians. The observations are assumed to be corrupted by Gaussian noise and therefore the method is more adequately described as non-linear independent factor analysis. The non-linear mapping, the source distributions and the noise level are estimated from the data. A Bayesian approach to learning avoids problems with overlearning which would otherwise be severe in unsupervised learning with flexible non-linear models.
TL;DR: A segmentation method is described for the face skin of people of any race in real time, in an adaptive and unsupervised way, based on a Gaussian model of the skin color, that will be referred to as Unsupervised and Adaptive Gaussian Skin-Color Model, UAGM.
TL;DR: An adaptive self-organizing color segmentation algorithm and a transductive learning algorithm are used to localize the human hand in video sequences; color and motion cues are integrated in the localization system, in which the motion cue is employed to focus the attention of the system.
Abstract: In Proc. Asian Conf. on Computer Vision, Taiwan, 2000. This paper describes an adaptive self-organizing color segmentation algorithm and a transductive learning algorithm used to localize the human hand in video sequences. The color distribution at each time frame is approximated by the proposed 1-D self-organizing map (SOM), in which schemes of growing, pruning and merging are facilitated to find an appropriate number of color clusters automatically. Due to the dynamic backgrounds and changing lighting conditions, the distribution of color over time may not be stationary. An algorithm of SOM transduction is proposed to learn the nonstationary color distribution in HSI color space by combining supervised and unsupervised learning paradigms. Color cue and motion cue are integrated in the localization system, in which the motion cue is employed to focus the attention of the system. This approach is also applied to other tasks such as human face tracking and color indexing. Our localization system, implemented on an SGI O2 R10000 workstation, is reliable and efficient at 20-30Hz.
TL;DR: The main concepts of Statistical Learning Theory are overviewed, a framework in which learning from examples can be studied in a principled way and well known as well as emerging learning techniques such as Regularization Networks and Support Vector Machines are discussed.
Abstract: In this paper we first overview the main concepts of Statistical Learning Theory, a framework in which learning from examples can be studied in a principled way. We then briefly discuss well known as well as emerging learning techniques such as Regularization Networks and Support Vector Machines which can be justified in terms of the same induction principle.
TL;DR: In this paper, an ICA-based approach is proposed for hyperspectral image analysis, which can be viewed as a random version of the commonly used linear spectral mixture analysis, in which the abundance fractions in a linear mixture model are considered to be unknown independent signal sources.
Abstract: In this paper, an ICA-based approach is proposed for hyperspectral image analysis. It can be viewed as a random version of the commonly used linear spectral mixture analysis, in which the abundance fractions in a linear mixture model are considered to be unknown independent signal sources. It does not require the full rank of the separating matrix or orthogonality as most ICA methods do. More importantly, the learning algorithm is designed based on the independency of the material abundance vector rather than the independency of the separating matrix generally used to constrain the standard ICA. As a result, the learning algorithm is able to converge to non-orthogonal independent components. This is particularly useful in hyperspectral image analysis since many materials extracted from a hyperspectral image may have similar spectral signatures and may not be orthogonal. The AVIRIS experiments have demonstrated that the proposed ICA provides an effective unsupervised technique for hyperspectral image classification.
TL;DR: Flexible discriminant and mixture models; neural networks for unsupervised learning based on information theory; radial basis function networks and statistics; robust prediction in many-parameter models; data visualisation.
Abstract: Flexible discriminant and mixture models; Neural networks for unsupervised learning based on information theory; Radial basis function networks and statistics; Robust prediction in many-parameter models; Density networks; Latent variable models and data visualisation; Analysis of latent structure models with multidimensional latent variables; Artificial neural networks and multivariate statistics.
TL;DR: Comparison of three machine learning methods on Chinese text categorization reveals that all three methods produce satisfactory performance on the test corpus while ARAM exhibits a marginally better generalization capability, especially from relatively small and noisy training sets.
Abstract: This paper reports our comparative evaluation of three machine learning methods on Chinese text categorization. Whereas a wide range of methods have been applied to English text categorization, relatively few studies have been done on Chinese text categorization. Based on a People's Daily news corpus, a series of controlled experiments evaluate three machine learning methods, namely k Nearest Neighbor (kNN) algorithm, Support Vector Machines (SVM), and Adaptive Resonance Associative Map (ARAM), in terms of their capabilities in mining categorization knowledge from high dimensional, sparse, and relatively noisy document feature vectors. Experiments reveal that all three methods produce satisfactory performance on the test corpus while ARAM exhibits a marginally better generalization capability, especially from relatively small and noisy training sets.
TL;DR: Analysis of patterns of temporal variation in community dynamics was conducted by combining two unsupervised artificial neural networks, the Adaptive Resonance Theory (ART) and the Kohonen network.
TL;DR: A method to learn object class models from unlabeled and unsegmented cluttered scenes for the purpose of visual object recognition achieves very good classification results on human faces, cars, leaves, handwritten letters, and cartoon characters.
Abstract: A method is presented to learn object class models from unlabeled and unsegmented cluttered scenes for the purpose of visual object recognition. The variability across a class of objects is modeled in a principled way, treating objects as flexible constellations of rigid parts (features). Variability is represented by a joint probability density function (pdf) on the shape of the constellation and the output of part detectors. Corresponding “constellation models” can be learned in a completely unsupervised fashion. In a first stage, the learning method automatically identifies distinctive parts in the training set by applying a clustering algorithm to patterns selected by an interest operator. It then learns the statistical shape model using expectation maximization. Mixtures of constellation models can be defined and applied to “discover” object categories in an unsupervised manner. The method achieves very good classification results on human faces, cars, leaves, handwritten letters, and cartoon characters.
TL;DR: The main claims are that the basic learning and deduction tasks are provably tractable and tractable learning offers viable approaches to a range of issues that have been previously identified as problematic for artificial intelligence systems that are programmed.
Abstract: An architecture is described for designing systems that acquire and manipulate large amounts of unsystematized, or so-called commonsense, knowledge. Its aim is to exploit to the full those aspects of computational learning that are known to offer powerful solutions in the acquisition and maintenance of robust knowledge bases. The architecture makes explicit the requirements on the basic computational tasks that are to be performed and is designed to make this computationally tractable even for very large databases. The main claims are that (i) the basic learning and deduction tasks are provably tractable and (ii) tractable learning offers viable approaches to a range of issues that have been previously identified as problematic for artificial intelligence systems that are programmed. Among the issues that learning offers to resolve are robustness to inconsistencies, robustness to incomplete information and resolving among alternatives. Attribute-efficient learning algorithms, which allow learning from few examples in large dimensional systems, are fundamental to the approach. Underpinning the overall architecture is a new principled approach to manipulating relations in learning systems. This approach, of independently quantified arguments, allows propositional learning algorithms to be applied systematically to learning relational concepts in polynomial time and in modular fashion.
TL;DR: Four algorithms for supervised learning, which belong to different families, are compared in a benchmark corpus for the WSD task and both qualitative and quantitative conclusions are drawn.
Abstract: In this report, some collaborative work between the fields of Machine Learning (ML) and Natural Language Processing (NLP) is presented. The document is structured in two parts. The first part includes a superficial but comprehensive survey covering the state-of-the-art of machine learning techniques applied to natural language learning tasks. In the second part, a particular problem, namely Word Sense Disambiguation (WSD), is studied in more detail. In doing so, four algorithms for supervised learning, which belong to different families, are compared in a benchmark corpus for the WSD task. Both qualitative and quantitative conclusions are drawn.
TL;DR: On a word sense disambiguation test, the model performed better than other state-of-the-art systems for unsupervised learning of selectional preferences; methods for implementing "explaining away" in other graphical frameworks are also discussed.
Abstract: This paper presents a Bayesian model for unsupervised learning of verb selectional preferences. For each verb the model creates a Bayesian network whose architecture is determined by the lexical hierarchy of WordNet and whose parameters are estimated from a list of verb-object pairs found from a corpus. "Explaining away", a well-known property of Bayesian networks, helps the model deal in a natural fashion with word sense ambiguity in the training data. On a word sense disambiguation test our model performed better than other state-of-the-art systems for unsupervised learning of selectional preferences. Computational complexity problems, ways of improving this approach and methods for implementing "explaining away" in other graphical frameworks are discussed.