TL;DR: There are several arguments which support the observed high accuracy of SVMs, which are reviewed and numerous examples and proofs of most of the key theorems are given.
Abstract: The tutorial starts with an overview of the concepts of VC dimension and structural risk minimization. We then describe linear Support Vector Machines (SVMs) for separable and non-separable data, working through a non-trivial example in detail. We describe a mechanical analogy, and discuss when SVM solutions are unique and when they are global. We describe how support vector training can be practically implemented, and discuss in detail the kernel mapping technique which is used to construct SVM solutions which are nonlinear in the data. We show how Support Vector machines can have very large (even infinite) VC dimension by computing the VC dimension for homogeneous polynomial and Gaussian radial basis function kernels. While very high VC dimension would normally bode ill for generalization performance, and while at present there exists no theory which shows that good generalization performance is guaranteed for SVMs, there are several arguments which support the observed high accuracy of SVMs, which we review. Results of some experiments which were inspired by these arguments are also presented. We give numerous examples and proofs of most of the key theorems. There is new material, and I hope that the reader will find that even old material is cast in a fresh light.
TL;DR: This paper explores the use of Support Vector Machines for learning text classifiers from examples and analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task.
Abstract: This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task. Empirical results support the theoretical findings. SVMs achieve substantial improvements over the currently best performing methods and behave robustly over a variety of different learning tasks. Furthermore they are fully automatic, eliminating the need for manual parameter tuning.
TL;DR: A method to construct a decision tree based classifier is proposed that maintains highest accuracy on training data and improves on generalization accuracy as it grows in complexity.
Abstract: Much of previous attention on decision trees focuses on the splitting criteria and optimization of tree sizes. The dilemma between overfitting and achieving maximum accuracy is seldom resolved. A method to construct a decision tree based classifier is proposed that maintains highest accuracy on training data and improves on generalization accuracy as it grows in complexity. The classifier consists of multiple trees constructed systematically by pseudorandomly selecting subsets of components of the feature vector, that is, trees constructed in randomly chosen subspaces. The subspace method is compared to single-tree classifiers and other forest construction methods by experiments on publicly available datasets, where the method's superiority is demonstrated. We also discuss independence between trees in a forest and relate that to the combined classification accuracy.
TL;DR: This issue's collection of essays should help familiarize readers with this interesting new racehorse in the Machine Learning stable, and give a practical guide and a new technique for implementing the algorithm efficiently.
Abstract: My first exposure to Support Vector Machines came this spring when heard Sue Dumais present impressive results on text categorization using this analysis technique. This issue's collection of essays should help familiarize our readers with this interesting new racehorse in the Machine Learning stable. Bernhard Scholkopf, in an introductory overview, points out that a particular advantage of SVMs over other learning algorithms is that it can be analyzed theoretically using concepts from computational learning theory, and at the same time can achieve good performance when applied to real problems. Examples of these real-world applications are provided by Sue Dumais, who describes the aforementioned text-categorization problem, yielding the best results to date on the Reuters collection, and Edgar Osuna, who presents strong results on application to face detection. Our fourth author, John Platt, gives us a practical guide and a new technique for implementing the algorithm efficiently.
TL;DR: The sequential minimal optimization (SMO) algorithm as mentioned in this paper uses a series of smallest possible QP problems to solve a large QP problem, which avoids using a time-consuming numerical QP optimization as an inner loop.
Abstract: This paper proposes a new algorithm for training support vector machines: Sequential Minimal Optimization, or SMO. Training a support vector machine requires the solution of a very large quadratic programming (QP) optimization problem. SMO breaks this large QP problem into a series of smallest possible QP problems. These small QP problems are solved analytically, which avoids using a time-consuming numerical QP optimization as an inner loop. The amount of memory required for SMO is linear in the training set size, which allows SMO to handle very large training sets. Because matrix computation is avoided, SMO scales somewhere between linear and quadratic in the training set size for various test problems, while the standard chunking SVM algorithm scales somewhere between linear and cubic in the training set size. SMO’s computation time is dominated by SVM evaluation, hence SMO is fastest for linear SVMs and sparse data sets. On realworld sparse data sets, SMO can be more than 1000 times faster than the chunking algorithm.
TL;DR: It is shown that techniques used in the analysis of Vapnik's support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error.
Abstract: One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. In this paper, we show that this phenomenon is related to the distribution of margins of the training examples with respect to the generated voting classification rule, where the margin of an example is simply the difference between the number of correct votes and the maximum number of votes received by any incorrect label. We show that techniques used in the analysis of Vapnik's support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error. We also show theoretically and experimentally that boosting is especially effective at increasing the margins of the training examples. Finally, we compare our explanation to those based on the bias-variance decomposition.
TL;DR: The Structural Risk Minimization (SRM) as discussed by the authors principle has been shown to be superior to traditional empirical risk minimization (ERM) principle employed by conventional neural networks, as opposed to ERM which minimizes the error on the training data.
Abstract: The foundations of Support Vector Machines (SVM) have been developed by Vapnik and are gaining popularity due to many attractive features, and promising empirical performance. The formulation embodies the Structural Risk Minimisation (SRM) principle, which in our work has been shown to be superior to traditional Empirical Risk Minimisation (ERM) principle employed by conventional neural networks. SRM minimises an upper bound on the VC dimension (generalisation error), as opposed to ERM which minimises the error on the training data. It is this difference which equips SVMs with a greater ability to generalise, which is our goal in statistical learning. SVMs were developed to solve the classification problem, but recently they have been extended to the domain of regression problems.
TL;DR: A comparison of the effectiveness of five different automatic learning algorithms for text categorization in terms of learning speed, realtime classification speed, and classification accuracy is compared.
Abstract: 1. ABSTRACT Text categorization – the assignment of natural language texts to one or more predefined categories based on their content – is an important component in many information organization and management tasks. We compare the effectiveness of five different automatic learning algorithms for text categorization in terms of learning speed, realtime classification speed, and classification accuracy. We also examine training set size, and alternative document representations. Very accurate text classifiers can be learned automatically from training examples. Linear Support Vector Machines (SVMs) are particularly promising because they are very accurate, quick to train, and quick to evaluate. 1.1
TL;DR: A natural way of achieving this combination by deriving kernel functions for use in discriminative methods such as support vector machines from generative probability models is developed.
Abstract: Generative probability models such as hidden Markov models provide a principled way of treating missing information and dealing with variable length sequences. On the other hand, discriminative methods such as support vector machines enable us to construct flexible decision boundaries and often result in classification performance superior to that of the model based approaches. An ideal classifier should combine these two complementary approaches. In this paper, we develop a natural way of achieving this combination by deriving kernel functions for use in discriminative methods such as support vector machines from generative probability models. We provide a theoretical justification for this combination as well as demonstrate a substantial improvement in the classification performance in the context of DNA and protein sequence analysis.
TL;DR: In this article, the authors discuss a strategy for polychotomous classification that involves estimating class probabilities for each pair of classes, and then coupling the estimates together, similar to the Bradley-Terry method for paired comparisons.
Abstract: We discuss a strategy for polychotomous classification that involves estimating class probabilities for each pair of classes, and then coupling the estimates together. The coupling model is similar to the Bradley-Terry method for paired comparisons. We study the nature of the class probability estimates that arise, and examine the performance of the procedure in real and simulated data sets. Classifiers used include linear discriminants, nearest neighbors, adaptive nonlinear methods and the support vector machine.
TL;DR: Numerical tests on 6 public data sets show that classi ers trained by the concave minimization approach and those trained by a support vector machine have comparable 10fold cross-validation correctness.
Abstract: Computational comparison is made between two feature selection approaches for nding a separating plane that discriminates between two point sets in an n-dimensional feature space that utilizes as few of the n features (dimensions) as possible. In the concave minimization approach [19, 5] a separating plane is generated by minimizing a weighted sum of distances of misclassi ed points to two parallel planes that bound the sets and which determine the separating plane midway between them. Furthermore, the number of dimensions of the space used to determine the plane is minimized. In the support vector machine approach [27, 7, 1, 10, 24, 28], in addition to minimizing the weighted sum of distances of misclassi ed points to the bounding planes, we also maximize the distance between the two bounding planes that generate the separating plane. Computational results show that feature suppression is an indirect consequence of the support vector machine approach when an appropriate norm is used. Numerical tests on 6 public data sets show that classi ers trained by the concave minimization approach and those trained by a support vector machine have comparable 10fold cross-validation correctness. However, in all data sets tested, the classi ers obtained by the concave minimization approach selected fewer problem features than those trained by a support vector machine.
TL;DR: A general S3VM model is proposed that minimizes both the misclassification error and the function capacity based on all the available data that can be converted to a mixed-integer program and then solved exactly using integer programming.
Abstract: We introduce a semi-supervised support vector machine (S3VM) method. Given a training set of labeled data and a working set of unlabeled data, S3VM constructs a support vector machine using both the training and working sets. We use S3VM to solve the transduction problem using overall risk minimization (ORM) posed by Vapnik. The transduction problem is to estimate the value of a classification function at the given points in the working set. This contrasts with the standard inductive learning problem of estimating the classification function at all possible values and then using the fixed function to deduce the classes of the working set data. We propose a general S3VM model that minimizes both the misclassification error and the function capacity based on all the available data. We show how the S3VM model for 1-norm linear support vector machines can be converted to a mixed-integer program and then solved exactly using integer programming. Results of S3VM and the standard 1-norm support vector machine approach are compared on ten data sets. Our computational results support the statistical learning theory results showing that incorporating working data improves generalization when insufficient training information is available. In every case, S3VM either improved or showed no significant difference in generalization compared to the traditional approach.
TL;DR: The proposed system does not require feature extraction and performs recognition on images regarded as points of a space of high dimension without estimating pose, indicating that SVMs are well-suited for aspect-based recognition.
Abstract: Support vector machines (SVMs) have been recently proposed as a new technique for pattern recognition. Intuitively, given a set of points which belong to either of two classes, a linear SVM finds the hyperplane leaving the largest possible fraction of points of the same class on the same side, while maximizing the distance of either class from the hyperplane. The hyperplane is determined by a subset of the points of the two classes, named support vectors, and has a number of interesting theoretical properties. In this paper, we use linear SVMs for 3D object recognition. We illustrate the potential of SVMs on a database of 7200 images of 100 different objects. The proposed system does not require feature extraction and performs recognition on images regarded as points of a space of high dimension without estimating pose. The excellent recognition rates achieved in all the performed experiments indicate that SVMs are well-suited for aspect-based recognition.
TL;DR: For the Support Vector method both the quality of solution and the complexity of the solution does not depend directly on the dimensionality of an input space, and on the basis of this technique one can obtain a good estimate using a given number of high-dimensional data.
Abstract: This chapter describes the Support Vector technique for function estimation problems such as pattern recognition, regression estimation, and solving linear operator equations. It shows that for the Support Vector method both the quality of solution and the complexity of the solution does not depend directly on the dimensionality of an input space. Therefore, on the basis of this technique one can obtain a good estimate using a given number of high-dimensional data.
TL;DR: In this article, a probabilistic classifier (e.g., a support vector machine) trained on prior content classifications is used to classify a message into one of a number of different folders, depicted in a pre-defined visually distinctive manner or simply discarded in its entirety.
Abstract: A technique, specifically a method and apparatus that implements the method, which through a probabilistic classifier (370) and, for a given recipient, detects electronic mail (e-mail) messages, in an incoming message stream, which that recipient is likely to consider "junk". Specifically, the invention discriminates message content for that recipient, through a probabilistic classifier (e.g., a support vector machine) trained on prior content classifications. Through a resulting quantitative probability measure, i.e., an output confidence level, produced by the classifier for each message and subsequently compared against a predefined threshold, that message is classified as either, e.g., spam or legitimate mail, and, e.g., then stored in a corresponding folder (223, 227) for subsequent retrieval by and display to the recipient. Based on the probability measure, the message can alternatively be classified into one of a number of different folders, depicted in a pre-defined visually distinctive manner or simply discarded in its entirety.
TL;DR: A result is presented that allows one to trade off errors on the training sample against improved generalization performance, and a more general result in terms of "luckiness" functions, which provides a quite general way for exploiting serendipitous simplicity in observed data to obtain better prediction accuracy from small training sets.
Abstract: The paper introduces some generalizations of Vapnik's (1982) method of structural risk minimization (SRM). As well as making explicit some of the details on SRM, it provides a result that allows one to trade off errors on the training sample against improved generalization performance. It then considers the more general case when the hierarchy of classes is chosen in response to the data. A result is presented on the generalization performance of classifiers with a "large margin". This theoretically explains the impressive generalization performance of the maximal margin hyperplane algorithm of Vapnik and co-workers (which is the basis for their support vector machines). The paper concludes with a more general result in terms of "luckiness" functions, which provides a quite general way for exploiting serendipitous simplicity in observed data to obtain better prediction accuracy from small training sets. Four examples are given of such functions, including the Vapnik-Chervonenkis (1971) dimension measured on the sample.
TL;DR: If the data are noiseless, the modified version of basis pursuit denoising proposed in this article is equivalent to SVM in the following sense: if applied to the same data set, the two techniques give the same solution, which is obtained by solving the same quadratic programming problem.
Abstract: In the first part of this paper we show a similarity between the principle of Structural Risk Minimization Principle (SRM) (Vapnik, 1982) and the idea of Sparse Approximation, as defined in (Chen, Donoho and Saunders, 1995) and Olshausen and Field (1996). Then we focus on two specific (approximate) implementations of SRM and Sparse Approximation, which have been used to solve the problem of function approximation. For SRM we consider the Support Vector Machine technique proposed by V. Vapnik and his team at AT\&T Bell Labs, and for Sparse Approximation we consider a modification of the Basis Pursuit De-Noising algorithm proposed by Chen, Donoho and Saunders (1995). We show that, under certain conditions, these two techniques are equivalent: they give the same solution and they require the solution of the same quadratic programming problem.
TL;DR: A SVM -based face recognition algorithm that is compared with a principal component analysis (PCA) based algorithm on a difficult set of images from the FERET database and generated a similarity metric between faces that is learned from examples of differences between faces.
Abstract: Face recognition is a K class problem. where K is the number of known individuals; and support vector machines (SVMs) are a binary classification method. By reformulating the face recognition problem and reinterpreting the output of the SVM classifier. we developed a SVM -based face recognition algorithm. The face recognition problem is formulated as a problem in difference space. which models dissimilarities between two facial images. In difference space we formulate face recognition as a two class problem. The classes are: dissimilarities between faces of the same person. and dissimilarities between faces of different people. By modifying the interpretation of the decision surface generated by SVM. we generated a similarity metric between faces that is learned from examples of differences between faces. The SVM-based algorithm is compared with a principal component analysis (PCA) based algorithm on a difficult set of images from the FERET database. Performance was measured for both verification and identification scenarios. The identification performance for SVM is 77-78% versus 54% for PCA. For verification. the equal error rate is 7% for SVM and 13% for PCA.
TL;DR: In this paper, a method for predicting a classification of an object given classifications of the objects in the training set, assuming that the pairs object/classification are generated by an i.i.d. process from a continuous probability distribution.
Abstract: We describe a method for predicting a classification of an object given classifications of the objects in the training set, assuming that the pairs object/classification are generated by an i.i.d. process from a continuous probability distribution. Our method is a modification of Vapnik's support-vector machine; its main novelty is that it gives not only the prediction itself but also a practicable measure of the evidence found in support of that prediction. We also describe a procedure for assigning degrees of confidence to predictions made by the support vector machine. Some experimental results are presented, and possible extensions of the algorithms are discussed.
TL;DR: In this paper, a kernel-based framework for pattern recognition, regression estimation, function approximation, and multiple operator inversion is presented, which adopts a regularization-theoretic framework.
Abstract: We present a kernel-based framework for pattern recognition, regression estimation, function approximation, and multiple operator inversion. Adopting a regularization-theoretic framework, the above are formulated as constrained optimization problems. Previous approaches such as ridge regression, support vector methods, and regularization networks are included as special cases. We show connections between the cost function and some properties up to now believed to apply to support vector machines only. For appropriately chosen cost functions, the optimal solution of all the problems described above can be found by solving a simple quadratic programming problem.
TL;DR: This paper proposes an adaptation of the Adatron algorithm for clas-siication with kernels in high dimensional spaces that can find a solution very rapidly with an exponentially fast rate of convergence towards the optimal solution.
Abstract: Support Vector Machines work by mapping training data for classiication tasks into a high dimensional feature space. In the feature space they then nd a maximal margin hyperplane which separates the data. This hyperplane is usually found using a quadratic programming routine which is computation-ally intensive, and is non trivial to implement. In this paper we propose an adaptation of the Adatron algorithm for clas-siication with kernels in high dimensional spaces. The algorithm is simple and can nd a solution very rapidly with an exponentially fast rate of convergence (in the number of iterations) towards the optimal solution. Experimental results with real and artiicial datasets are provided.
TL;DR: This paper describes an approach for the problem of face pose discrimination using support vector machines (SVM), which achieved perfect accuracy-100%-discriminating between the three possible face poses on unseen test data, using either polynomials of degree 3 or radial basis functions (RBF) as kernel approximation functions.
Abstract: This paper describes an approach for the problem of face pose discrimination using support vector machines (SVM). Face pose discrimination means that one can label the face image as one of several known poses. Face images are drawn from the standard FERET database. The training set consists of 150 images equally distributed among frontal, approximately 33.75/spl deg/ rotated left and right poses, respectively, and the test set consists of 450 images again equally distributed among the three different types of poses. SVM achieved perfect accuracy-100%-discriminating between the three possible face poses on unseen test data, using either polynomials of degree 3 or radial basis functions (RBF) as kernel approximation functions.
TL;DR: It is shown that the combined accuracies follow a trend of increase with increasing number of component classifiers, and that with an appropriate subspace dimensionality, the random subspace method can be superior to simple k-nearest-neighbor classification.
Abstract: Recent studies have shown that the random subspace method can be used to create multiple independent tree-classifiers that can be combined to improve accuracy. We apply the procedure to k-nearest-neighbor classifiers and show that it can achieve similar results. We examine the effects of several parameters of the method by experiments using data from a digit recognition problem. We show that the combined accuracies follow a trend of increase with increasing number of component classifiers, and that with an appropriate subspace dimensionality, the method can be superior to simple k-nearest-neighbor classification, The method's superiority is maintained when smaller number of training prototypes are available, i.e., when conventional knn classifiers suffer most heavily from the curse of dimensionality.
TL;DR: A new algorithm for Support Vector regression that automatically adjusts a flexible tube of minimal radius to the data such that at most a fraction of the data points lie outside.
Abstract: A new algorithm for Support Vector regression is described. For a priori chosen ν, it automatically adjusts a flexible tube of minimal radius to the data such that at most a fraction ν of the data points lie outside. Moreover, it is shown how to use parametric tube shapes with non-constant radius. The algorithm is analysed theoretically and experimentally.
TL;DR: SVM light as mentioned in this paper is an implementation of an SVM learner which addresses the problem of large tasks and provides guidelines for the application of SVMs to large domains, which makes large-scale SVM training more practical.
Abstract: Training a support vector machine SVM leads to a quadratic optimization problem with bound constraints and one linear equality constraint. Despite the fact that this type of problem is well understood, there are many issues to be considered in designing an SVM learner. In particular, for large learning tasks with many training examples on the shelf optimization techniques for general quadratic programs quickly become intractable in their memory and time requirements. SVM light is an implementation of an SVM learner which addresses the problem of large tasks. This chapter presents algorithmic and computational results developed for SVM light V 2.0, which make large-scale SVM training more practical. The results give guidelines for the application of SVMs to large domains.
TL;DR: In this procedure model selection and learning are not separate, but kernels are dynamically adjusted during the learning process to find the kernel parameter which provides the best possible upper bound on the generalisation error.
Abstract: The kernel-parameter is one of the few tunable parameters in Support Vector machines, controlling the complexity of the resulting hypothesis. Its choice amounts to model selection and its value is usually found by means of a validation set. We present an algorithm which can automatically perform model selection with little additional computational cost and with no need of a validation set. In this procedure model selection and learning are not separate, but kernels are dynamically adjusted during the learning process to find the kernel parameter which provides the best possible upper bound on the generalisation error. Theoretical results motivating the approach and experimental results confirming its validity are presented.
TL;DR: In this paper, a space vector modulation (SVM)-based hysteresis current controller (HCC) for squirrel cage induction motors is proposed. And the controller determines a set of space vectors from a region detector and applies a vector selected according to the main HCC.
Abstract: In this paper, a novel space vector modulation (SVM)-based hysteresis current controller (HCC) for squirrel cage induction motors is proposed. This technique utilizes all advantages of the HCC and SVM technique. The controller determines a set of space vectors from a region detector and applies a space vector selected according to the main HCC. A set of space vectors including the zero vector to reduce the number of switchings is determined from the sign of the output frequency and output signals of three comparators with a little larger hysteresis band than that of the main HCC. A simple hardware implementation is proposed and experimental results of the SVM-based HCC are also shown.
TL;DR: A new algorithm for Support Vector regression is proposed that automatically adjusts a flexible tube of minimal radius to the data such that at most a fraction of the data points lie outside.
Abstract: A new algorithm for Support Vector regression is proposed. For a priori chosen ν, it automatically adjusts a flexible tube of minimal radius to the data such that at most a fraction ν of the data points lie outside. The algorithm is analysed theoretically and experimentally.
TL;DR: This work has shown that the reconstruction of patterns from their largest nonlinear principal components, a technique which is common practice in linear principal component analysis, can be performed using a kernel without explicitly working in F.
Abstract: Algorithms based on Mercer kernels construct their solutions in terms of expansions in a high-dimensional feature space F. Previous work has shown that all algorithms which can be formulated in terms of dot products in F can be performed using a kernel without explicitly working in F. The list of such algorithms includes support vector machines and nonlinear kernel principal component extraction. So far, however, it did not include the reconstruction of patterns from their largest nonlinear principal components, a technique which is common practice in linear principal component analysis.