TL;DR: It is shown that techniques used in the analysis of Vapnik's support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error.
Abstract: One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. In this paper, we show that this phenomenon is related to the distribution of margins of the training examples with respect to the generated voting classification rule, where the margin of an example is simply the difference between the number of correct votes and the maximum number of votes received by any incorrect label. We show that techniques used in the analysis of Vapnik's support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error. We also show theoretically and experimentally that boosting is especially effective at increasing the margins of the training examples. Finally, we compare our explanation to those based on the bias-variance decomposition.
TL;DR: In this paper, the authors show that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero.
Abstract: One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. In this paper, we show that this phenomenon is related to the distribution of margins of the training examples with respect to the generated voting classification rule, where the margin of an example is simply the difference between the number of correct votes and the maximum number of votes received by any incorrect label. We show that techniques used in the analysis of Vapnik's support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error. We also show theoretically and experimentally that boosting is especially effective at increasing the margins of the training examples. Finally, we compare our explanation to those based on the bias-variance
TL;DR: This paper proposes a simple yet effective method for detecting any abnormal samples, which is applicable to any pre-trained softmax neural classifier, and obtains the class conditional Gaussian distributions with respect to (low- and upper-level) features of the deep models under Gaussian discriminant analysis.
Abstract: Detecting test samples drawn sufficiently far away from the training distribution statistically or adversarially is a fundamental requirement for deploying a good classifier in many real-world machine learning applications. However, deep neural networks with the softmax classifier are known to produce highly overconfident posterior distributions even for such abnormal samples. In this paper, we propose a simple yet effective method for detecting any abnormal samples, which is applicable to any pre-trained softmax neural classifier. We obtain the class conditional Gaussian distributions with respect to (low- and upper-level) features of the deep models under Gaussian discriminant analysis, which result in a confidence score based on the Mahalanobis distance. While most prior methods have been evaluated for detecting either out-of-distribution or adversarial samples, but not both, the proposed method achieves the state-of-the-art performances for both cases in our experiments. Moreover, we found that our proposed method is more robust in harsh cases, e.g., when the training dataset has noisy labels or small number of samples. Finally, we show that the proposed method enjoys broader usage by applying it to class-incremental learning: whenever out-of-distribution samples are detected, our classification rule can incorporate new classes well without further training deep models.
TL;DR: One such classification rule is described which makes use of a neighbor weighting function for the purpose of assigning a class to an unclassified sample.
Abstract: Among the simplest and most intuitively appealing classes of nonprobabilistic classification procedures are those that weight the evidence of nearby sample observations most heavily. More specifically, one might wish to weight the evidence of a neighbor close to an unclassified observation more heavily than the evidence of another neighbor which is at a greater distance from the unclassified observation. One such classification rule is described which makes use of a neighbor weighting function for the purpose of assigning a class to an unclassified sample. The admissibility of such a rule is also considered.
TL;DR: In this article, it was shown that the bias and variance components of the estimation error combine to influence classification in a very different way than with squared error on the probabilities themselves, and that certain types of (very high) bias can be canceled by low variance to produce accurate classification.
Abstract: The classification problem is considered in which an output variable y assumes discrete values with respective probabilities that depend upon the simultaneous values of a set of input variables x = {x_1,....,x_n}. At issue is how error in the estimates of these probabilities affects classification error when the estimates are used in a classification rule. These effects are seen to be somewhat counter intuitive in both their strength and nature. In particular the bias and variance components of the estimation error combine to influence classification in a very different way than with squared error on the probabilities themselves. Certain types of (very high) bias can be canceled by low variance to produce accurate classification. This can dramatically mitigate the effect of the bias associated with some simple estimators like “naive” Bayes, and the bias induced by the curse-of-dimensionality on nearest-neighbor procedures. This helps explain why such simple methods are often competitive with and sometimes superior to more sophisticated ones for classification, and why “bagging/aggregating” classifiers can often improve accuracy. These results also suggest simple modifications to these procedures that can (sometimes dramatically) further improve their classification performance.