TL;DR: The MSVM is proposed, which extends the binary SVM to the multicategory case and has good theoretical properties, and an approximate leave-one-out cross-validation function is derived, analogous to the binary case.
Abstract: Two-category support vector machines (SVM) have been very popular in the machine learning community for classification problems. Solving multicategory problems by a series of binary classifiers is quite common in the SVM paradigm; however, this approach may fail under various circumstances. We propose the multicategory support vector machine (MSVM), which extends the binary SVM to the multicategory case and has good theoretical properties. The proposed method provides a unifying framework when there are either equal or unequal misclassification costs. As a tuning criterion for the MSVM, an approximate leave-one-out cross-validation function, called Generalized Approximate Cross Validation, is derived, analogous to the binary case. The effectiveness of the MSVM is demonstrated through applications to cancer classification using microarray data and cloud classification with satellite radiance profiles.
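For orientation, the quantity that an approximate leave-one-out criterion targets can be computed exactly on a toy problem by refitting once per held-out point. The sketch below does this with a 1-nearest-neighbour rule standing in for the classifier; it is not the MSVM and not the Generalized Approximate Cross Validation formula itself, and the data and function names are hypothetical.

```python
def nearest_neighbor_label(train, x):
    """Stand-in classifier: label of the closest training point (1-D inputs)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def leave_one_out_errors(data):
    """Count points misclassified when each one is held out in turn."""
    errors = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # training set without point i
        if nearest_neighbor_label(rest, x) != label:
            errors += 1
    return errors


# Hypothetical three-class toy data: (feature value, class label).
data = [(0.0, "a"), (0.1, "a"), (0.8, "a"),
        (1.0, "b"), (1.1, "b"),
        (2.0, "c"), (2.1, "c")]

loo_errors = leave_one_out_errors(data)
loo_rate = loo_errors / len(data)
```

Exact leave-one-out needs one refit per observation, which is why a closed-form approximation is attractive as a tuning criterion when each fit is expensive.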
TL;DR: A software system GEMS (Gene Expression Model Selector) that automates high-quality model construction and enforces sound optimization and performance estimation procedures is developed, the first such system to be informed by a rigorous comparative analysis of the available algorithms and datasets.
Abstract: Motivation: Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology. We are seeking to develop a computer system for powerful and reliable cancer diagnostic model creation based on microarray data. To keep a realistic perspective on clinical applications we focus on multicategory diagnosis. To equip the system with the optimum combination of classifier, gene selection and cross-validation methods, we performed a systematic and comprehensive evaluation of several major algorithms for multicategory classification, several gene selection methods, multiple ensemble classifier methods and two cross-validation designs using 11 datasets spanning 74 diagnostic categories, 41 cancer types and 12 normal tissue types.
Results: Multicategory support vector machines (MC-SVMs) are the most effective classifiers in performing accurate cancer diagnosis from gene expression data. The MC-SVM techniques by Crammer and Singer, Weston and Watkins and one-versus-rest were found to be the best methods in this domain. MC-SVMs outperform other popular machine learning algorithms, such as k-nearest neighbors, backpropagation and probabilistic neural networks, often to a remarkable degree. Gene selection techniques can significantly improve the classification performance of both MC-SVMs and other non-SVM learning algorithms. Ensemble classifiers do not generally improve performance of the best non-ensemble models. These results guided the construction of a software system GEMS (Gene Expression Model Selector) that automates high-quality model construction and enforces sound optimization and performance estimation procedures. This is the first such system to be informed by a rigorous comparative analysis of the available algorithms and datasets.
Availability: The software system GEMS is available for download from http://www.gems-system.org for non-commercial use.
Contact: alexander.statnikov@vanderbilt.edu
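The one-versus-rest scheme named in the Results can be sketched as follows, assuming each class already has a trained binary classifier that returns a real-valued decision score. The class names and score values are hypothetical, and this is a generic illustration rather than the GEMS implementation.

```python
def one_vs_rest_predict(scores):
    """Pick the class whose 'this class vs. the rest' score is largest.

    scores: dict mapping class name -> binary decision value f_k(x).
    """
    return max(scores, key=scores.get)


def hard_votes(scores):
    """Classes whose binary classifier claims the sample (score > 0)."""
    return [k for k, s in scores.items() if s > 0]


# Hypothetical decision values for two samples and three tumour classes.
no_claim = {"A": -0.2, "B": -0.5, "C": -0.9}   # every classifier says "rest"
two_claims = {"A": 0.3, "B": 0.7, "C": -1.0}   # two classifiers say "mine"

pred_no_claim = one_vs_rest_predict(no_claim)
pred_two_claims = one_vs_rest_predict(two_claims)
```

Taking the largest score resolves both cases, whereas thresholded votes alone would leave the first sample unassigned and the second one claimed twice.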
TL;DR: In this article, the authors explore the use of the zero-norm of the parameters of linear models in learning and derive a simple but practical method for variable or feature selection, minimizing training error and ensuring sparsity in solutions.
Abstract: We explore the use of the so-called zero-norm of the parameters of linear models in learning. Minimization of such a quantity has many uses in a machine learning context: for variable or feature selection, minimizing training error and ensuring sparsity in solutions. We derive a simple but practical method for achieving these goals and discuss its relationship to existing techniques of minimizing the zero-norm. The method boils down to implementing a simple modification of vanilla SVM, namely via an iterative multiplicative rescaling of the training data. Applications we investigate which aid our discussion include variable and feature selection on biological microarray data, and multicategory classification.
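The iterative multiplicative rescaling described above can be sketched as a loop: train a linear classifier, multiply each feature by the magnitude of its weight, and repeat, so that features with small weights shrink toward zero. This is a minimal sketch under stated assumptions: the trainer is a plain full-batch subgradient descent on the regularised hinge loss without a bias term (a stand-in for a vanilla SVM solver), and the data, hyperparameters and function names are hypothetical.

```python
def train_linear_hinge(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the L2-regularised hinge loss (no bias)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1.0:  # point violates the margin
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def zero_norm_rescaling(X, y, rounds=3):
    """Return cumulative per-feature scale factors after repeated rescaling."""
    scales = [1.0] * len(X[0])
    for _ in range(rounds):
        # Rescale the training data by the current factors, then retrain.
        Xs = [[s * xj for s, xj in zip(scales, xi)] for xi in X]
        w = train_linear_hinge(Xs, y)
        scales = [s * abs(wj) for s, wj in zip(scales, w)]
    return scales


# Hypothetical toy data: feature 0 separates the classes, feature 1 is
# only weakly related to the label.
X = [[2.0, 0.5], [2.0, 0.5], [-2.0, -0.5], [-2.0, 0.5]]
y = [1, 1, -1, -1]
scales = zero_norm_rescaling(X, y)
```

After a few rounds the scale factor of the weak feature is orders of magnitude smaller than that of the informative one, so thresholding the factors gives a sparse feature subset.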
TL;DR: In this paper, a multivariate probit model is proposed to predict multicategory choice in a variety of contexts such as choice of multiple categories during a shopping trip or mail-order purchasing.
Abstract: Consumers make multicategory decisions in a variety of contexts such as choice of multiple categories during a shopping trip or mail-order purchasing. The choice of one category may affect the selection of another category due to the complementary nature (e.g., cake mix and cake frosting) of the two categories. Alternatively, two categories may co-occur in a shopping basket not because they are complementary but because of similar purchase cycles (e.g., beer and diapers) or because of a host of other unobserved factors. While complementarity gives managers some control over consumers' buying behavior (e.g., a change in the price of cake mix could change the purchase probability of cake frosting), co-occurrence or co-incidence is less controllable. Other factors that may affect multi-category choice may be (unobserved) household preferences or (observed) household demographics. We also argue that not accounting for these three factors simultaneously could lead to erroneous inferences. We then develop a conceptual framework that incorporates complementarity, co-incidence and heterogeneity (both observed and unobserved) as the factors that could lead to multi-category choice. We then translate this framework into a model of multi-category choice. Our model is based on random utility theory and allows for simultaneous, interdependent choice of many items. This model, the multivariate probit model, is implemented in a Hierarchical Bayes framework. The hierarchy consists of three levels. The first level captures the choice of items for the shopping basket during a shopping trip. The second level captures differences across households and the third level specifies the priors for the unknown parameters. We generalize some recent advances in Markov chain Monte Carlo methods in order to estimate the model. Specifically, we use a substitution sampler which incorporates techniques such as the Metropolis Hit-and-Run algorithm and the Gibbs Sampler. 
The model is estimated on four categories (cake mix, cake frosting, fabric detergent and fabric softener) using multicategory panel data. The results disentangle the complementarity and co-incidence effects. The complementarity results show that pricing and promotional changes in one category affect purchase incidence in related product categories. In general, the cross-price and cross-promotion effects are smaller than the own-price and own-promotion effects. The cross-effects are also asymmetric across pairs of categories, i.e., related category pairs may be characterized as having a "primary" and a "secondary" category. Thus these results provide a more complete description of the effects of promotional changes by examining them both within and across categories. The co-incidence results show the extent of the relationship between categories that arises from uncontrollable and unobserved factors. These results are useful since they provide insights into a general structure of dependence relationships across categories. The heterogeneity results show that observed demographic factors such as family size influence the intrinsic category preference of households. Larger family sizes also tend to make households more price sensitive for both the primary and secondary categories. We find that price sensitivities across categories are not highly correlated at the household level. We also find some evidence that intrinsic preferences for cake mix and cake frosting are more closely related than preferences for fabric detergent and fabric softener. We compare our model with a series of null models using both estimation and holdout samples. We show that both complementarity and co-incidence play a significant role in predicting multicategory choice. We also show how many single-category models used in conjunction may not be good predictors of joint choice.
Our results are likely to be of interest to retailers and manufacturers trying to optimize pricing and promotion strategies across many categories as well as in designing micromarketing strategies. We illustrate some of these benefits by carrying out an analysis which shows that the "true" impact of complementarity and co-incidence on profitability is significant in a retail setting. Our model can also be applied to other domains. The combination of item interdependence and individual household level estimates may be of particular interest to database marketers in building customized "cross-selling" strategies in the direct mail and financial service industries.
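The random-utility idea behind a multivariate probit can be illustrated with a two-category simulation: a category enters the basket when its latent utility is positive, and correlation between the latent errors makes joint purchases more frequent even when the deterministic utilities are unchanged. This is only a sketch under assumptions not taken from the abstract: the error correlation `rho` is used as a stand-in for co-incidence, the utilities are fixed at hypothetical values, and no hierarchical Bayes estimation is attempted.

```python
import math
import random


def simulate_baskets(v, rho, n_trips, seed=0):
    """Simulate purchase incidence in two categories under a bivariate probit.

    v   : deterministic utilities (v1, v2), hypothetical values
    rho : correlation between the two latent normal errors
    """
    rng = random.Random(seed)
    baskets = []
    for _ in range(n_trips):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        e1 = z1
        e2 = rho * z1 + math.sqrt(1.0 - rho * rho) * z2  # corr(e1, e2) = rho
        # A category is bought when its latent utility is positive.
        baskets.append((int(v[0] + e1 > 0), int(v[1] + e2 > 0)))
    return baskets


def joint_rate(baskets):
    """Share of trips on which both categories are bought."""
    return sum(a and b for a, b in baskets) / len(baskets)


independent = joint_rate(simulate_baskets((0.0, 0.0), rho=0.0, n_trips=20000))
correlated = joint_rate(simulate_baskets((0.0, 0.0), rho=0.9, n_trips=20000))
```

Two separate single-category probits would reproduce the marginal purchase rates in both runs but could not distinguish them, which is the kind of joint-choice information the abstract says single-category models miss.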
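TL;DR: In this paper, the authors propose new tests of independence for discrete multicategory variables, based on canonical correlations from dynamically augmented reduced rank regressions. The tests allow for an arbitrary number of categories and multiway tables of arbitrary dimension, are robust to serial dependence in the form of finite-order Markov processes, and include tests of joint and marginal independence for three-way or higher order tables.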
Abstract: The contingency table literature on tests for dependence among discrete multicategory variables is extensive. Standard tests assume, however, that draws are independent and only limited results exist on the effect of serial dependency—a problem that is important in areas such as economics, finance, medical trials, and meteorology. This article proposes new tests of independence based on canonical correlations from dynamically augmented reduced rank regressions. The tests allow for an arbitrary number of categories as well as multiway tables of arbitrary dimension and are robust in the presence of serial dependencies that take the form of finite-order Markov processes. For three-way or higher order tables we propose new tests of joint and marginal independence. Monte Carlo experiments show that the proposed tests have good finite sample properties. An empirical application to microeconomic survey data on firms' forecasts of changes to their production and prices demonstrates the importance of correcting fo...