Open Access•Book Chapter•10.1007/978-3-662-53580-6_7

Text Classification Using Anti-Bayesian Quantile Statistics-Based Classifiers

- 01 Sep 2016

- Vol. 25, pp 101-126

TL;DR: The paper demonstrates the power and potential of CMQS to describe the very high-dimensional Text Classification (TC)-related vector spaces in terms of a limited number of "outlier-based" statistics.

Abstract: The problem of Text Classification (TC) has been studied for decades. It is particularly interesting because the features are derived from syntactic or semantic indicators, while the classification, in and of itself, is based on statistical Pattern Recognition (PR) strategies. Thus, all the recorded TC schemes work using the fundamental paradigm that once the statistical features are inferred from the syntactic/semantic indicators, the classifiers themselves are the well-established ones, such as the Bayesian, the Naive Bayesian, and the SVM, and those that are neural or fuzzy. In this paper, we shall demonstrate that by virtue of the skewed distributions of the features, one could advantageously work with information latent in certain "non-central" quantiles (i.e., those distant from the mean) of the distributions. We, indeed, demonstrate that such classifiers exist and are attainable, and show that the design and implementation of such schemes work with the recently-introduced paradigm of Quantile Statistics (QS)-based classifiers. These classifiers, referred to as Classification by Moments of Quantile Statistics (CMQS), are essentially "Anti"-Bayesian in their modus operandi. To achieve our goal, in this paper we demonstrate the power and potential of CMQS to describe the very high-dimensional TC-related vector spaces in terms of a limited number of "outlier-based" statistics. Thereafter, the PR task in classification invokes the CMQS classifier for the underlying multi-class problem by using a linear number of pair-wise CMQS-based classifiers. Through rigorous testing on the standard 20-Newsgroups corpus, we show that CMQS-based TC attains accuracy that is comparable to the best-reported classifiers. We also discuss the potential of fusing the results of a CMQS-based methodology with those obtained from a more traditional scheme.

Footnote: The foundational properties of CMQS for generic and some straightforward distributions were initially described in [17]. Their properties for uni-dimensional distributions of the exponential family are included in [9], and for multi-dimensional distributions in [18]. The authors of [17], [9] and [18] had initially proposed their results as being based on the Order-Statistics of the distributions. This was later corrected in [19], where they showed that their results were rather based on their Quantile Statistics.
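To make the idea of classifying with "non-central" quantiles concrete, here is a minimal one-dimensional, two-class sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `cmqs_points` and `classify_1d`, the quantile level `q = 1/3`, and the choice to compare a test value against the quantile of each class that faces the other class are all assumptions made for this example. The abstract does not specify these details.

```python
import numpy as np


def cmqs_points(samples, q=1 / 3):
    """Return the lower (q) and upper (1 - q) quantile points of one class.

    Hypothetical helper: the level q = 1/3 is an assumption for illustration.
    """
    samples = np.asarray(samples, dtype=float)
    return np.quantile(samples, q), np.quantile(samples, 1.0 - q)


def classify_1d(x, class_a, class_b, q=1 / 3):
    """Assign x to "a" or "b" by its distance to non-central quantile points.

    Instead of comparing x to the class means, compare it to the upper
    quantile point of the lower class and the lower quantile point of the
    higher class (assumed rule for this sketch).
    """
    a_low, a_high = cmqs_points(class_a, q)
    b_low, b_high = cmqs_points(class_b, q)
    # Order the classes by median so "lower" and "higher" are well defined.
    if np.median(class_a) <= np.median(class_b):
        ref_a, ref_b = a_high, b_low
    else:
        ref_a, ref_b = a_low, b_high
    return "a" if abs(x - ref_a) <= abs(x - ref_b) else "b"


if __name__ == "__main__":
    lower_class = np.arange(0, 10)    # toy training values for class "a"
    higher_class = np.arange(20, 30)  # toy training values for class "b"
    for value in (5, 14, 15, 25):
        print(value, classify_1d(value, lower_class, higher_class))
```

For a multi-class problem such as the one described above, a rule of this kind would be applied to pairs of classes and the pairwise decisions combined. How the pairs are chosen and combined is not stated in the abstract, so it is left out of the sketch.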


Most frequently asked questions

1. What are the contributions of "Text Classification Using “Anti”-Bayesian Quantile Statistics-Based Classifiers"?

The problem of Text Classification (TC) has been studied for decades, and it is particularly interesting because the features are derived from syntactic or semantic indicators, while the classification, in and of itself, is based on statistical Pattern Recognition (PR) strategies. In this paper, the authors demonstrate that by virtue of the skewed distributions of the features, one could advantageously work with information latent in certain “non-central” quantiles (i.e., those distant from the mean) of the distributions. The authors demonstrate that such classifiers exist and are attainable, and show that the design and implementation of such schemes work with the recently-introduced paradigm of Quantile Statistics (QS)-based classifiers. To achieve their goal, the authors demonstrate the power and potential of CMQS to describe the very high-dimensional TC-related vector spaces in terms of a limited number of “outlier-based” statistics. Through rigorous testing on the standard 20-Newsgroups corpus, the authors show that CMQS-based TC attains accuracy that is comparable to the best-reported classifiers.

The authors are grateful for the partial support provided by NSERC, the Natural Sciences and Engineering Research Council of Canada. A preliminary version of this paper was presented at ICCCI ’15, the 2015 International Conference on Computational Collective Intelligence Technologies and Applications, in Madrid, Spain, in September 2015. The paper was a Plenary/Keynote Talk at the conference. The authors of [17], [9] and [18] had initially proposed their results as being based on the Order-Statistics of the distributions; this was later corrected in [19], which showed that the results were rather based on Quantile Statistics.

Table 1: The topics from the “20-Newsgroups” used in the experiments.

Table 5: The correlation between the different classifiers for the 100 classifications achieved. In the case of the “Anti”-Bayesian scheme, the method used the TFIDF features.

Figure 4: Plots of the correlation between the different classifiers for the 100 classifications achieved. In the case of the “Anti”-Bayesian scheme, the method used the TFIDF features.

Figure 5: The correlation between the two “Anti”-Bayesian classifiers for the 100 classifications when they utilized the TF and the TFIDF features respectively.

Figure 1: Example of the QS-based features extracted from the histogram of a lower class (light grey) and of a higher class (dark grey), and the corresponding lower and higher CMQS points of each class.

Table 3: The correlation between the different classifiers for the 100 classifications achieved. In the case of the “Anti”-Bayesian scheme, the method used the TF features.

References

•Book

Introduction to Modern Information Retrieval

Gerard Salton, +1 more

- 01 Jan 1983



12.6K

•Journal Article•10.1016/0306-4573(88)90021-0

Term Weighting Approaches in Automatic Text Retrieval

Gerard Salton, +1 more

- 01 Aug 1988

- Information Processing and Management

TL;DR: This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.


10.5K

•Journal Article•10.1145/505282.505283

Machine learning in automated text categorization

Fabrizio Sebastiani

- 01 Mar 2002

- ACM Computing Surveys

TL;DR: This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.


8.5K

•Journal Article•10.1145/361219.361220

A vector space model for automatic indexing

Gerard Salton, +2 more

- 01 Nov 1975

- Communications of The ACM

TL;DR: An approach based on space density computations is used to choose an optimum indexing vocabulary for a collection of documents, demonstrating the usefulness of the model.


7.9K

•Journal Article•10.1016/0306-4573(83)90062-6

Introduction to modern information retrieval: G. Salton and M. McGill. McGraw-Hill, New York (1983). xv + 448 pp., $32.95 ISBN 0-07-054484-0

Martin Dillon

- 01 Jan 1983

- Information Processing and Management

5.4K


Related Papers (5)

Text Classification Using Novel “Anti-Bayesian” Techniques

On some classifiers based on multivariate ranks

A Dynamic Logistic Model for Combining Classifier Outputs

Information theory and classification error in probabilistic classifiers

Depth-weighted Bayes classification