Large-Scale Bayesian Logistic Regression for Text Categorization

Question

1. What contributions do the authors describe in the paper "Large-Scale Bayesian Logistic Regression for Text Categorization"?

2. What are the two state-of-the-art approaches used as benchmarks?

3. How many documents were used in the first experiment?

4. How many times did the authors test the Laplace hyperparameter λj?

Accepted Answer

The authors present a simple Bayesian logistic regression approach that uses a Laplace prior to avoid overfitting and produces sparse predictive models for text data. The authors apply this approach to a range of document classification problems and show that it produces compact predictive models at least as effective as those produced by support vector machine classifiers or ridge logistic regression combined with feature selection. The authors describe their model fitting algorithm, their open source implementations (BBR and BMR), and experimental results.

Accepted Answer

The authors also discuss two state-of-the-art text categorization approaches used as benchmarks: support vector machines (SVMs), and ridge logistic regression combined with feature selection.

Accepted Answer

For efficiency, the authors took a fixed, random, roughly 10% subset (77,993 documents) of the test documents as their test set in all experiments.

Accepted Answer

In each run the authors tested values for the Laplace hyperparameter λj from the range 0.01–316 by multiples of √10, or values for the Gaussian hyperparameter τj from the range 0.0001–10,000 by multiples of 10.
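The two grids in this answer can be enumerated directly. The following is a minimal sketch, not the authors' code: the helper name `geometric_grid` is hypothetical, and the value counts (10 and 9) are inferred from the stated endpoints, reading the upper Laplace bound of 316 as 0.01·(√10)^9 ≈ 316.2.

```python
import math

def geometric_grid(start, factor, count):
    """Return `count` values: start, start*factor, start*factor**2, ..."""
    return [start * factor ** k for k in range(count)]

# Laplace hyperparameter: 0.01 up to about 316 in multiples of sqrt(10).
laplace_grid = geometric_grid(0.01, math.sqrt(10), 10)

# Gaussian hyperparameter: 0.0001 up to 10,000 in multiples of 10.
gaussian_grid = geometric_grid(0.0001, 10.0, 9)

print(len(laplace_grid), len(gaussian_grid))
```

Under that reading, each run covers 10 candidate Laplace values or 9 candidate Gaussian values.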

Accepted Answer

The $\beta_j^{\text{new}}$ that gives the minimum value of g(·) does not have a closed form, so an optimization procedure must be used even for this one-dimensional problem.
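To illustrate what a one-dimensional search of this kind can look like, here is a hedged sketch. It is not the procedure the authors use (this answer does not describe it); it swaps in a generic golden-section search. The objective built by `make_g` is an assumption: the negative log posterior under a Gaussian prior, viewed as a function of a single coefficient z with all other coefficients held fixed. The names `make_g`, `golden_section_min`, and the search bracket are hypothetical.

```python
import math

def softplus(t):
    """Numerically stable ln(1 + exp(t))."""
    return t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))

def make_g(j, beta, X, y, tau_j):
    """Assumed objective: terms of the negative log posterior that depend
    on coordinate j, as a function of its candidate value z (Gaussian prior)."""
    def g(z):
        loss = 0.0
        for x_i, y_i in zip(X, y):
            # Score with beta[j] replaced by z; labels y_i are in {-1, +1}.
            score = sum(b * v for b, v in zip(beta, x_i)) + (z - beta[j]) * x_i[j]
            loss += softplus(-score * y_i)
        return loss + z * z / (2 * tau_j)
    return g

def golden_section_min(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function f on the bracket [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d
            d = a + inv_phi * (b - a)
    return (a + b) / 2

# Toy data (made up for illustration): one feature, two examples.
g = make_g(0, [0.0], [[1.0], [1.0]], [1, -1], tau_j=1.0)
beta_new = golden_section_min(g, -5.0, 5.0)
print(beta_new)
```

Golden-section search only needs function values and works here because the assumed g is convex in z; a Newton-style step would use derivatives and typically converge in fewer evaluations.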

Accepted Answer

To test the significance of differences among the algorithms, the authors looked at the difference in per-category F1 values between pairs of algorithms and applied the two-tailed Wilcoxon matched-pairs signed-ranks test (Table 2).

Accepted Answer

The authors used the LYRL2004 training/test split (Lewis et al. 2004) of RCV1-v2, which contains 23,149 training documents and 781,265 test documents.

Accepted Answer

Computing the maximum likelihood fit of a logistic regression model on these data sets is often impossible, since standard software relies on matrix inversion.

Accepted Answer

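The posterior density for β with the logistic link on data set D is

\[
L(\beta) = p(\beta \mid D) \propto \left( \prod_{i=1}^{n} \frac{1}{1 + \exp(-\beta^{T} x_i y_i)} \right) p(\beta), \tag{7}
\]

where p(β) is the prior on β and i indexes the training examples in D. For Gaussian priors with mean 0 and variance $\tau_j$ on the $\beta_j$'s, the log posterior (ignoring the normalizing constant) is given by

\[
l(\beta) = -\sum_{i=1}^{n} \ln\left( 1 + \exp(-\beta^{T} x_i y_i) \right) - \sum_{j=1}^{d} \left( \ln\sqrt{\tau_j} + \frac{\ln 2\pi}{2} + \frac{\beta_j^{2}}{2\tau_j} \right), \tag{8}
\]

and for Laplace priors with mean 0 and variance $2/\lambda_j^{2}$, the authors have

\[
l(\beta) = -\sum_{i=1}^{n} \ln\left( 1 + \exp(-\beta^{T} x_i y_i) \right) - \sum_{j=1}^{d} \left( \ln 2 - \ln\lambda_j + \lambda_j \lvert\beta_j\rvert \right), \tag{9}
\]

with j = 1, ..., d indexing the features in both cases.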
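The two log posteriors in this answer translate almost term by term into code. The sketch below is illustrative only and is not taken from BBR or BMR; the function names are hypothetical, labels are assumed to be coded as -1/+1, and dense Python lists stand in for the sparse vectors a real text classifier would use.

```python
import math

def log_likelihood(beta, X, y):
    """Sum over examples of -ln(1 + exp(-beta^T x_i y_i)), y_i in {-1, +1}."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        margin = y_i * sum(b * v for b, v in zip(beta, x_i))
        # Stable evaluation of ln(1 + exp(-margin)).
        if margin > 0:
            total -= math.log1p(math.exp(-margin))
        else:
            total -= -margin + math.log1p(math.exp(margin))
    return total

def log_posterior_gaussian(beta, X, y, tau):
    """Log posterior with Gaussian priors (mean 0, variance tau[j])."""
    penalty = sum(
        math.log(math.sqrt(t)) + math.log(2 * math.pi) / 2 + b * b / (2 * t)
        for b, t in zip(beta, tau)
    )
    return log_likelihood(beta, X, y) - penalty

def log_posterior_laplace(beta, X, y, lam):
    """Log posterior with Laplace priors (mean 0, variance 2 / lam[j]**2)."""
    penalty = sum(
        math.log(2) - math.log(l) + l * abs(b)
        for b, l in zip(beta, lam)
    )
    return log_likelihood(beta, X, y) - penalty

# Toy data, made up for illustration.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, -1]
print(log_posterior_gaussian([0.0, 0.0], X, y, [1.0, 1.0]))
print(log_posterior_laplace([0.0, 0.0], X, y, [1.0, 1.0]))
```

The only difference between the two functions is the penalty term: quadratic in each coefficient for the Gaussian prior, and proportional to its absolute value for the Laplace prior, which is what drives coefficients to exactly zero.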

Accepted Answer

The authors sometimes report the unweighted arithmetic mean of F1 across the categories for a particular test collection, that is, the macroaveraged F1. Logistic regression models produce an estimate of p(y = +1|β,xi), the probability that vector xi belongs to the category of interest.
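Both quantities in this answer are short computations. The sketch below is a hedged illustration with hypothetical function names; it assumes the standard logistic link for the probability and per-category counts of true positives, false positives, and false negatives for F1. Returning 1.0 when a category has no positives and no predictions is a convention chosen for this sketch, not something stated in the answer.

```python
import math

def prob_positive(beta, x):
    """Estimate p(y = +1 | beta, x) with the logistic link."""
    score = sum(b * v for b, v in zip(beta, x))
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)  # avoids overflow for very negative scores
    return e / (1.0 + e)

def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN) for one category."""
    denom = 2 * tp + fp + fn
    # Assumption for this sketch: the empty case counts as a perfect score.
    return 1.0 if denom == 0 else 2 * tp / denom

def macro_f1(counts):
    """Unweighted arithmetic mean of per-category F1 values."""
    return sum(f1(*c) for c in counts) / len(counts)

# Toy (tp, fp, fn) counts for two categories, made up for illustration.
print(macro_f1([(1, 1, 1), (2, 0, 0)]))
```

Because the mean is unweighted, a rare category contributes as much to macroaveraged F1 as a frequent one.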

Accepted Answer

Most feature selection measures used in text classification (including the two just discussed) take into account only the presence or absence of terms in documents.

Most frequently asked questions

1. What contributions do the authors describe in the paper "Large-Scale Bayesian Logistic Regression for Text Categorization"?

2. What are the two state-of-the-art approaches used as benchmarks?

3. How many documents were used in the first experiment?

4. How many times did the authors test the Laplace hyperparameter λj?

5. What is the $\beta_j^{\text{new}}$ that gives the minimum value of g(·)?

6. What test did the authors use to test the significance of differences among the algorithms?

7. What data set was used for the LYRL2004 training/test split?

8. What is the way to calculate the maximum likelihood of a logistic regression model?

9. What is the posterior density for β with the logistic link on data set D?

10. What is the probability of a vector belonging to the class of interest?

11. What is the measure of feature selection in a recent study?
