Bigram

Papers published on a yearly basis

Papers

Journal Article•10.1006/CSLA.1999.0128•

An empirical study of smoothing techniques for language modeling

Stanley F. Chen¹, Joshua T. Goodman²•Institutions (2)

Carnegie Mellon University¹, Microsoft²

01 Oct 1999-Computer Speech & Language

TL;DR: This work surveys the most widely used smoothing algorithms for n-gram language modeling, presents an extensive empirical comparison of several of them, including the technique described by Jelinek and Mercer (1980), and introduces methodologies for analyzing the efficacy of smoothing algorithms in detail.

2,097 citations
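
The interpolation idea behind Jelinek-Mercer smoothing can be illustrated with a minimal sketch for the bigram case. The function names, the toy corpus, and the single fixed weight `lam` are assumptions made for illustration; they are not taken from the paper, and in practice the interpolation weights are typically estimated on held-out data rather than fixed by hand.

```python
from collections import Counter

def train_bigram_counts(tokens):
    """Count unigrams and adjacent-word bigrams in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def jelinek_mercer_prob(prev, word, unigrams, bigrams, lam=0.7):
    """Interpolated estimate: lam * P_ML(word | prev) + (1 - lam) * P_ML(word).

    lam is a hypothetical fixed weight chosen only for this sketch.
    """
    total = sum(unigrams.values())
    p_unigram = unigrams[word] / total if total else 0.0
    # Number of times prev occurs as the first word of a bigram.
    history = sum(c for (w1, _), c in bigrams.items() if w1 == prev)
    p_bigram = bigrams[(prev, word)] / history if history else 0.0
    return lam * p_bigram + (1 - lam) * p_unigram
```

With the toy corpus `"the cat sat on the mat".split()`, the unseen bigram ("the", "sat") still receives a nonzero probability, because the unigram term backs up the zero maximum-likelihood bigram estimate.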

Proceedings Article•10.18653/V1/P17-1171•

Reading Wikipedia to Answer Open-Domain Questions

Danqi Chen¹, Adam Fisch², Jason Weston², Antoine Bordes²•Institutions (2)

Stanford University¹, Facebook²

31 Mar 2017

TL;DR: This approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs; experiments on multiple existing QA datasets indicate that both modules are highly competitive with existing counterparts.

Abstract: This paper proposes to tackle open-domain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.

1,580 citations
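
As a rough illustration of what "bigram hashing and TF-IDF matching" could look like, the sketch below hashes unigrams and bigrams into a fixed number of buckets and ranks documents by a TF-IDF dot product. The hash function (`zlib.crc32`), the bucket count, the smoothed IDF formula, and all function names are assumptions chosen for this example; the abstract does not specify them, and this is not the authors' implementation.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 2 ** 20  # assumed bucket count, chosen only for illustration

def hashed_ngrams(text):
    """Map the unigrams and bigrams of a text to hashed bucket counts."""
    tokens = text.lower().split()
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in grams)

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts keyed by bucket) for all documents."""
    counts = [hashed_ngrams(d) for d in docs]
    doc_freq = Counter(bucket for c in counts for bucket in c)
    n = len(docs)
    # Smoothed inverse document frequency (an assumed variant).
    idf = {b: math.log((n + 1) / (f + 1)) + 1.0 for b, f in doc_freq.items()}
    vectors = [{b: tf * idf[b] for b, tf in c.items()} for c in counts]
    return vectors, idf

def rank(query, docs):
    """Return document indices sorted by descending TF-IDF dot product."""
    vectors, idf = tfidf_vectors(docs)
    q = {b: tf * idf.get(b, 0.0) for b, tf in hashed_ngrams(query).items()}
    scores = [sum(w * v.get(b, 0.0) for b, w in q.items()) for v in vectors]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Hashing keeps the feature space bounded regardless of how many distinct bigrams occur, at the cost of occasional collisions between unrelated n-grams. Including bigrams lets a query such as "new york" favor documents containing that exact phrase over documents that contain only one of the two words.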

Posted Content•

Reading Wikipedia to Answer Open-Domain Questions

Danqi Chen¹, Adam Fisch², Jason Weston², Antoine Bordes²•Institutions (2)

Stanford University¹, Facebook²

31 Mar 2017-arXiv: Computation and Language

TL;DR: This paper proposes an approach to open-domain question answering that combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answer spans in Wikipedia paragraphs.

Abstract: This paper proposes to tackle open-domain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.

1,469 citations

Proceedings Article•

Baselines and Bigrams: Simple, Good Sentiment and Topic Classification

Sida I. Wang¹, Christopher D. Manning¹•Institutions (1)

Stanford University¹

8 Jul 2012

TL;DR: It is shown that the inclusion of word bigram features gives consistent gains on sentiment analysis tasks, and a simple but novel SVM variant using NB log-count ratios as feature values consistently performs well across tasks and datasets.

Abstract: Variants of Naive Bayes (NB) and Support Vector Machines (SVM) are often used as baseline methods for text classification, but their performance varies greatly depending on the model variant, features used and task/dataset. We show that: (i) the inclusion of word bigram features gives consistent gains on sentiment analysis tasks; (ii) for short snippet sentiment tasks, NB actually does better than SVMs (while for longer documents the opposite result holds); (iii) a simple but novel SVM variant using NB log-count ratios as feature values consistently performs well across tasks and datasets. Based on these observations, we identify simple NB and SVM variants which outperform most published results on sentiment analysis datasets, sometimes providing a new state-of-the-art performance level.

1,354 citations
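
The "NB log-count ratios as feature values" idea can be sketched as follows: compute, for each unigram or bigram feature, the log of the ratio between its smoothed relative frequency in positive and in negative documents, then use that ratio as the feature's value before training a linear classifier. The binary presence features, the smoothing constant `alpha`, and the function names are assumptions for this illustration, and the classifier training step is omitted; this is not presented as the authors' exact procedure.

```python
import math
from collections import Counter

def ngram_features(text):
    """Binary presence features over the unigrams and bigrams of a text."""
    tokens = text.lower().split()
    return set(tokens) | {" ".join(pair) for pair in zip(tokens, tokens[1:])}

def log_count_ratios(texts, labels, alpha=1.0):
    """Return r[f] = log((p_f / sum(p)) / (q_f / sum(q))) with add-alpha smoothing.

    p and q are smoothed feature counts over positive (label 1) and
    negative (any other label) documents; alpha is an assumed default.
    """
    pos, neg = Counter(), Counter()
    for text, label in zip(texts, labels):
        (pos if label == 1 else neg).update(ngram_features(text))
    vocab = set(pos) | set(neg)
    p_total = sum(pos[f] + alpha for f in vocab)
    q_total = sum(neg[f] + alpha for f in vocab)
    return {
        f: math.log(((pos[f] + alpha) / p_total) / ((neg[f] + alpha) / q_total))
        for f in vocab
    }

def scale_features(text, ratios):
    """Replace each known binary feature of a text with its log-count ratio."""
    return {f: ratios[f] for f in ngram_features(text) if f in ratios}
```

Features that occur more often in positive documents get positive values, features that occur more often in negative documents get negative values, and features that are equally common in both classes get a value near zero, so the downstream linear model starts from class-informative inputs.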

Proceedings Article•10.1145/1143844.1143967•

Topic modeling: beyond bag-of-words

Hanna Wallach¹•Institutions (1)

University of Cambridge¹

25 Jun 2006

TL;DR: A hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model is explored.

Abstract: Some models of textual corpora employ text generation methods involving n-gram statistics, while others use latent topic variables inferred using the "bag-of-words" assumption, in which word order is ignored. Previously, these methods have not been combined. In this work, I explore a hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model. The model hyperparameters are inferred using a Gibbs EM algorithm. On two data sets, each of 150 documents, the new model exhibits better predictive accuracy than either a hierarchical Dirichlet bigram language model or a unigram topic model. Additionally, the inferred topics are less dominated by function words than are topics discovered using unigram statistics, potentially making them more meaningful.

1,320 citations

Year	Papers
2025	20
2024	41
2023	85
2022	164
2021	71
2020	70
