Lower-bounding term frequency normalization

doi:10.1145/2063576.2063584

Open Access•Proceedings Article•10.1145/2063576.2063584

Yuanhua Lv, +1 more

- 24 Oct 2011

- pp 7-16

165

TL;DR: This paper proposes a general and efficient method to introduce a sufficiently large lower bound for TF normalization which can be shown analytically to fix or alleviate the problem of very long documents being overly penalized.

Abstract: In this paper, we reveal a common deficiency of the current retrieval models: the component of term frequency (TF) normalization by document length is not lower-bounded properly; as a result, very long documents tend to be overly penalized. In order to analytically diagnose this problem, we propose two desirable formal constraints to capture the heuristic of lower-bounding TF, and use constraint analysis to examine several representative retrieval functions. Analysis results show that all these retrieval functions can only satisfy the constraints for a certain range of parameter values and/or for a particular set of query terms. Empirical results further show that the retrieval performance tends to be poor when the parameter is out of the range or the query term is not in the particular set. To solve this common problem, we propose a general and efficient method to introduce a sufficiently large lower bound for TF normalization which can be shown analytically to fix or alleviate the problem. Our experimental results demonstrate that the proposed method, incurring almost no additional computational cost, can be applied to state-of-the-art retrieval functions, such as Okapi BM25, language models, and the divergence from randomness approach, to significantly improve the average precision, especially for verbose queries.


Citations

•Journal Article•10.1145/3239571

Anserini: Reproducible Ranking Baselines Using Lucene

Peilin Yang, +2 more

- 29 Oct 2018

- Journal of Data and Information Quality

TL;DR: This paper describes Anserini, an information retrieval toolkit built on Lucene that allows researchers to easily reproduce results with modern bag-of-words ranking models on diverse test collections, and demonstrates that Lucene provides a suitable framework for supporting information retrieval research.

275

•Proceedings Article•10.1145/2872427.2883044

Addressing Complex and Subjective Product-Related Queries with Customer Reviews

Julian McAuley, +1 more

- 11 Apr 2016

TL;DR: Moqa, a machine learning system that automatically learns whether a review of a product is relevant to a given query, is evaluated on a novel corpus of 1.4 million questions (and answers) and 13 million reviews to show quantitatively that it is effective at addressing both binary and open-ended queries, and qualitatively that it surfaces reviews that human evaluators consider to be relevant.

228

•Proceedings Article•10.1145/2682862.2682863

Improvements to BM25 and Language Models Examined

Andrew Trotman, +2 more

- 26 Nov 2014

TL;DR: This investigation finds that once trained (using particle swarm optimization) there is very little difference in performance between these functions, that relevance feedback is effective, that stemming is effective, and that it remains unclear which function is best overall.

214

•Book

Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining

ChengXiang Zhai, +1 more

- 30 Jun 2016

TL;DR: A systematic introduction to a wide range of statistical and heuristic approaches to the management and analysis of text data that have been developed over the past few decades, with an emphasis on covering the most useful knowledge and skills required to build a variety of practically useful text information systems.

183

•Book Chapter•10.1145/2915031.2915050

Opinion Mining and Sentiment Analysis

ChengXiang Zhai, +1 more

- 23 Jun 2016

76

References

•Journal Article•10.1145/361219.361220

A vector space model for automatic indexing

Gerard Salton, +2 more

- 01 Nov 1975

- Communications of The ACM

TL;DR: An approach based on space density computations is used to choose an optimum indexing vocabulary for a collection of documents, demonstrating the usefulness of the model.

7.9K

•Journal Article•10.1145/3130348.3130368

A language modeling approach to information retrieval

Jay Ponte, +1 more

- 01 Aug 1998

TL;DR: It will be shown that probabilistic methods can be used to predict topic changes in the context of the task of new event detection and provide further proof of concept for the use of language models for retrieval tasks.

2.8K

•Journal Article•10.1002/ASI.4630270302

Relevance weighting of search terms

Stephen Robertson, +1 more

- 01 May 1976

- Journal of the Association for Informati...

TL;DR: In this article, a series of relevance weighting functions is derived and justified by theoretical considerations; in particular, it is shown that specific weighted search methods are implied by a general probabilistic theory of retrieval.

2K

•Journal Article•10.1145/3130348.3130377

A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval

ChengXiang Zhai, +1 more

- 01 Sep 2001

TL;DR: This paper examines the sensitivity of retrieval performance to the smoothing parameters and compares several popular smoothing methods on different test collections.

1.6K

•Proceedings Article•10.5555/188490.188561

Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval

Stephen Robertson, +1 more

- 01 Aug 1994

TL;DR: The 2-Poisson model for term frequencies is used to suggest ways of incorporating certain variables in probabilistic models for information retrieval, and substantial performance improvements are demonstrated.

1.5K