Lower-bounding term frequency normalization
Yuanhua Lv, ChengXiang Zhai
- 24 Oct 2011
- pp 7-16
TL;DR: This paper proposes a general and efficient method to introduce a sufficiently large lower bound for TF normalization which can be shown analytically to fix or alleviate the problem of very long documents being overly penalized.
Abstract: In this paper, we reveal a common deficiency of the current retrieval models: the component of term frequency (TF) normalization by document length is not lower-bounded properly; as a result, very long documents tend to be overly penalized. In order to analytically diagnose this problem, we propose two desirable formal constraints to capture the heuristic of lower-bounding TF, and use constraint analysis to examine several representative retrieval functions. Analysis results show that all these retrieval functions can only satisfy the constraints for a certain range of parameter values and/or for a particular set of query terms. Empirical results further show that the retrieval performance tends to be poor when the parameter is out of the range or the query term is not in the particular set. To solve this common problem, we propose a general and efficient method to introduce a sufficiently large lower bound for TF normalization which can be shown analytically to fix or alleviate the problem. Our experimental results demonstrate that the proposed method, incurring almost no additional computational cost, can be applied to state-of-the-art retrieval functions, such as Okapi BM25, language models, and the divergence from randomness approach, to significantly improve the average precision, especially for verbose queries.
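The effect described in the abstract can be illustrated with a small sketch. The abstract does not give a formula, so the code below assumes the lower bound is an additive constant `delta` applied to the BM25 TF component of a matched term (the variant commonly referred to as BM25+). The function names and the defaults `k1=1.2`, `b=0.75`, `delta=1.0` are illustrative assumptions, not values taken from this page.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Standard BM25 TF component with document length normalization."""
    if tf <= 0:
        return 0.0
    # Length normalization factor: grows without bound as doc_len grows.
    norm = 1.0 - b + b * doc_len / avg_doc_len
    return (k1 + 1.0) * tf / (k1 * norm + tf)


def bm25_tf_lower_bounded(tf, doc_len, avg_doc_len, k1=1.2, b=0.75, delta=1.0):
    """Assumed lower-bounded variant: add delta whenever the term occurs.

    A document that matches the term always scores at least delta above
    a document that does not, regardless of how long it is.
    """
    if tf <= 0:
        return 0.0
    return bm25_tf(tf, doc_len, avg_doc_len, k1, b) + delta


if __name__ == "__main__":
    avg_len = 100
    # One occurrence of the query term in increasingly long documents.
    for doc_len in (100, 10_000, 1_000_000):
        plain = bm25_tf(1, doc_len, avg_len)
        bounded = bm25_tf_lower_bounded(1, doc_len, avg_len)
        print(f"doc_len={doc_len:>9}  bm25={plain:.6f}  lower_bounded={bounded:.6f}")
```

In this sketch the unmodified component tends toward zero as the document grows, so a very long matching document becomes nearly indistinguishable from a non-matching one, while the lower-bounded version keeps a gap of at least `delta`. The extra cost is a single addition per matched term, in line with the abstract's claim of almost no additional computational cost.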
Citations
Anserini: Reproducible Ranking Baselines Using Lucene
Peilin Yang, Hui Fang, Jimmy Lin
TL;DR: Describes Anserini, an information retrieval toolkit built on Lucene that lets researchers easily reproduce results with modern bag-of-words ranking models on diverse test collections, and demonstrates that Lucene provides a suitable framework for supporting information retrieval research.
275
Addressing Complex and Subjective Product-Related Queries with Customer Reviews
Julian McAuley, Alex Yang
- 11 Apr 2016
TL;DR: Moqa, a machine learning system that automatically learns whether a review of a product is relevant to a given query, is evaluated on a novel corpus of 1.4 million questions (and answers) and 13 million reviews to show quantitatively that it is effective at addressing both binary and open-ended queries, and qualitatively that it surfaces reviews that human evaluators consider to be relevant.
Improvements to BM25 and Language Models Examined
Andrew Trotman, Antti Puurula, Blake Burgess
- 26 Nov 2014
TL;DR: This investigation finds that once trained (using particle swarm optimization) there is very little difference in performance between these functions, that relevance feedback is effective, that stemming is effective and that it remains unclear which function is best over-all.
•Book
Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining
ChengXiang Zhai, Sean Massung
- 30 Jun 2016
TL;DR: A systematic introduction to a wide range of statistical and heuristic approaches to the management and analysis of text data that have been developed over the past few decades, with an emphasis on covering the most useful knowledge and skills required to build a variety of practically useful text information systems.
183
References
A vector space model for automatic indexing
Gerard Salton, A. Wong, C. S. Yang
TL;DR: An approach based on space density computations is used to choose an optimum indexing vocabulary for a collection of documents, demonstrating the usefulness of the model.
A language modeling approach to information retrieval
Jay Ponte, W. Bruce Croft
- 01 Aug 1998
TL;DR: It will be shown that probabilistic methods can be used to predict topic changes in the context of the task of new event detection and provide further proof of concept for the use of language models for retrieval tasks.
Relevance weighting of search terms
TL;DR: In this article, a series of relevance weighting functions is derived and justified by theoretical considerations; in particular, it is shown that specific weighted search methods are implied by a general probabilistic theory of retrieval.
2K
A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval
ChengXiang Zhai, John Lafferty
- 01 Sep 2001
TL;DR: This paper examines the sensitivity of retrieval performance to the smoothing parameters and compares several popular smoothing methods on different test collections.
Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval
Stephen Robertson, S. Walker
- 01 Aug 1994
TL;DR: The 2-Poisson model for term frequencies is used to suggest ways of incorporating certain variables in probabilistic models for information retrieval, and substantial performance improvements are demonstrated.
1.5K
Related Papers (5)
Hui Fang, Tao Tao, ChengXiang Zhai
- 25 Jul 2004
Amit Singhal, Chris Buckley, Mandar Mitra
- 18 Aug 1996