Efficient algorithms for document retrieval problems
S. Muthukrishnan
- 06 Jan 2002
- pp 657-666
294
TL;DR: This paper considers document retrieval problems that are motivated by online query processing in databases, Information Retrieval systems and Computational Biology, and provides the first known optimal algorithm for the document listing problem.
Abstract: We are given a collection D of text documents d1,…,dk, with ∑i |di| = n, which may be preprocessed. In the document listing problem, we are given an online query comprising a pattern string p of length m and our goal is to return the set of all documents that contain one or more copies of p. In the closely related occurrence listing problem, we output the set of all positions within the documents where pattern p occurs. In 1973, Weiner [24] presented an algorithm with O(n) time and space preprocessing following which the occurrence listing problem can be solved in time O(m + output), where output is the number of positions where p occurs; this algorithm is clearly optimal. In contrast, no optimal algorithm is known for the closely related document listing problem, which is perhaps more natural and certainly well-motivated. We provide the first known optimal algorithm for the document listing problem. More generally, we initiate the study of pattern matching problems that require retrieving documents matched by the patterns; this contrasts with pattern matching problems that have been studied more frequently, namely, those that involve retrieving all occurrences of patterns. We consider document retrieval problems that are motivated by online query processing in databases, Information Retrieval systems and Computational Biology. We present very efficient (optimal) algorithms for our document retrieval problems. Our approach for solving such problems involves performing "local" encodings whereby they are reduced to range query problems on geometric objects (points and lines) that have color. We present improved algorithms for these colored range query problems that arise in our reductions using the structural properties of strings. This approach is quite general and yields simple, efficient, implementable algorithms for all the document retrieval problems in this paper.
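The abstract does not spell out the reduction, so the following is only a minimal sketch of one way document listing can be phrased as a colored range query, under stated assumptions. Suffixes of all documents are sorted, each suffix is "colored" by its document, and a pattern matches a contiguous range of that sorted order. Listing the distinct colors in the range then amounts to finding positions whose previous same-colored position falls before the range. The names build_index, list_documents and prev are hypothetical, and the naive sort, the per-query prefix list and the linear-scan range minimum are stand-ins for the linear-time suffix structure and constant-time range-minimum structure that an optimal solution would need.

```python
from bisect import bisect_left, bisect_right


def build_index(docs):
    """Sort all suffixes of all documents (naive stand-in for a suffix tree)."""
    suffixes = sorted(
        (doc[i:], d) for d, doc in enumerate(docs) for i in range(len(doc))
    )
    # doc_ids[i] is the "color" (document number) of the i-th smallest suffix.
    doc_ids = [d for _, d in suffixes]
    # prev[i] is the closest earlier position with the same color, or -1.
    last_seen = {}
    prev = []
    for i, d in enumerate(doc_ids):
        prev.append(last_seen.get(d, -1))
        last_seen[d] = i
    return suffixes, doc_ids, prev


def list_documents(index, pattern):
    """Return the sorted ids of documents containing pattern at least once."""
    suffixes, doc_ids, prev = index
    m = len(pattern)
    # Truncating sorted suffixes to length m keeps them sorted, so the
    # suffixes that start with the pattern form one contiguous range.
    prefixes = [s[:m] for s, _ in suffixes]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern) - 1

    found = []
    stack = [(lo, hi)]
    while stack:
        left, right = stack.pop()
        if left > right:
            continue
        # Naive range minimum over prev; assumed O(1) in an optimal version.
        k = min(range(left, right + 1), key=prev.__getitem__)
        if prev[k] >= lo:
            # Every color in this subrange already occurs earlier in [lo, hi].
            continue
        # Position k is the first occurrence of its color inside [lo, hi].
        found.append(doc_ids[k])
        stack.append((left, k - 1))
        stack.append((k + 1, right))
    return sorted(found)
```

For example, with the documents "banana", "ananas" and "cherry", the query "ana" reports documents 0 and 1 once each, even though "banana" contains the pattern twice. The number of range-minimum calls is proportional to the number of documents reported rather than to the number of occurrences, which is the property that separates document listing from occurrence listing.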
Citations
Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays
Johannes Fischer,Volker Heun +1 more
TL;DR: This work builds a data structure that allows us to answer efficiently subsequent on-line queries of the form “what is the position of a minimum element in the subarray ranging from $i$ to $j$?”
312
Succinct data structures for flexible text retrieval systems
TL;DR: This work proposes succinct data structures for text retrieval systems supporting document listing queries and ranking queries based on the tf*idf (term frequency times inverse document frequency) scores of documents using small space.
214
Wavelet trees for all
Gonzalo Navarro
- 03 Jul 2012
TL;DR: This survey gives an overview of wavelet trees and the surprising number of applications in which they are useful: basic and weighted point grids, sets of rectangles, strings, permutations, binary relations, graphs, inverted indexes, document retrieval indexes, full-text indexes, XML indexes, and general numeric sequences.
183
A new succinct representation of RMQ-information and improvements in the enhanced suffix array
Johannes Fischer,Volker Heun +1 more
- 07 Apr 2007
TL;DR: The Range-Minimum-Query-Problem is solved by giving the first algorithm that never uses more than 2n + o(n) bits, and does not rely on rank- and select-queries or other succinct data structures, and a lower bound of 2n - o(n) bits is proved, which makes the algorithm asymptotically optimal.
167
•Posted Content
Higher Lower Bounds from the 3SUM Conjecture
TL;DR: This paper gives new and efficient reductions from 3SUM to offline SetDisjointness and offline SetIntersection and introduces new conditional lower bounds for dynamic versions of Maximum Cardinality Matching, which introduce a new technique for obtaining amortized lower bounds.
135
References
•Book
Modern Information Retrieval
Ricardo Baeza-Yates,Berthier Ribeiro-Neto +1 more
- 15 May 1999
TL;DR: The authors present a rigorous and complete textbook for a first course on information retrieval from the computer science (as opposed to a user-centred) perspective, which provides an up-to-date, student-oriented treatment of the subject.
Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology
Susan Holmes,Dan Gusfield +1 more
TL;DR: The author examines the importance of (sub)sequence comparison in molecular biology, core string edits, alignments and dynamic programming, and a deeper look at classical methods for exact string matching.
3.1K
Linear pattern matching algorithms
Peter Weiner
- 15 Oct 1973
TL;DR: A linear-time algorithm is presented for obtaining a compacted version of a bi-tree associated with a given string, and it is indicated how to use it to solve several pattern matching problems, including some from [4], in linear time.
2.1K
Fast algorithms for finding nearest common ancestors
Dov Harel,Robert E. Tarjan +1 more
TL;DR: An algorithm for a random access machine with uniform cost measure (and a bound of $\Omega(\log n)$ on the number of bits per word) that requires $O(1)$ time per query and $O(n)$ preprocessing time is presented, assuming that the collection of trees is static.
1.3K
Scaling and related techniques for geometry problems
Harold N. Gabow,Jon Louis Bentley,Robert E. Tarjan +2 more
- 01 Dec 1984
TL;DR: Three techniques in computational geometry are explored: scaling solves a problem by viewing it at increasing levels of numerical precision; activation is a restricted type of update operation, useful in sweep algorithms; the Cartesian tree is a data structure for problems involving maximums and minimums.
609