Journal Article • DOI: 10.1145/2590774
Space-Efficient Frameworks for Top-k String Retrieval
TL;DR: This work presents the first linear-space framework that is capable of handling arbitrary score functions with near-optimal O(p + k log k) query time and derives compact space and succinct space indexes (for some specific score functions).
Abstract: The inverted index is the backbone of modern web search engines. For each word in a collection of web documents, the index records the list of documents where this word occurs. Given a set of query words, the job of a search engine is to output a ranked list of the most relevant documents containing the query. However, if the query consists of an arbitrary string (which can be a partial word, a multiword phrase, or more generally any sequence of characters), then word boundaries are no longer relevant and we need a different approach. In string retrieval settings, we are given a set D = {d_1, d_2, d_3, ..., d_D} of D strings with n characters in total taken from an alphabet set Σ = [σ], and the task of the search engine, for a given query pattern P of length p, is to report the "most relevant" strings in D containing P. The query may also consist of two or more patterns. The notion of relevance can be captured by a function score(P, d_r), which indicates how relevant document d_r is to the pattern P. Some example score functions are the frequency of pattern occurrences, proximity between pattern occurrences, or pattern-independent PageRank of the document.
The first formal framework to study such kinds of retrieval problems was given by Muthukrishnan [SODA 2002]. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures that use O(n log n) words of space. We study this problem in a somewhat more natural top-k framework. Here, k is a part of the query, and the top k most relevant (highest-scoring) documents are to be reported in sorted order of score. We present the first linear-space framework (i.e., using O(n) words of space) that is capable of handling arbitrary score functions with near-optimal O(p + k log k) query time. The query time can be made optimal O(p + k) if sorted order is not necessary.
Further, we derive compact space and succinct space indexes (for some specific score functions). This space compression comes at the cost of higher query time. Finally, we extend our framework to handle the case of multiple patterns. Apart from providing a robust framework, our results also improve many earlier results in index space, query time, or both.
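To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not the linear-space index from the paper: it scans every document on every query and only illustrates what a top-k query returns. The names `frequency_score` and `top_k` are hypothetical, and frequency (counting overlapping occurrences) is used as one example of a score function.

```python
import heapq
from typing import Callable, List, Sequence, Tuple


def frequency_score(pattern: str, document: str) -> int:
    """Example score: number of (possibly overlapping) occurrences of pattern."""
    if not pattern:
        return 0
    count = 0
    start = document.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping occurrences are counted.
        start = document.find(pattern, start + 1)
    return count


def top_k(
    documents: Sequence[str],
    pattern: str,
    k: int,
    score: Callable[[str, str], float] = frequency_score,
) -> List[Tuple[int, float]]:
    """Return up to k (document index, score) pairs, highest score first.

    Only documents containing the pattern are candidates. Ties are broken
    in favor of the lower document index. This is a naive baseline that
    scans all documents; it does not use any index.
    """
    candidates = [
        (idx, score(pattern, doc))
        for idx, doc in enumerate(documents)
        if pattern in doc
    ]
    # nlargest returns the selected items in descending order of the key.
    return heapq.nlargest(k, candidates, key=lambda item: (item[1], -item[0]))


if __name__ == "__main__":
    docs = ["banana", "bandana", "cabana", "apple"]
    print(top_k(docs, "ana", 2))
```

Passing a different `score` callable, for example a pattern-independent document weight, changes the ranking without changing the query interface, which mirrors the "arbitrary score function" setting described in the abstract.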
Citations
Journal of the ACM
Dan Suciu, Victor Vianu
TL;DR: The following three articles are full versions of extended abstracts that were presented at the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS) and have been reviewed according to the standard JACM refereeing process.
862
Space-Efficient Data Structures, Streams, and Algorithms
Joan Boyar, Faith Ellen
- 01 Jan 2013
TL;DR: Matching upper and lower bounds of Θ(n log n) and Θ(n log log n) are proved for the deterministic and randomized query complexity, respectively.
28
On Hardness of Several String Indexing Problems
Kasper Green Larsen, J. Ian Munro, Jesper Sindahl Nielsen, Sharma V. Thankachan
- 16 Jun 2014
TL;DR: The two-pattern matching problems ask to index D string documents for answering the following queries efficiently.
18
Space-Efficient Dictionaries for Parameterized and Order-Preserving Pattern Matching
Arnab Ganguly, Wing-Kai Hon, Kunihiko Sadakane, Rahul Shah, Sharma V. Thankachan, Yilin Yang
- 01 Jan 2016
TL;DR: In this paper, the authors considered two variants of string matching: parameterized matching and order-preserving matching, where the characters of S and S' are partitioned into static characters and parameterized characters.
16
Top-k Term-Proximity in Succinct Space
TL;DR: The first sub-linear space data structure for this relevance function is presented, which uses only o(n) bits on top of any compressed suffix array of $\mathcal{D}$ and solves queries in O((p + k) polylog n) time.
References
•Proceedings Article
The PageRank Citation Ranking: Bringing Order to the Web
Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd
- 11 Nov 1999
TL;DR: This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them, and shows how to efficiently compute PageRank for large numbers of pages.
16.4K
Fast Pattern Matching in Strings
TL;DR: An algorithm is presented which finds all occurrences of one given string within another, in running time proportional to the sum of the lengths of the strings, showing that the set of concatenations of even palindromes, i.e., the language $\{\alpha \alpha ^R\}^*$, can be recognized in linear time.
3.4K
Suffix arrays: a new method for on-line string searches
Udi Manber, Gene Myers
TL;DR: A new and conceptually simple data structure, called a suffix array, for on-line string searches is introduced in this paper, and it is believed that suffix arrays will prove to be better in practice than suffix trees for many applications.
2.4K
Linear pattern matching algorithms
Peter Weiner
- 15 Oct 1973
TL;DR: A linear time algorithm for obtaining a compacted version of a bi-tree associated with a given string is presented, and it is indicated how to solve several pattern matching problems, including some from [4], in linear time.
2.1K
A Space-Economical Suffix Tree Construction Algorithm
TL;DR: A new algorithm is presented for constructing auxiliary digital search trees to aid in exact-match substring searching that has the same asymptotic running time bound as previously published algorithms, but is more economical in space.
1.7K