Space-Efficient Frameworks for Top-k String Retrieval

doi:10.1145/2590774

Journal Article•10.1145/2590774

Wing-Kai Hon, +3 more

- 24 Apr 2014

- Journal of the ACM

- Vol. 61, Iss: 2, pp 9

45

TL;DR: This work presents the first linear-space framework that is capable of handling arbitrary score functions with near-optimal O(p + k log k) query time, and derives compact-space and succinct-space indexes for some specific score functions.

Abstract: The inverted index is the backbone of modern web search engines. For each word in a collection of web documents, the index records the list of documents where this word occurs. Given a set of query words, the job of a search engine is to output a ranked list of the most relevant documents containing the query. However, if the query consists of an arbitrary string (which can be a partial word, a multiword phrase, or more generally any sequence of characters), then word boundaries are no longer relevant and we need a different approach. In string retrieval settings, we are given a set D = {d_1, d_2, d_3, ..., d_D} of D strings with n characters in total taken from an alphabet set Σ = [σ], and the task of the search engine, for a given query pattern P of length p, is to report the "most relevant" strings in D containing P. The query may also consist of two or more patterns. The notion of relevance can be captured by a function score(P, d_r), which indicates how relevant document d_r is to the pattern P. Some example score functions are the frequency of pattern occurrences, proximity between pattern occurrences, or pattern-independent PageRank of the document. The first formal framework to study such kinds of retrieval problems was given by Muthukrishnan [SODA 2002]. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures that use O(n log n) words of space. We study this problem in a somewhat more natural top-k framework. Here, k is a part of the query, and the top k most relevant (highest-scoring) documents are to be reported in sorted order of score. We present the first linear-space framework (i.e., using O(n) words of space) that is capable of handling arbitrary score functions with near-optimal O(p + k log k) query time. The query time can be made optimal, O(p + k), if sorted order is not necessary.
Further, we derive compact-space and succinct-space indexes (for some specific score functions). This space compression comes at the cost of higher query time. Finally, we extend our framework to handle the case of multiple patterns. Apart from providing a robust framework, our results also improve many earlier results in index space or query time or both.
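
To make the query semantics in the abstract concrete, the following is a minimal brute-force sketch of top-k string retrieval with the frequency score. It is not the paper's index: it scans every document on each query, so it has none of the space or time guarantees stated above. The names `count_occurrences` and `top_k`, and the tie-breaking rule (lower document index first), are assumptions made for illustration only.

```python
import heapq
from typing import Callable, List, Tuple


def count_occurrences(pattern: str, doc: str) -> int:
    """Frequency score: number of (possibly overlapping) occurrences of pattern in doc."""
    if not pattern:
        return 0
    count = 0
    start = doc.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are counted.
        start = doc.find(pattern, start + 1)
    return count


def top_k(
    docs: List[str],
    pattern: str,
    k: int,
    score: Callable[[str, str], int] = count_occurrences,
) -> List[Tuple[int, int]]:
    """Return up to k (doc_index, score) pairs for documents containing pattern,
    sorted by decreasing score. Ties go to the lower index (an illustrative choice)."""
    candidates = [
        (i, score(pattern, d)) for i, d in enumerate(docs) if pattern in d
    ]
    # nlargest returns results in descending order of the key.
    return heapq.nlargest(k, candidates, key=lambda t: (t[1], -t[0]))


if __name__ == "__main__":
    docs = ["banana", "bandana", "cabana", "apple"]
    print(top_k(docs, "ana", 2))
```

Any function of the form score(P, d_r) can be passed in place of `count_occurrences`; a pattern-independent score such as a precomputed PageRank value would simply ignore its first argument.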

Citations

Journal Article•10.1145/1131342.1131343

Journal of the ACM

Dan Suciu, +1 more

- 01 Jan 2006

- Journal of the ACM

TL;DR: The following three articles are full versions of extended abstracts that were presented at the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS) and have been reviewed according to the standard JACM refereeing process.

862

Space-Efficient Data Structures, Streams, and Algorithms

Joan Boyar, +1 more

- 01 Jan 2013

TL;DR: Matching upper and lower bounds of Θ(n log n) and Θ(n log log n) are proved for the deterministic and randomized query complexity, respectively.

28

Book Chapter•10.1007/978-3-319-07566-2_25

On Hardness of Several String Indexing Problems

Kasper Green Larsen, +3 more

- 16 Jun 2014

TL;DR: The two-pattern matching problems ask to index D string documents for answering the following queries efficiently.

18

Proceedings Article•10.4230/LIPICS.CPM.2016.2

Space-Efficient Dictionaries for Parameterized and Order-Preserving Pattern Matching

Arnab Ganguly, +5 more

- 01 Jan 2016

TL;DR: In this paper, the authors considered two variants of string matching: parameterized matching and order-preserving matching, where the characters of S and S' are partitioned into static characters and parameterized characters.

16

Journal Article•10.1007/S00453-016-0167-2

Top-k Term-Proximity in Succinct Space

J. Ian Munro, +4 more

- 01 Jun 2017

- Algorithmica

TL;DR: The first sub-linear space data structure for this relevance function is presented, which uses only o(n) bits on top of any compressed suffix array of $\mathcal{D}$ and solves queries in O((p + k) polylog n) time.

13

References

Proceedings Article

The PageRank Citation Ranking: Bringing Order to the Web

Lawrence Page, +3 more

- 11 Nov 1999

TL;DR: This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them, and shows how to efficiently compute PageRank for large numbers of pages.

16.4K

Journal Article•10.1137/0206024

Fast Pattern Matching in Strings

Donald E. Knuth, +2 more

- 01 Jun 1977

- SIAM Journal on Computing

TL;DR: An algorithm is presented which finds all occurrences of one given string within another, in running time proportional to the sum of the lengths of the strings, showing that the set of concatenations of even palindromes, i.e., the language $\{\alpha \alpha ^R\}^*$, can be recognized in linear time.

3.4K

Journal Article•10.1137/0222058

Suffix arrays: a new method for on-line string searches

Udi Manber, +1 more

- 01 Oct 1993

- SIAM Journal on Computing

TL;DR: A new and conceptually simple data structure, called a suffix array, for on-line string searches is introduced in this paper, and it is believed that suffix arrays will prove to be better in practice than suffix trees for many applications.

2.4K

Proceedings Article•10.1109/SWAT.1973.13

Linear pattern matching algorithms

Peter Weiner

- 15 Oct 1973

TL;DR: A linear-time algorithm for obtaining a compacted version of a bi-tree associated with a given string is presented, and it is indicated how to solve several pattern matching problems, including some from [4], in linear time.

2.1K

Journal Article•10.1145/321941.321946

A Space-Economical Suffix Tree Construction Algorithm

Edward M. McCreight

- 01 Apr 1976

- Journal of the ACM

TL;DR: A new algorithm is presented for constructing auxiliary digital search trees to aid in exact-match substring searching that has the same asymptotic running time bound as previously published algorithms, but is more economical in space.

1.7K

Related Papers (5)

Efficient algorithms for document retrieval problems

Linear pattern matching algorithms

Suffix arrays: a new method for on-line string searches

Indexing compressed text

Augmenting Suffix Trees, with Applications