About: Compressed data structure is a research topic. Over the lifetime, 95 publications have been published within this topic receiving 3279 citations.
TL;DR: A novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet σ, where each symbol is encoded by lg|σ| bits.
Abstract: We present a novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet σ, where each symbol is encoded by lgvσv bits. We show that compressed suffix arrays use just nHh + σ bits, while retaining full text indexing functionalities, such as searching any pattern sequence of length m in O(m lg vσv + polylog(n)) time. The term Hh ≤ lg vσv denotes the hth-order empirical entropy of the text, which means that our index is nearly optimal in space apart from lower-order terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant 1). If the text is highly compressible so that Hn = o(1) and the alphabet size is small, we obtain a text index with o(m) search time that requires only o(n) bits. Further results and tradeoffs are reported in the paper.
TL;DR: Two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form are designed and exploits the interplay between two compressors: the Burrows--Wheeler Transform and the LZ78 algorithm.
Abstract: We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form.Our first compressed data structure retrieves the occ occurrences of a pattern P[1,p] within a text T[1,n] in O(p p occ log1pen) time for any chosen e, 0
TL;DR: This work sets up the problem of deciding the block partitioning as an optimization problem which maximizes how accurately the block upper bounds represent the underlying scores, and describes an efficient algorithm to find an approximate solution, with provable approximation guarantees.
Abstract: Query processing is one of the main bottlenecks in large-scale search engines. Retrieving the top k most relevant documents for a given query can be extremely expensive, as it involves scoring large amounts of documents. Several dynamic pruning techniques have been introduced in the literature to tackle this problem, such as BlockMaxWAND, which splits the inverted index into constant- sized blocks and stores the maximum document-term scores per block; this information can be used during query execution to safely skip low-score documents, producing many-fold speedups over exhaustive methods. We introduce a refinement for BlockMaxWAND that uses variable- sized blocks, rather than constant-sized. We set up the problem of deciding the block partitioning as an optimization problem which maximizes how accurately the block upper bounds represent the underlying scores, and describe an efficient algorithm to find an approximate solution, with provable approximation guarantees. rough an extensive experimental analysis we show that our method significantly outperforms the state of the art roughly by a factor 2×. We also introduce a compressed data structure to represent the additional block information, providing a compression ratio of roughly 50%, while incurring only a small speed degradation, no more than 10% with respect to its uncompressed counterpart.
TL;DR: This paper proposes LCS-Hist, an innovative multidimensional histogram devising a complex methodology that combines intelligent data modeling and processing techniques in order to tame the annoying problem of compressing massive high-dimensional data cubes.
Abstract: The problem of efficiently compressing massive high-dimensional data cubes still waits for efficient solutions capable of overcoming well-recognized scalability limitations of state-of-the-art histogram-based techniques, which perform well on small-in-size low-dimensional data cubes, whereas their performance in both representing the input data domain and efficiently supporting approximate query answering against the generated compressed data structure decreases dramatically when data cubes grow in dimension number and size. To overcome this relevant research challenge, in this paper we propose LCS-Hist, an innovative multidimensional histogram devising a complex methodology that combines intelligent data modeling and processing techniques in order to tame the annoying problem of compressing massive high-dimensional data cubes. With respect to similar histogram-based proposals, our technique introduces (i) a surprising consumption of the storage space available to house the compressed representation of the input data cube, and (ii) a superior scalability on high-dimensional data cubes. Finally, several experimental results performed against various classes of data cubes confirm the advantages of LCS-Hist, even in comparison with those given by state-of-the-art similar techniques.
TL;DR: The proposed technique can be efficiently used in QoA-based OLAP tools, where OLAP users/applications and DW servers are allowed to mediate on the accuracy of (approximate) answers, similarly to what happens in QoS-based systems for the quality of services.
Abstract: An innovative technique supporting accuracy control in compressed multidimensional data cubes is presented in this paper. The proposed technique can be efficiently used in QoA-based OLAP tools, where OLAP users/applications and DW servers are allowed to mediate on the accuracy of (approximate) answers, similarly to what happens in QoS-based systems for the quality of services. The compressed data structure KLSA, which implements the technique, is also extensively presented and discussed. We complement our analytical contributions with an experimental evaluation on several kinds of synthetic multidimensional data cubes, demonstrating the superiority of our approach in comparison with other similar techniques.