Compressed data structure

Topic Tools

Papers published on a yearly basis

Papers

Proceedings Article•10.5555/644108.644250•

High-order entropy-compressed text indexes

Roberto Grossi¹, Ankur Gupta², Jeffrey Scott Vitter³•Institutions (3)

University of Pisa¹, Durham University², Purdue University³

12 Jan 2003

TL;DR: A novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet σ, where each symbol is encoded by lg|σ| bits.

Abstract: We present a novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet σ, where each symbol is encoded by lg|σ| bits. We show that compressed suffix arrays use just nH_h + σ bits, while retaining full text indexing functionalities, such as searching any pattern sequence of length m in O(m lg|σ| + polylog(n)) time. The term H_h ≤ lg|σ| denotes the hth-order empirical entropy of the text, which means that our index is nearly optimal in space apart from lower-order terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant 1). If the text is highly compressible so that H_h = o(1) and the alphabet size is small, we obtain a text index with o(m) search time that requires only o(n) bits. Further results and tradeoffs are reported in the paper.

900 citations
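
The hth-order empirical entropy that bounds the index size in the abstract above can be computed directly for small inputs. The sketch below assumes the standard definition (the average zeroth-order entropy of the symbols that follow each length-h context), which the listing itself does not spell out; the function names `h0` and `empirical_entropy` are illustrative only.

```python
from collections import Counter, defaultdict
from math import log2


def h0(counts):
    """Zeroth-order empirical entropy (bits per symbol) of a symbol-count table."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return sum(c / n * log2(n / c) for c in counts.values())


def empirical_entropy(text, h):
    """h-th order empirical entropy of text, in bits per symbol.

    Assumed definition: (1/n) * sum over contexts w of |T_w| * H_0(T_w),
    where T_w collects the symbols that follow each occurrence of w.
    """
    if not text:
        return 0.0
    if h == 0:
        return h0(Counter(text))
    followers = defaultdict(Counter)
    for i in range(h, len(text)):
        followers[text[i - h:i]][text[i]] += 1
    total = sum(sum(c.values()) * h0(c) for c in followers.values())
    return total / len(text)
```

For the text "abababab", the zeroth-order entropy is 1 bit per symbol, while the first-order entropy is 0 because each symbol is fully determined by its predecessor. This is the kind of highly compressible input for which the abstract claims an index of o(n) bits.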

Journal Article•10.1145/1082036.1082039•

Indexing compressed text

Paolo Ferragina¹, Giovanni Manzini²•Institutions (2)

University of Pisa¹, University of Eastern Piedmont²

01 Jul 2005-Journal of the ACM

TL;DR: Two compressed data structures for the full-text indexing problem are designed that support efficient substring searches using roughly the space required for storing the text in compressed form; the design exploits the interplay between two compressors, the Burrows-Wheeler Transform and the LZ78 algorithm.

Abstract: We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form. Our first compressed data structure retrieves the occ occurrences of a pattern P[1,p] within a text T[1,n] in O(p + occ log^(1+ε) n) time for any chosen ε, 0

752 citations
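
Substring search over the Burrows-Wheeler Transform is usually explained through backward search, which narrows a range of sorted suffixes one pattern symbol at a time. The sketch below is an uncompressed illustration of that idea, not the paper's data structure: it builds the transform by naive suffix sorting and answers rank queries by scanning, whereas a real index would use compressed rank structures. The names `bwt` and `count_occurrences`, and the `$` sentinel (assumed to be absent from the text and smaller than every text symbol), are assumptions made for the example.

```python
from collections import Counter


def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform via naive suffix sorting (quadratic, demo only)."""
    s = text + sentinel
    suffix_order = sorted(range(len(s)), key=lambda i: s[i:])
    # The symbol preceding each sorted suffix; index -1 wraps to the sentinel.
    return "".join(s[i - 1] for i in suffix_order)


def count_occurrences(bwt_str, pattern):
    """Count occurrences of pattern using backward search over the BWT."""
    counts = Counter(bwt_str)
    # first[c] = number of symbols in the text strictly smaller than c.
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt_str)  # current half-open range of sorted suffixes
    for c in reversed(pattern):
        if c not in first:
            return 0
        # Rank queries done by scanning; a compressed index answers these quickly.
        lo = first[c] + bwt_str[:lo].count(c)
        hi = first[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo
```

Calling `count_occurrences(bwt("mississippi"), "ssi")` counts the matches without ever consulting the original text, which is what lets such an index replace the text rather than accompany it.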

Proceedings Article•10.1145/3077136.3080780•

Faster BlockMax WAND with Variable-sized Blocks

Antonio Mallia¹, Giuseppe Ottaviano², Elia Porciani¹, Nicola Tonellotto², Rossano Venturini¹•Institutions (2)

University of Pisa¹, Istituto di Scienza e Tecnologie dell'Informazione²

7 Aug 2017

TL;DR: This work sets up the problem of deciding the block partitioning as an optimization problem which maximizes how accurately the block upper bounds represent the underlying scores, and describes an efficient algorithm to find an approximate solution, with provable approximation guarantees.

Abstract: Query processing is one of the main bottlenecks in large-scale search engines. Retrieving the top k most relevant documents for a given query can be extremely expensive, as it involves scoring large amounts of documents. Several dynamic pruning techniques have been introduced in the literature to tackle this problem, such as BlockMaxWAND, which splits the inverted index into constant-sized blocks and stores the maximum document-term scores per block; this information can be used during query execution to safely skip low-score documents, producing many-fold speedups over exhaustive methods. We introduce a refinement for BlockMaxWAND that uses variable-sized blocks, rather than constant-sized. We set up the problem of deciding the block partitioning as an optimization problem which maximizes how accurately the block upper bounds represent the underlying scores, and describe an efficient algorithm to find an approximate solution, with provable approximation guarantees. Through an extensive experimental analysis we show that our method significantly outperforms the state of the art by a factor of roughly 2×. We also introduce a compressed data structure to represent the additional block information, providing a compression ratio of roughly 50%, while incurring only a small speed degradation, no more than 10% with respect to its uncompressed counterpart.

63 citations
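
The block upper bounds described in the abstract above can be illustrated with a single posting list. The sketch below is a deliberate simplification and not the BlockMaxWAND algorithm: it uses constant-sized blocks, one term, and a fixed score threshold, and it only shows how a stored per-block maximum lets whole blocks be skipped safely. The names `block_maxes` and `candidates_above` and the sample scores are hypothetical.

```python
def block_maxes(scores, block_size):
    """Maximum score of each constant-sized block of a posting list."""
    return [max(scores[i:i + block_size])
            for i in range(0, len(scores), block_size)]


def candidates_above(scores, block_size, threshold):
    """Return (hits, scored): postings beating threshold and postings examined.

    A block is skipped entirely when its stored maximum cannot exceed the
    threshold, so none of its postings are scored.
    """
    hits, scored = [], 0
    for b, upper in enumerate(block_maxes(scores, block_size)):
        if upper <= threshold:
            continue  # safe skip: no posting in this block can qualify
        start = b * block_size
        for i in range(start, min(start + block_size, len(scores))):
            scored += 1
            if scores[i] > threshold:
                hits.append(i)
    return hits, scored


# Hypothetical per-document scores for one term.
scores = [0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.2, 0.1, 0.8, 0.1]
hits, scored = candidates_above(scores, block_size=4, threshold=0.5)
```

Here the middle block has a maximum of 0.2 and is skipped, so only 6 of the 10 postings are examined. A tighter fit between block maxima and the underlying scores means more skipped work, which is the quantity the variable-sized partitioning in the paper aims to optimize.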

Proceedings Article•10.1145/1516360.1516448•

LCS-Hist: taming massive high-dimensional data cube compression

Alfredo Cuzzocrea¹, Paolo Serafino¹•Institutions (1)

University of Calabria¹

24 Mar 2009

TL;DR: This paper proposes LCS-Hist, an innovative multidimensional histogram devising a complex methodology that combines intelligent data modeling and processing techniques in order to tame the annoying problem of compressing massive high-dimensional data cubes.

Abstract: The problem of efficiently compressing massive high-dimensional data cubes still waits for efficient solutions capable of overcoming well-recognized scalability limitations of state-of-the-art histogram-based techniques, which perform well on small-in-size low-dimensional data cubes, whereas their performance in both representing the input data domain and efficiently supporting approximate query answering against the generated compressed data structure decreases dramatically when data cubes grow in dimension number and size. To overcome this relevant research challenge, in this paper we propose LCS-Hist, an innovative multidimensional histogram devising a complex methodology that combines intelligent data modeling and processing techniques in order to tame the annoying problem of compressing massive high-dimensional data cubes. With respect to similar histogram-based proposals, our technique introduces (i) a surprising consumption of the storage space available to house the compressed representation of the input data cube, and (ii) a superior scalability on high-dimensional data cubes. Finally, several experimental results performed against various classes of data cubes confirm the advantages of LCS-Hist, even in comparison with those given by state-of-the-art similar techniques.

62 citations

Proceedings Article•10.1109/SSDBM.2006.10•

Accuracy Control in Compressed Multidimensional Data Cubes for Quality of Answer-based OLAP Tools

Alfredo Cuzzocrea

3 Jul 2006

TL;DR: The proposed technique can be efficiently used in QoA-based OLAP tools, where OLAP users/applications and DW servers are allowed to mediate on the accuracy of (approximate) answers, similarly to what happens in QoS-based systems for the quality of services.

Abstract: An innovative technique supporting accuracy control in compressed multidimensional data cubes is presented in this paper. The proposed technique can be efficiently used in QoA-based OLAP tools, where OLAP users/applications and DW servers are allowed to mediate on the accuracy of (approximate) answers, similarly to what happens in QoS-based systems for the quality of services. The compressed data structure KLSA, which implements the technique, is also extensively presented and discussed. We complement our analytical contributions with an experimental evaluation on several kinds of synthetic multidimensional data cubes, demonstrating the superiority of our approach in comparison with other similar techniques.

57 citations

Year	Papers
2021	6
2020	7
2019	9
2018	2
2017	7
2016	6

Related Topics (5)

Performance Metrics