Flexible and efficient IR using array databases
Roberto Cornacchia, Sándor Héman, Marcin Zukowski, Arjen P. de Vries, Peter Boncz +4 more
- 01 Jan 2008
- Vol. 17, Iss: 1, pp 151-168
TL;DR: A set of array query optimization rules enables SRAM to automatically translate BM25 array queries into the relational equivalent of inverted list processing, including compression, score materialization and quantization, as employed by custom-built IR systems.
Abstract: The Matrix Framework is a recent proposal by Information Retrieval (IR) researchers to flexibly represent information retrieval models and concepts in a single multi-dimensional array framework. We provide computational support for exactly this framework with the array database system SRAM (Sparse Relational Array Mapping), which works on top of a DBMS. Information retrieval models can be specified in its comprehension-based array query language in a way that directly corresponds to the underlying mathematical formulas. SRAM efficiently stores sparse arrays in (compressed) relational tables, and translates array queries into relational queries and optimizes them. In this work, we describe a number of array query optimization rules. To demonstrate their effect on text retrieval, we apply them in the TREC TeraByte track (TREC-TB) efficiency task, using the Okapi BM25 model as our example. It turns out that these optimization rules enable SRAM to automatically translate the BM25 array queries into the relational equivalent of inverted list processing, including compression, score materialization and quantization, such as employed by custom-built IR systems. The use of the high-performance MonetDB/X100 relational backend, which provides transparent database compression, allows the system to achieve very fast response times with good precision and low resource usage.
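To make the idea of a sparse array stored as a relational table more concrete, the following is a minimal Python sketch, not SRAM's query language or its actual translation. It assumes a toy term-frequency array held as (term, doc, tf) rows, one row per non-zero cell, and scores documents with a common textbook form of Okapi BM25. The data, the function name bm25_scores, the parameter values k1 = 1.2 and b = 0.75, and the particular IDF variant are all illustrative assumptions; the paper may use a different formulation.

```python
import math
from collections import defaultdict

# Hypothetical toy data: the sparse array TF[term, doc] stored relationally,
# one (term, doc, tf) row per non-zero cell, each (term, doc) pair at most once.
TF_ROWS = [
    ("array", 0, 2), ("database", 0, 1),
    ("array", 1, 1), ("retrieval", 1, 3),
    ("database", 2, 1), ("retrieval", 2, 1), ("index", 2, 4),
    ("index", 3, 2),
]


def bm25_scores(tf_rows, query_terms, k1=1.2, b=0.75):
    """Score documents for a query with a textbook BM25 variant (assumed)."""
    doc_len = defaultdict(int)  # document length = sum of tf over its terms
    df = defaultdict(int)       # document frequency = rows per term
    for term, doc, tf in tf_rows:
        doc_len[doc] += tf
        df[term] += 1

    n_docs = len(doc_len)
    avg_len = sum(doc_len.values()) / n_docs

    wanted = set(query_terms)
    scores = defaultdict(float)
    # Only rows of query terms are touched, which mirrors walking the
    # inverted lists (posting lists) of those terms.
    for term, doc, tf in tf_rows:
        if term not in wanted:
            continue
        # Non-negative IDF variant; an assumption, other forms exist.
        idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        norm = tf + k1 * (1 - b + b * doc_len[doc] / avg_len)
        scores[doc] += idf * tf * (k1 + 1) / norm
    return dict(scores)


if __name__ == "__main__":
    ranked = sorted(bm25_scores(TF_ROWS, ["array", "index"]).items(),
                    key=lambda item: item[1], reverse=True)
    for doc, score in ranked:
        print(doc, round(score, 3))
```

Because only the non-zero cells are stored, selecting the rows of the query terms is the relational counterpart of reading those terms' inverted lists, which is the correspondence the abstract refers to.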
Citations
The TileDB array data storage manager
Stavros Papadopoulos, Kushal Datta, Samuel Madden, Timothy G. Mattson +3 more
- 01 Nov 2016
TL;DR: This work presents a novel storage manager for multi-dimensional arrays that arise in scientific applications, which is part of a larger scientific data management system called TileDB, and shows that TileDB delivers comparable performance to the HDF5 dense array storage manager, while providing much faster random writes.
Balancing vectorized query execution with bandwidth-optimized storage
M. Żukowski
- 01 Jan 2009
TL;DR: A new database system architecture is presented, realized in the MonetDB/X100 prototype, that combines a coherent set of new architecture-conscious techniques that are designed to work well together and achieves in-memory performance often one or two orders of magnitude higher than the existing approaches.
The OGC web coverage processing service (WCPS) standard
TL;DR: This contribution reports on the WCPS standard, introducing its coverage model and processing language and discussing its design rationales, background and relation to other OGC standards.
125
Vectorwise: Beyond Column Stores
Marcin Zukowski, Peter Boncz +1 more
TL;DR: This paper tells the story of Vectorwise, a high-performance analytical database system, from multiple perspectives: its history from academic project to commercial product, the evolution of its technical architecture, customer reactions to the product and its future research and development roadmap.
SciQL, a query language for science applications
Martin L. Kersten, Ying Zhang, Milena Ivanova, Niels Nes +3 more
- 25 Mar 2011
TL;DR: SciQL provides a seamless symbiosis of array-, set-, and sequence-interpretation using a clear separation of the mathematical object from its underlying implementation, and leads to a generalization of window-based query processing with wide applicability in science domains.
References
An algorithm for suffix stripping
TL;DR: An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL, and performs slightly better than a much more elaborate system with which it has been compared.
9.1K
A Method for the Construction of Minimum-Redundancy Codes
David A. Huffman
- 01 Sep 1952
TL;DR: A minimum-redundancy code is one constructed in such a way that the average number of coding digits per message is minimized.
6.1K
A language modeling approach to information retrieval
Jay Ponte, W. Bruce Croft +1 more
- 01 Aug 1998
TL;DR: It will be shown that probabilistic methods can be used to predict topic changes in the context of the task of new event detection and provide further proof of concept for the use of language models for retrieval tasks.
Relevance weighting of search terms
TL;DR: In this article, a series of relevance weighting functions is derived and justified by theoretical considerations; in particular, it is shown that specific weighted search methods are implied by a general probabilistic theory of retrieval.
2K