Proceedings Article · DOI: 10.1145/1007568.1007650
Information-theoretic tools for mining database structure from large data sets
Periklis Andritsos,Renée J. Miller,Panayiotis Tsaparas +2 more
- 13 Jun 2004
- pp 731-742
TL;DR: This work considers the problem of doing data redesign in an environment where the prescribed model is unknown or incomplete, and proposes a set of information-theoretic tools for finding structural summaries that are useful in characterizing the information content of the data, and ultimately useful in data design.
Abstract: Data design has been characterized as a process of arriving at a design that maximizes the information content of each piece of data (or equivalently, one that minimizes redundancy). Information content (or redundancy) is measured with respect to a prescribed model for the data, a model that is often expressed as a set of constraints. In this work, we consider the problem of doing data redesign in an environment where the prescribed model is unknown or incomplete. Specifically, we consider the problem of finding structural clues in an instance of data, an instance which may contain errors, missing values, and duplicate records. We propose a set of information-theoretic tools for finding structural summaries that are useful in characterizing the information content of the data, and ultimately useful in data design. We provide algorithms for creating these summaries over large, categorical data sets. We study the use of these summaries in one specific physical design task, that of ranking functional dependencies based on their data redundancy. We show how our ranking can be used by a physical data-design tool to find good vertical decompositions of a relation (decompositions that improve the information content of the design). We present an evaluation of the approach on real data sets.
Citations
Clean Answers over Dirty Databases: A Probabilistic Approach
Periklis Andritsos,Ariel Fuxman,Renée J. Miller +2 more
- 03 Apr 2006
TL;DR: This work rewrites queries over a database containing duplicates to return each answer with the probability that the answer is in the clean database, and experimentally studies the performance of the rewritten queries.
Query by output
Quoc Trung Tran,Chee-Yong Chan,Srinivasan Parthasarathy +2 more
- 29 Jun 2009
TL;DR: This paper presents a novel data-driven approach, called Query By Output (QBO), which can enhance the usability of database systems, designs several optimization techniques to reduce processing overhead, and introduces a set of criteria to rank output queries by various notions of utility.
eTuner: tuning schema matching software using synthetic scenarios
Yoonkyong Lee,Mayssam Sayyadian,AnHai Doan,Arnon Rosenthal +3 more
- 25 Jan 2007
TL;DR: This paper describes eTuner, an approach to automatically tuning schema matching systems; the tuned systems achieve higher accuracy than the same systems tuned with currently available methods.
Patent
User interface for facts query engine with snippets from information sources that include query terms and answer terms
Andrew Hogue
- 24 Mar 2006
TL;DR: A method and a system are disclosed for providing snippets of the source documents of an answer to a fact query, along with the Uniform Resource Locators (URLs) of those source documents.
A call to arms: revisiting database design
Antonio Badia,Daniel Lemire +1 more
- 17 Nov 2011
TL;DR: The thesis is that database design remains a critical unsolved problem and should be the subject of more research; the authors put forth arguments to support this viewpoint.
References
•Book
Elements of information theory
Thomas M. Cover,Joy A. Thomas +1 more
- 01 Jan 1991
TL;DR: The author examines the role of entropy, inequality, and randomness in the design of codes and the construction of codes in the rapidly changing environment.
•Book
Computers and Intractability: A Guide to the Theory of NP-Completeness
Michael Randolph Garey,David S. Johnson +1 more
- 01 Jan 1979
TL;DR: The second edition of a quarterly column as discussed by the authors provides a continuing update to the list of problems (NP-complete and harder) presented by M. R. Garey and myself in our book "Computers and Intractability: A Guide to the Theory of NP-Completeness," W. H. Freeman & Co., San Francisco, 1979.
Mining association rules between sets of items in large databases
Rakesh Agrawal,Tomasz Imielinski,Arun N. Swami +2 more
- 01 Jun 1993
TL;DR: An efficient algorithm is presented that generates all significant association rules between items in the database of customer transactions and incorporates buffer management and novel estimation and pruning techniques.