Information-theoretic tools for mining database structure from large data sets

doi:10.1145/1007568.1007650

Proceedings Article•10.1145/1007568.1007650

Periklis Andritsos, +2 more

- 13 Jun 2004

- pp 731-742

88

TL;DR: This work considers the problem of doing data redesign in an environment where the prescribed model is unknown or incomplete, and proposes a set of information-theoretic tools for finding structural summaries that are useful in characterizing the information content of the data, and ultimately useful in data design.

Abstract: Data design has been characterized as a process of arriving at a design that maximizes the information content of each piece of data (or equivalently, one that minimizes redundancy). Information content (or redundancy) is measured with respect to a prescribed model for the data, a model that is often expressed as a set of constraints. In this work, we consider the problem of doing data redesign in an environment where the prescribed model is unknown or incomplete. Specifically, we consider the problem of finding structural clues in an instance of data, an instance which may contain errors, missing values, and duplicate records. We propose a set of information-theoretic tools for finding structural summaries that are useful in characterizing the information content of the data, and ultimately useful in data design. We provide algorithms for creating these summaries over large, categorical data sets. We study the use of these summaries in one specific physical design task, that of ranking functional dependencies based on their data redundancy. We show how our ranking can be used by a physical data-design tool to find good vertical decompositions of a relation (decompositions that improve the information content of the design). We present an evaluation of the approach on real data sets.
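One way to make the idea of scoring functional dependencies against a data instance concrete is through conditional entropy. The sketch below is only an illustration under stated assumptions, not the algorithm from the paper: it uses the standard identity H(Y|X) = H(X,Y) - H(X), which is zero exactly when the dependency X -> Y holds in the instance and grows as the instance violates it. The toy relation, the attribute names, and the helper `rank_candidates` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(rows, attrs):
    """Shannon entropy (in bits) of the rows projected onto attrs."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def conditional_entropy(rows, lhs, rhs):
    """H(rhs | lhs) = H(lhs, rhs) - H(lhs); zero iff lhs -> rhs holds in rows."""
    return entropy(rows, list(lhs) + list(rhs)) - entropy(rows, list(lhs))


def rank_candidates(rows, candidates):
    """Hypothetical helper: order candidate dependencies, closest-to-exact first."""
    scored = [(conditional_entropy(rows, lhs, rhs), lhs, rhs) for lhs, rhs in candidates]
    return sorted(scored, key=lambda item: item[0])


# Toy categorical relation (made up for illustration).
rows = [
    {"emp": "ann", "dept": "sales", "city": "Toronto"},
    {"emp": "bob", "dept": "sales", "city": "Toronto"},
    {"emp": "cat", "dept": "ops", "city": "Boston"},
    {"emp": "dan", "dept": "ops", "city": "Athens"},
]

candidates = [(("dept",), ("city",)), (("emp",), ("dept",))]
for score, lhs, rhs in rank_candidates(rows, candidates):
    print(f"{lhs} -> {rhs}: H = {score:.3f} bits")
```

In this toy instance `emp -> dept` holds exactly, while `dept -> city` is violated by the two `ops` rows, so it receives a positive score. A score of this kind says how nearly a dependency holds, which is a different question from how much redundancy it captures, so it should not be read as the ranking the abstract describes.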

Citations

Proceedings Article•10.1109/ICDE.2006.35

Clean Answers over Dirty Databases: A Probabilistic Approach

Periklis Andritsos, +2 more

- 03 Apr 2006

TL;DR: This work rewrites queries over a database containing duplicates so that each answer is returned with the probability that it is in the clean database, and experimentally studies the performance of the rewritten queries.

210

Proceedings Article•10.1145/1559845.1559902

Query by output

Quoc Trung Tran, +2 more

- 29 Jun 2009

TL;DR: This paper presents a novel data-driven approach, called Query By Output (QBO), which can enhance the usability of database systems; it designs several optimization techniques to reduce processing overhead and introduces a set of criteria to rank output queries by various notions of utility.

162

Journal Article•10.1007/S00778-006-0024-Z

eTuner: tuning schema matching software using synthetic scenarios

Yoonkyong Lee, +3 more

- 25 Jan 2007

TL;DR: This work describes eTuner, an approach to automatically tuning schema matching systems; the tuned systems achieve higher accuracy than the same systems tuned with currently possible methods.

156

Patent

User interface for facts query engine with snippets from information sources that include query terms and answer terms

Andrew Hogue

- 24 Mar 2006

TL;DR: A method and a system are disclosed for providing snippets of the source documents of an answer to a fact query, along with the Uniform Resource Locators (URLs) of those documents.

124

Journal Article•10.1145/2070736.2070750

A call to arms: revisiting database design

Antonio Badia, +1 more

- 17 Nov 2011

TL;DR: The thesis is that database design remains a critical unsolved problem and should be the subject of more research; the authors put forth arguments to support this viewpoint.

106
