Quality in Databases

Papers

Proceedings Article

Towards a Benchmark for ETL Workflows.

Panos Vassiliadis, Anastasios Karagiannis, Vasiliki Tziovara, Alkis Simitsis

1 Jan 2007

TL;DR: This paper investigates the main characteristics and peculiarities of ETL processes and proposes a principled organization of test suites for the problem of experimenting with ETL scenarios.

Abstract: Extraction–Transform–Load (ETL) processes comprise complex data workflows, which are responsible for the maintenance of a Data Warehouse. Their practical importance is denoted by the fact that a plethora of ETL tools currently constitutes a multi-million dollar market. However, each one of them follows a different design and modeling technique and internal language. So far, the research community has not agreed upon the basic characteristics of ETL tools. Hence, there is a necessity for a unified way to assess ETL workflows. In this paper, we investigate the main characteristics and peculiarities of ETL processes and we propose a principled organization of test suites for the problem of experimenting with ETL scenarios.

36 citations

Proceedings Article

Accuracy of Approximate String Joins Using Grams

Oktie Hassanzadeh¹, Mohammad Sadoghi¹, Renée J. Miller¹

University of Toronto¹

1 Jan 2007

TL;DR: This work evaluates the accuracy of the similarity measures used in data cleaning and integration methodologies based on q-grams and thoroughly compares their accuracy on several datasets with different characteristics.

Abstract: Approximate join is an important part of many data cleaning and integration methodologies. Various similarity measures have been proposed for accurate and efficient matching of string attributes. The accuracy of the similarity measures highly depends on the characteristics of the data such as the amount and type of the errors and length of the strings. Recently, there has been an increasing interest in using methods based on q-grams (substrings of length q) made out of the strings, mainly due to their high efficiency. In this work, we evaluate the accuracy of the similarity measures used in these methodologies. We present an overview of several similarity measures based on q-grams. We then thoroughly compare their accuracy on several datasets with different characteristics. Since the efficiency of approximate joins depends on the similarity threshold they use, we study how the value of the threshold (including values used in recent performance studies) affects the accuracy of the join. We also compare different measures based on the highest accuracy they can achieve on different datasets.
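To make the idea of q-gram similarity and a join threshold concrete, here is a minimal Python sketch. It assumes Jaccard similarity over padded q-gram sets as the measure, which is only one of the family of measures the abstract alludes to, and it uses a naive nested-loop join. The function names (qgrams, jaccard, approximate_join) and the padding characters are illustrative assumptions, not taken from the paper.

```python
def qgrams(s, q=2):
    """Return the set of q-grams of s (lowercased).

    The string is padded with '#' and '$' (an assumed convention) so that
    the first and last characters take part in as many grams as the others.
    """
    padded = "#" * (q - 1) + s.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b, q=2):
    """Jaccard similarity of the q-gram sets of two strings."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0  # two empty gram sets are treated as identical
    return len(ga & gb) / len(ga | gb)


def approximate_join(left, right, threshold, q=2):
    """Naive nested-loop join keeping pairs whose similarity meets the threshold."""
    return [(l, r) for l in left for r in right if jaccard(l, r, q) >= threshold]


if __name__ == "__main__":
    names_a = ["Smith"]
    names_b = ["Smyth", "Jones"]
    # A lower threshold admits the misspelled match; a higher one rejects it.
    print(approximate_join(names_a, names_b, threshold=0.5))
    print(approximate_join(names_a, names_b, threshold=0.6))
```

Raising the threshold makes the join cheaper to filter but can drop true matches that contain errors, which is the accuracy trade-off the abstract describes.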

23 citations

Proceedings Article

On APIs for probabilistic databases

Lyublena Antova, Christoph Koch¹

Max Planck Society¹

1 Jan 2008

TL;DR: A class of programs that appear to the user as if they are running in a single world rather than on a set of possible worlds is studied, and an algorithm for efficiently verifying this property is presented.

Abstract: We study database application programming interfaces for uncertain and probabilistic databases and present a programming model that is independent of representation details. Conceptually, we use the possible worlds semantics, and programs are independently evaluated in each world. We study a class of programs that appear to the user as if they are running in a single world rather than on a set of possible worlds. We present an algorithm for efficiently verifying this property. We discuss how updates can be implemented in uncertain database management systems, and propose techniques for optimizing database programs.
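The possible worlds semantics can be illustrated with a small Python sketch. It assumes a toy uncertain database in which each key has a list of alternative values, and it interprets "appears to run in a single world" in one simple way: the program returns the same result in every world. Both the representation and that interpretation are assumptions made for illustration. The brute-force check below enumerates every world, so it is not the efficient verification algorithm the paper presents.

```python
from itertools import product


def possible_worlds(uncertain_db):
    """Yield every world: each key takes exactly one of its alternatives."""
    keys = sorted(uncertain_db)
    for choice in product(*(uncertain_db[k] for k in keys)):
        yield dict(zip(keys, choice))


def evaluate_in_all_worlds(program, uncertain_db):
    """Run the program independently in each possible world."""
    return [program(world) for world in possible_worlds(uncertain_db)]


def same_result_in_every_world(program, uncertain_db):
    """Brute-force check (exponential in the number of uncertain keys)."""
    results = evaluate_in_all_worlds(program, uncertain_db)
    return all(r == results[0] for r in results)


if __name__ == "__main__":
    # Hypothetical data: alice's age is uncertain, bob's is not.
    db = {"alice_age": [30, 31], "bob_age": [25]}
    print(same_result_in_every_world(lambda w: w["bob_age"], db))
    print(same_result_in_every_world(lambda w: w["alice_age"], db))
```

A program that only reads certain data gives the same answer in all worlds, while one that reads an uncertain value does not.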

12 citations

Proceedings Article

Cross-lingual Data Quality for Knowledge Base Acceleration Across Wikipedia Editions

Julianna Göbölös-Szabó¹, Natalia Prytkova², Marc Spaniol², Gerhard Weikum²

Hungarian Academy of Sciences¹, Max Planck Society²

1 Jan 2012

TL;DR: This work proposes an approach to accelerate the online maintenance of knowledge bases, called LAIKA, based on link prediction, which constructs a large graph from the available input and uses link-overlap measures and random-walk techniques to generate missing links and rank them for recommendations.

Abstract: Knowledge-sharing communities like Wikipedia and knowledge bases like Freebase are expected to capture the latest facts about the real world. However, neither of these can keep pace with the rate at which events happen and new knowledge is reported in news and social media. To narrow this gap, we propose an approach to accelerate the online maintenance of knowledge bases. Our method, called LAIKA, is based on link prediction. Wikipedia editions in different languages, Wikinews, and other news media come with extensive but noisy interlinkage at the entity level. We utilize this input for recommending, for a given Wikipedia article or knowledge-base entry, new categories, related entities, and cross-lingual interwiki links. LAIKA constructs a large graph from the available input and uses link-overlap measures and random-walk techniques to generate missing links and rank them for recommendations. Experiments with a very large graph from multilingual Wikipedia editions demonstrate the accuracy of our link predictions.
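As a rough illustration of link prediction by link overlap, the Python sketch below scores candidate links by the Jaccard overlap of two nodes' neighbor sets and ranks them. The abstract does not say which overlap measures LAIKA uses, so the measure, the adjacency-dictionary representation, and the function names are all assumptions. The random-walk component is not covered here.

```python
def neighbor_overlap(graph, u, v):
    """Jaccard overlap of the neighbor sets of u and v (0.0 if both are empty)."""
    nu, nv = graph.get(u, set()), graph.get(v, set())
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0


def recommend_links(graph, source, top_k=3):
    """Rank nodes not yet linked from source by neighbor overlap."""
    linked = graph.get(source, set())
    scored = []
    for candidate in graph:
        if candidate == source or candidate in linked:
            continue
        score = neighbor_overlap(graph, source, candidate)
        if score > 0:
            scored.append((score, candidate))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda sc: (-sc[0], sc[1]))
    return [candidate for _, candidate in scored[:top_k]]


if __name__ == "__main__":
    # Hypothetical undirected entity graph as an adjacency dictionary.
    graph = {
        "A": {"B", "C"},
        "B": {"A", "C", "D"},
        "C": {"A", "B", "D"},
        "D": {"B", "C"},
        "E": {"F"},
        "F": {"E"},
    }
    print(recommend_links(graph, "A"))
```

In this toy graph, A and D share all their neighbors, so D is the only recommended missing link for A.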

9 citations

Proceedings Article

Model-Driven Component Generation for Families of Completeness

Nurul A. Emran¹, Suzanne M. Embury, Paolo Missier

Universiti Teknikal Malaysia Melaka¹

1 Jan 2008

TL;DR: This paper describes an initial data architecture to support and validate the domain-specific families of completeness measures that users can choose from, and shows how dimensional completeness measures can be supported in practice by extending the Quality View model.

Abstract: Completeness is a well-understood dimension of data quality. In particular, measures of coverage can be used to assess the completeness of a data source, relative to some universe, for instance a collection of reference databases. We observe that this definition is inherently and implicitly multidimensional: in principle, one can compute measures of coverage that are expressed as a combination of a subset of the attributes in the data source schema. This generalization can be useful in several application domains, notably in the life sciences. This leads to the idea of domain-specific families of completeness measures that users can choose from. Furthermore, individuals in the family can be specified as OLAP-type queries on a dimensional schema. In this paper we describe an initial data architecture to support and validate the idea, and show how dimensional completeness measures can be supported in practice by extending the Quality View model [11].
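The multidimensional reading of coverage can be sketched in a few lines of Python. The sketch assumes coverage is the fraction of distinct reference values, projected onto a chosen subset of attributes, that also occur in the source. The row format, the attribute names, and the coverage function are hypothetical and are not the paper's Quality View architecture.

```python
def coverage(source_rows, reference_rows, attributes):
    """Fraction of distinct reference tuples (projected on attributes)
    that also appear in the source. Returns 0.0 for an empty reference."""

    def project(row):
        return tuple(row[a] for a in attributes)

    reference = {project(r) for r in reference_rows}
    if not reference:
        return 0.0
    source = {project(r) for r in source_rows}
    return len(source & reference) / len(reference)


if __name__ == "__main__":
    # Hypothetical life-sciences style data.
    reference = [
        {"gene": "g1", "species": "human"},
        {"gene": "g2", "species": "human"},
        {"gene": "g1", "species": "mouse"},
        {"gene": "g3", "species": "mouse"},
    ]
    source = [
        {"gene": "g1", "species": "human"},
        {"gene": "g3", "species": "mouse"},
    ]
    # Each attribute subset yields a different member of the family.
    for attrs in (("gene",), ("species",), ("gene", "species")):
        print(attrs, coverage(source, reference, attrs))
```

The same source is fully complete with respect to species but only half complete with respect to gene and species pairs, which is why a family of measures is more informative than a single number.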

7 citations
