Conference
Quality in Databases
About: Quality in Databases is an academic conference. The conference publishes mainly in the areas of Data quality and Data warehouse. Over its lifetime, the conference has published 13 papers, which have received 134 citations.
Papers
Proceedings Article • 1 Jan 2007
TL;DR: This paper investigates the main characteristics and peculiarities of ETL processes and proposes a principled organization of test suites for the problem of experimenting with ETL scenarios.
Abstract: Extraction–Transform–Load (ETL) processes comprise complex data workflows, which are responsible for the maintenance of a Data Warehouse. Their practical importance is denoted by the fact that a plethora of ETL tools currently constitutes a multi-million dollar market. However, each one of them follows a different design and modeling technique and internal language. So far, the research community has not agreed upon the basic characteristics of ETL tools. Hence, there is a necessity for a unified way to assess ETL workflows. In this paper, we investigate the main characteristics and peculiarities of ETL processes and we propose a principled organization of test suites for the problem of experimenting with ETL scenarios.
36 citations
Proceedings Article • 1 Jan 2007
TL;DR: This work evaluates the accuracy of the similarity measures used in data cleaning and integration methodologies based on q-grams and thoroughly compares their accuracy on several datasets with different characteristics.
Abstract: Approximate join is an important part of many data cleaning and integration methodologies. Various similarity measures have been proposed for accurate and efficient matching of string attributes. The accuracy of the similarity measures highly depends on the characteristics of the data such as the amount and type of the errors and length of the strings. Recently, there has been an increasing interest in using methods based on q-grams (substrings of length q) made out of the strings, mainly due to their high efficiency. In this work, we evaluate the accuracy of the similarity measures used in these methodologies. We present an overview of several similarity measures based on q-grams. We then thoroughly compare their accuracy on several datasets with different characteristics. Since the efficiency of approximate joins depends on the similarity threshold they use, we study how the value of the threshold (including values used in recent performance studies) affects the accuracy of the join. We also compare different measures based on the highest accuracy they can achieve on different datasets.
23 citations
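To make the q-gram idea in the abstract above concrete, here is a minimal sketch of one q-gram based similarity measure (Jaccard overlap of q-gram sets) and a naive threshold join. The function names, the `#` padding convention, and the default threshold are illustrative assumptions, not details taken from the paper, which compares several measures.

```python
def qgrams(s: str, q: int = 2) -> set[str]:
    """Return the set of q-grams (substrings of length q) of s.

    Assumption: the string is lowercased and padded with q-1 '#'
    characters on each side so that prefixes and suffixes also count.
    """
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a: str, b: str, q: int = 2) -> float:
    """Jaccard similarity of the q-gram sets of a and b, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0  # two strings with no q-grams are treated as identical
    return len(ga & gb) / len(ga | gb)


def approximate_join(left, right, threshold: float = 0.5, q: int = 2):
    """Naive nested-loop approximate join: keep pairs at or above threshold."""
    return [(l, r) for l in left for r in right if jaccard(l, r, q) >= threshold]


if __name__ == "__main__":
    print(approximate_join(["smith"], ["smyth", "jones"], threshold=0.5))
```

Raising the threshold makes the join stricter and shrinks the result; lowering it admits more candidate pairs, including more false matches. This is the accuracy trade-off the abstract says it studies.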
Proceedings Article • 1 Jan 2008
TL;DR: A class of programs that appear to the user as if they are running in a single world rather than on a set of possible worlds is studied, and an algorithm for efficiently verifying this property is presented.
Abstract: We study database application programming interfaces for uncertain and probabilistic databases and present a programming model that is independent of representation details. Conceptually, we use the possible worlds semantics, and programs are independently evaluated in each world. We study a class of programs that appear to the user as if they are running in a single world rather than on a set of possible worlds. We present an algorithm for efficiently verifying this property. We discuss how updates can be implemented in uncertain database management systems, and propose techniques for optimizing database programs.
12 citations
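The possible worlds semantics mentioned in the abstract above can be illustrated with a small, deliberately naive sketch: enumerate every world, run the same program in each, and sum the probability of each distinct result. The tuple-independent representation and the names `possible_worlds` and `evaluate` are assumptions made for illustration. The paper's programming model is stated to be independent of representation details, and this is not its verification algorithm.

```python
from itertools import product


def possible_worlds(uncertain_rows):
    """Yield (world, probability) pairs.

    Assumption: uncertain_rows is a list of (row, p) pairs, where each row
    exists independently with probability p. Enumeration is exponential in
    the number of rows, so this is only suitable for tiny examples.
    """
    for mask in product([True, False], repeat=len(uncertain_rows)):
        world = []
        prob = 1.0
        for (row, p), keep in zip(uncertain_rows, mask):
            if keep:
                world.append(row)
                prob *= p
            else:
                prob *= 1.0 - p
        yield world, prob


def evaluate(program, uncertain_rows):
    """Run program independently in each world; return result -> probability."""
    distribution = {}
    for world, prob in possible_worlds(uncertain_rows):
        result = program(world)  # result must be hashable
        distribution[result] = distribution.get(result, 0.0) + prob
    return distribution


if __name__ == "__main__":
    rows = [("a", 0.5), ("b", 0.5)]
    print(evaluate(len, rows))  # distribution over the row count
```

The program passed to `evaluate` (here `len`) is written as if it saw one ordinary list of rows; only the caller deals with the set of worlds.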
Proceedings Article • 1 Jan 2012
TL;DR: This work proposes an approach to accelerate the online maintenance of knowledge bases, called LAIKA, based on link prediction, which constructs a large graph from the available input and uses link-overlap measures and random-walk techniques to generate missing links and rank them for recommendations.
Abstract: Knowledge-sharing communities like Wikipedia and knowledge bases like Freebase are expected to capture the latest facts about the real world. However, neither of these can keep pace with the rate at which events happen and new knowledge is reported in news and social media. To narrow this gap, we propose an approach to accelerate the online maintenance of knowledge bases. Our method, called LAIKA, is based on link prediction. Wikipedia editions in different languages, Wikinews, and other news media come with extensive but noisy interlinkage at the entity level. We utilize this input for recommending, for a given Wikipedia article or knowledge-base entry, new categories, related entities, and cross-lingual interwiki links. LAIKA constructs a large graph from the available input and uses link-overlap measures and random-walk techniques to generate missing links and rank them for recommendations. Experiments with a very large graph from multilingual Wikipedia editions demonstrate the accuracy of our link predictions.
9 citations
Proceedings Article • 1 Jan 2008
TL;DR: This paper describes an initial data architecture to support and validate the domain-specific families of completeness measures that users can choose from, and shows how dimensional completeness measures can be supported in practice by extending the Quality View model.
Abstract: Completeness is a well-understood dimension of data quality. In particular, measures of coverage can be used to assess the completeness of a data source, relative to some universe, for instance a collection of reference databases. We observe that this definition is inherently and implicitly multidimensional: in principle, one can compute measures of coverage that are expressed as a combination of subsets of the attributes in the data source schema. This generalization can be useful in several application domains, notably in the life sciences. This leads to the idea of domain-specific families of completeness measures that users can choose from. Furthermore, individuals in the family can be specified as OLAP-type queries on a dimensional schema. In this paper we describe an initial data architecture to support and validate the idea, and show how dimensional completeness measures can be supported in practice by extending the Quality View model [11].
7 citations
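The notion of coverage over a chosen subset of attributes, as described in the abstract above, can be sketched as follows. This is a minimal illustration under stated assumptions: rows are plain dictionaries, coverage is the fraction of distinct reference value combinations that also appear in the source, and the `species`/`protein` attributes and the function name `coverage` are hypothetical. It does not reproduce the paper's Quality View extension or its OLAP formulation.

```python
def coverage(source, reference, attributes):
    """Fraction of distinct reference combinations over `attributes`
    that also occur in `source`.

    Assumption: source and reference are lists of dicts sharing the
    given attribute names; the reference plays the role of the universe.
    """
    def project(rows):
        return {tuple(row[a] for a in attributes) for row in rows}

    universe = project(reference)
    if not universe:
        raise ValueError("reference has no rows; coverage is undefined")
    return len(project(source) & universe) / len(universe)


if __name__ == "__main__":
    reference = [
        {"species": "human", "protein": "P1"},
        {"species": "human", "protein": "P2"},
        {"species": "mouse", "protein": "P1"},
        {"species": "mouse", "protein": "P3"},
    ]
    source = [
        {"species": "human", "protein": "P1"},
        {"species": "mouse", "protein": "P3"},
    ]
    for attrs in (["species"], ["protein"], ["species", "protein"]):
        print(attrs, coverage(source, reference, attrs))
```

The same source scores differently depending on which attributes define the measure, which is the sense in which the abstract calls coverage multidimensional.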
Performance Metrics
| Year | Papers |
|---|---|
| 2016 | 1 |
| 2012 | 1 |
| 2009 | 1 |
| 2008 | 2 |
| 2007 | 8 |