Journal Article, DOI: 10.14778/2983200.2983203
Distributed data deduplication
Xu Chu, Ihab F. Ilyas, Paraschos Koutris
- 01 Jul 2016
- Vol. 9, Iss: 11, pp 864-875
TL;DR: This paper shows how to further speed up data deduplication by leveraging parallelism in a shared-nothing computing environment and proposes a distribution strategy, called Dis-Dedup, that minimizes the maximum workload across all worker nodes and provides strong theoretical guarantees.
Abstract: Data deduplication refers to the process of identifying tuples in a relation that refer to the same real-world entity. The complexity of the problem is inherently quadratic with respect to the number of tuples, since a similarity value must be computed for every pair of tuples. To avoid comparing tuple pairs that are obviously non-duplicates, blocking techniques are used to divide the tuples into blocks, and only tuples within the same block are compared. However, even with the use of blocking, data deduplication remains a costly problem for large datasets. In this paper, we show how to further speed up data deduplication by leveraging parallelism in a shared-nothing computing environment. Our main contribution is a distribution strategy, called Dis-Dedup, that minimizes the maximum workload across all worker nodes and provides strong theoretical guarantees. We demonstrate the effectiveness of our proposed strategy by performing extensive experiments on both synthetic datasets with varying block size distributions and real-world datasets.
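The two ideas in the abstract, blocking and balancing work across nodes, can be illustrated with a small sketch. It is not the paper's Dis-Dedup strategy: the record layout, the ZIP-code blocking key, and the helper names are hypothetical, and the assignment step is a plain greedy heuristic (largest block first, to the least-loaded worker) that counts only pairwise comparisons and ignores communication cost.

```python
from collections import defaultdict
import heapq


def pairs_without_blocking(n):
    """Number of tuple pairs when every tuple is compared with every other."""
    return n * (n - 1) // 2


def block_by_key(tuples, key):
    """Group tuples into blocks by a blocking key (hypothetical helper)."""
    blocks = defaultdict(list)
    for t in tuples:
        blocks[key(t)].append(t)
    return dict(blocks)


def pairs_with_blocking(blocks):
    """Number of pairs when only tuples in the same block are compared."""
    return sum(len(b) * (len(b) - 1) // 2 for b in blocks.values())


def greedy_assign(blocks, num_workers):
    """Assign whole blocks to workers, largest first, least-loaded worker next.

    This is a generic load-balancing heuristic used for illustration only;
    it is NOT the Dis-Dedup strategy described in the paper.
    """
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}
    for key, members in sorted(blocks.items(), key=lambda kv: -len(kv[1])):
        load, w = heapq.heappop(heap)
        cost = len(members) * (len(members) - 1) // 2  # comparisons in block
        assignment[w].append(key)
        heapq.heappush(heap, (load + cost, w))
    loads = {w: load for load, w in heap}
    return assignment, loads


# Toy data: (name, zip_code); the ZIP code serves as the blocking key.
records = [
    ("Ann Lee", "10001"), ("Anne Lee", "10001"), ("Bob Ray", "10001"),
    ("Cy Orr", "20002"), ("Cy Or", "20002"), ("Dee Fox", "30003"),
]
blocks = block_by_key(records, key=lambda r: r[1])
all_pairs = pairs_without_blocking(len(records))
blocked_pairs = pairs_with_blocking(blocks)
assignment, loads = greedy_assign(blocks, num_workers=2)
max_load = max(loads.values())
print(all_pairs, blocked_pairs, max_load)
```

On this toy input, blocking cuts the comparisons from 15 to 4, yet one worker still receives 3 of those 4 because a block is never split. Skewed block sizes of this kind are what make minimizing the maximum workload across workers a nontrivial problem.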
Citations
Data Cleaning: Overview and Emerging Challenges
Xu Chu, Ihab F. Ilyas, Sanjay Krishnan, Jiannan Wang
- 26 Jun 2016
TL;DR: This work presents a taxonomy of the data cleaning literature and discusses recent work that casts such approaches into a statistical estimation framework, including using machine learning to improve the efficiency and accuracy of data cleaning and considering the effects of data cleaning on statistical analysis.
Magellan: toward building entity matching management systems
Pradap Konda, Sanjib Das, G C Paul Suganthan, AnHai Doan, Adel Ardalan, Jeff Ballard, Han Li, Fatemah Panahi, Haojun Zhang, Jeffrey F. Naughton, Shishir Prasad, Ganesh Krishnan, Rohit Deep, Vijay Raghavendra
- 01 Aug 2016
TL;DR: Magellan is novel in four important aspects: it provides how-to guides that tell users what to do in each EM scenario, step by step, and provides tools to help users do these steps; the tools seek to cover the entire EM pipeline, not just matching and blocking as current EM systems do.
An Overview of End-to-End Entity Resolution for Big Data
TL;DR: This survey provides an end-to-end view of ER workflows for Big Data, critically reviews the pros and cons of existing methods, and concludes with the main open research directions.
Blocking and Filtering Techniques for Entity Resolution: A Survey
TL;DR: This paper reviews a large number of relevant works under two different but related frameworks, blocking and filtering, and provides a comprehensive list of these works, discussing them in the greater context.
Collective Multi-type Entity Alignment Between Knowledge Graphs
Qi Zhu, Hao Wei, Bunyamin Sisman, Da Zheng, Christos Faloutsos, Xin Luna Dong, Jiawei Han
- 20 Apr 2020
TL;DR: A novel collective aggregation function tailored for Multi-type entity Alignment, called CG-MuAlign, that relieves the incompleteness of knowledge graphs via both cross-graph and self attentions, and scales up efficiently with mini-batch training paradigm and effective neighborhood sampling strategy.