Distributed data deduplication

doi:10.14778/2983200.2983203

Journal Article•10.14778/2983200.2983203

Xu Chu, +2 more

- 01 Jul 2016

- Vol. 9, Iss: 11, pp 864-875

81

TL;DR: This paper shows how to further speed up data deduplication by leveraging parallelism in a shared-nothing computing environment and proposes a distribution strategy, called Dis-Dedup, that minimizes the maximum workload across all worker nodes and provides strong theoretical guarantees.

Citations

Proceedings Article•10.1145/2882903.2912574

Data Cleaning: Overview and Emerging Challenges

Xu Chu, +3 more

- 26 Jun 2016

TL;DR: This work presents a taxonomy of the data cleaning literature and discusses recent work that casts such approaches into a statistical estimation framework, including using machine learning to improve the efficiency and accuracy of data cleaning and considering the effects of data cleaning on statistical analysis.

483

Journal Article•10.14778/2994509.2994535

Magellan: toward building entity matching management systems

Pradap Konda, +13 more

- 01 Aug 2016

TL;DR: Magellan is novel in four important aspects: it provides how-to guides that tell users what to do in each EM scenario, step by step, and provides tools to help users do these steps; the tools seek to cover the entire EM pipeline, not just matching and blocking as current EM systems do.

256

Journal Article•10.1145/3418896

An Overview of End-to-End Entity Resolution for Big Data

Vassilis Christophides, +4 more

- 06 Dec 2020

- ACM Computing Surveys

TL;DR: This survey provides an end-to-end view of ER workflows for Big Data, critically reviews the pros and cons of existing methods, and concludes with the main open research directions.

220

Journal Article•10.1145/3377455

Blocking and Filtering Techniques for Entity Resolution: A Survey

George Papadakis, +3 more

- 13 Mar 2020

- ACM Computing Surveys

TL;DR: This paper reviews a large number of relevant works under two different but related frameworks, blocking and filtering, and provides a comprehensive list of those works, discussing them in the greater context.

161

Proceedings Article•10.1145/3366423.3380289

Collective Multi-type Entity Alignment Between Knowledge Graphs

Qi Zhu, +6 more

- 20 Apr 2020

TL;DR: A novel collective aggregation function tailored for multi-type entity alignment, called CG-MuAlign, that relieves the incompleteness of knowledge graphs via both cross-graph and self attentions, and scales up efficiently with a mini-batch training paradigm and an effective neighborhood sampling strategy.

61

References

Journal Article•10.21276/IJRE.2018.5.5.4

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

- 06 Dec 2004

TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.

22.7K

Journal Article•10.1145/1327452.1327492

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

- 01 Jan 2008

- Communications of The ACM

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

18.6K

Proceedings Article

Spark: cluster computing with working sets

Matei Zaharia, +4 more

- 22 Jun 2010

TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.

5.3K

Journal Article•10.1109/TKDE.2007.250581

Duplicate Record Detection: A Survey

Ahmed K. Elmagarmid, +2 more

- 01 Jan 2007

- IEEE Transactions on Knowledge and Data ...

TL;DR: This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.

2.1K

Journal Article•10.1109/TKDE.2007.9

Duplicate Record Detection: A Survey

Ahmed K. Elmagarmid, +2 more

- 01 Jan 2007

- IEEE Transactions on Knowledge and Data ...

TL;DR: This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.

1.6K