Data deduplication

Papers published on a yearly basis

Papers

Journal Article • DOI: 10.1109/TKDE.2007.250581

Duplicate Record Detection: A Survey

01 Jan 2007-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.

Abstract: Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.
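
The abstract above mentions similarity metrics for field entries without naming one. As a minimal sketch, assuming Levenshtein edit distance as the metric (one common choice, not necessarily the one the survey emphasizes), two records can be compared field by field. The function names, the averaging rule, and the 0.8 threshold are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def field_similarity(a: str, b: str) -> float:
    """Normalize edit distance to [0, 1] after trimming and lowercasing."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def is_probable_duplicate(r1: dict, r2: dict, threshold: float = 0.8) -> bool:
    """Hypothetical rule: mean similarity over shared fields meets the threshold."""
    shared = r1.keys() & r2.keys()
    scores = [field_similarity(r1[k], r2[k]) for k in shared]
    return bool(scores) and sum(scores) / len(scores) >= threshold


a = {"name": "John Smith", "city": "Boston"}
b = {"name": "Jon Smith", "city": "Boston"}
c = {"name": "Mary Jones", "city": "Denver"}
print(is_probable_duplicate(a, b), is_probable_duplicate(a, c))
```

Averaging field scores is the simplest possible combination rule; a real matcher would typically weight fields differently or learn the decision function from labeled pairs.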

2,190 citations

Journal Article • DOI: 10.1109/TKDE.2007.9

Duplicate Record Detection: A Survey

Ahmed K. Elmagarmid¹, Panagiotis G. Ipeirotis², Vassilios S. Verykios²•Institutions (2)

Purdue University¹, IEEE Computer Society²

01 Jan 2007-IEEE Transactions on Knowledge and Data Engineering

1,647 citations

Patent

Performing data storage operations with a cloud storage environment, including automatically selecting among multiple cloud storage sites

Anand Prahlad, Marcus S. Muller, Rajiv Kottomtharayil, Srinivas Kavuri, Parag Gokhale, +1 more

31 Mar 2010

TL;DR: This patent describes a variety of data storage operations within a cloud environment, including content indexing, containerized deduplication, and policy-driven storage.

Abstract: Systems and methods are disclosed for performing data storage operations, including content-indexing, containerized deduplication, and policy-driven storage, within a cloud environment. The systems support a variety of clients and cloud storage sites that may connect to the system in a cloud environment that requires data transfer over wide area networks, such as the Internet, which may have appreciable latency and/or packet loss, using various network protocols, including HTTP and FTP. Methods are disclosed for content indexing data stored within a cloud environment to facilitate later searching, including collaborative searching. Methods are also disclosed for performing containerized deduplication to reduce the strain on a system namespace, effectuate cost savings, etc. Methods are disclosed for identifying suitable storage locations, including suitable cloud storage sites, for data files subject to a storage policy. Further, systems and methods for providing a cloud gateway and a scalable data object store within a cloud environment are disclosed, along with other features.

849 citations

Proceedings Article • DOI: 10.1145/775047.775087

Interactive deduplication using active learning

Sunita Sarawagi¹, Anuradha Bhamidipaty¹•Institutions (1)

Indian Institute of Technology Bombay¹

23 Jul 2002

TL;DR: This work presents the design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning and investigates various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.

Abstract: Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can resolve when a pair of records refer to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method critically hinges on being able to provide a covering and challenging set of training pairs that bring out the subtlety of the deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in large lists. We present our design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning. Our experiments on real-life datasets show that active learning significantly reduces the number of instances needed to achieve high accuracy. We investigate various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.
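
The idea of interactively picking challenging training pairs can be illustrated with plain uncertainty sampling. This is a deliberately simplified sketch and not the system the paper describes: it assumes each candidate pair has already been reduced to a single similarity score, the "classifier" is one learned threshold, and `oracle` stands in for the human labeler. All names here are hypothetical.

```python
def fit_threshold(labeled):
    """Midpoint between the highest-scoring non-duplicate and the
    lowest-scoring duplicate seen so far. Needs one example of each class."""
    duplicates = [score for score, is_dup in labeled if is_dup]
    non_duplicates = [score for score, is_dup in labeled if not is_dup]
    return (min(duplicates) + max(non_duplicates)) / 2


def active_learn(pool, oracle, seed, budget):
    """Repeatedly ask the oracle about the pair the current model is
    least sure of, i.e. the score closest to the decision threshold."""
    labeled = list(seed)
    unlabeled = list(pool)
    queried = []
    for _ in range(budget):
        if not unlabeled:
            break
        threshold = fit_threshold(labeled)
        query = min(unlabeled, key=lambda score: abs(score - threshold))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))
        queried.append(query)
    return fit_threshold(labeled), queried


# Simulated labeler: pairs scoring 0.7 or more are true duplicates (assumption).
oracle = lambda score: score >= 0.7
seed = [(0.0, False), (1.0, True)]
pool = [0.1, 0.3, 0.48, 0.55, 0.65, 0.75, 0.9]

threshold, queried = active_learn(pool, oracle, seed, budget=3)
print(queried, threshold)
```

With only three labels the threshold ends up between the two scores that straddle the true boundary, whereas labeling pairs in arbitrary order could spend the same budget on easy cases such as 0.1 or 0.9.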

840 citations

Journal Article • DOI: 10.1109/TKDE.2011.127

A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication

Peter Christen¹•Institutions (1)

Australian National University¹

01 Sep 2012-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper surveys 12 variations of 6 indexing techniques for record linkage and deduplication. The techniques aim to reduce the number of record pairs compared in the matching process by removing obvious nonmatching pairs while maintaining high matching quality.

Abstract: Record linkage is the process of matching records from several databases that refer to the same entities. When applied on a single database, this process is known as deduplication. Increasingly, matched data are becoming important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today's databases, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They are aimed at reducing the number of record pairs to be compared in the matching process by removing obvious nonmatching pairs, while at the same time maintaining high matching quality. This paper presents a survey of 12 variations of 6 indexing techniques. Their complexity is analyzed, and their performance and scalability is evaluated within an experimental framework using both synthetic and real data sets. No such detailed survey has so far been published.
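
To make the pair-reduction idea concrete, here is a minimal sketch of standard blocking, a widely used indexing approach; the abstract does not name the techniques it surveys, so treating this as representative is an assumption. Records are grouped by a blocking key, and only records that share a key become candidate pairs. The key definition and the field names `surname` and `postcode` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> str:
    """Hypothetical key: first three letters of the surname plus the postcode."""
    return record["surname"][:3].lower() + "|" + record["postcode"]


def candidate_pairs(records: list) -> set:
    """Return index pairs of records that fall into the same block."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[blocking_key(record)].append(index)
    pairs = set()
    for members in blocks.values():
        # Only records inside one block are ever compared with each other.
        pairs.update(combinations(members, 2))
    return pairs


records = [
    {"surname": "Smith", "postcode": "2600"},
    {"surname": "Smithe", "postcode": "2600"},
    {"surname": "Jones", "postcode": "2600"},
    {"surname": "Smith", "postcode": "3000"},
]

all_pairs = len(records) * (len(records) - 1) // 2
print(len(candidate_pairs(records)), "of", all_pairs, "pairs kept")
```

The trade-off is visible even in this toy data: the key removes five of the six possible comparisons, but a true duplicate whose postcode was mistyped would land in a different block and never be compared.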

799 citations

Year	Papers
2026	2
2025	93
2024	132
2023	148
2022	292
2021	240

Related Topics (5)

Performance Metrics