About: Record linkage is a research topic. Over its lifetime, 1560 publications have been published within this topic, receiving 45533 citations. The topic is also known as: duplicate detection.
TL;DR: The Avon Longitudinal Study of Children and Parents (ALSPAC) was established to understand how genetic and environmental characteristics influence health and development in parents and children.
Abstract: The Avon Longitudinal Study of Children and Parents (ALSPAC) was established to understand how genetic and environmental characteristics influence health and development in parents and children. All pregnant women resident in a defined area in the South West of England, with an expected date of delivery between 1st April 1991 and 31st December 1992, were eligible and 13 761 women (contributing 13 867 pregnancies) were recruited. These women have been followed over the last 19–22 years and have completed up to 20 questionnaires, have had detailed data abstracted from their medical records and have information on any cancer diagnoses and deaths through record linkage. A follow-up assessment was completed 17–18 years postnatal at which anthropometry, blood pressure, fat, lean and bone mass and carotid intima media thickness were assessed, and a fasting blood sample taken. The second follow-up clinic, which additionally measures cognitive function, physical capability, physical activity (with accelerometer) and wrist bone architecture, is underway and two further assessments with similar measurements will take place over the next 5 years. There is a detailed biobank that includes DNA, with genome-wide data available on >10 000, stored serum and plasma taken repeatedly since pregnancy and other samples; a wide range of data on completed biospecimen assays are available. Details of how to access these data are provided in this cohort profile.
TL;DR: This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.
Abstract: Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.
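Example: As a rough illustration of the kind of field-level similarity metric the abstract above refers to, the sketch below uses normalized edit (Levenshtein) distance to flag approximately duplicate entries. It is not taken from the paper; the function names, the sample records and the 0.8 threshold are assumptions chosen for illustration, and the all-pairs comparison ignores the efficiency techniques the survey covers.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical after trimming and lowercasing."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def find_duplicates(records, threshold=0.8):
    """Return index pairs whose similarity meets the (assumed) threshold."""
    pairs = []
    for (i, r1), (j, r2) in combinations(enumerate(records), 2):
        if similarity(r1, r2) >= threshold:
            pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    names = ["John Smith", "Jon Smith", "Mary Jones"]  # hypothetical records
    print(find_duplicates(names))
```

Comparing every pair is quadratic in the number of records, which is why practical systems usually restrict comparisons to candidate pairs first.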
TL;DR: The Western Australian Health Services Research Linked Database is introduced as infrastructure to support aetiologic, utilisation and outcomes research, and its study population, data resources, technical systems and organisational supports are compared with international best practice.
TL;DR: This paper provides an overview of methods and systems developed for record linkage based on the formal mathematical model of Fellegi and Sunter, and highlights the work of Larsen and Rubin.
Abstract: This paper provides an overview of methods and systems developed for record linkage. Modern record linkage begins with the pioneering work of Newcombe and is especially based on the formal mathematical model of Fellegi and Sunter. In their seminal work, Fellegi and Sunter introduced many powerful ideas for estimating record linkage parameters and other ideas that still influence record linkage today. Record linkage research is characterized by its synergism of statistics, computer science, and operations research. Many difficult algorithms have been developed and put in software systems. Record linkage practice is still very limited. Some limits are due to existing software. Other limits are due to the difficulty in automatically estimating matching parameters and error rates, with current research highlighted by the work of Larsen and Rubin.
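Example: The Fellegi and Sunter model mentioned above is commonly presented as a sum of per-field log-likelihood weights compared against two thresholds. The sketch below shows that standard formulation, assuming conditional independence between fields. The m and u probabilities (chance that a field agrees for a true match and for a non-match, respectively), the field names and the thresholds are hypothetical values, not figures from the paper; in practice these parameters have to be estimated, which the abstract identifies as a central difficulty.

```python
import math


def match_weight(agreements, m, u):
    """Sum of log2 likelihood-ratio weights over the compared fields.

    agreements: field -> True if the two records agree on that field
    m: field -> P(field agrees | records are a true match)
    u: field -> P(field agrees | records are not a match)
    """
    total = 0.0
    for field, agrees in agreements.items():
        if agrees:
            total += math.log2(m[field] / u[field])
        else:
            total += math.log2((1 - m[field]) / (1 - u[field]))
    return total


def classify(weight, lower, upper):
    """Three-way decision rule: link, non-link, or possible link (clerical review)."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


if __name__ == "__main__":
    # Hypothetical parameters for illustration only.
    m = {"surname": 0.9, "dob": 0.95}
    u = {"surname": 0.1, "dob": 0.05}
    w = match_weight({"surname": True, "dob": False}, m, u)
    print(round(w, 3), classify(w, lower=-3.0, upper=3.0))
```

Pairs that fall between the two thresholds are the ones typically sent for manual review.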
TL;DR: In this article, the concept of uniquely identifiable data sets is introduced to eliminate problems normally associated with referencing the location of data after the data has been moved.
Abstract: In a computer having one or more secondary storage devices attached thereto, a Finite Data Environment Processor (FDEP) manages Data Sets residing on the secondary storage devices and in memory using Set Lists (SLs) and General Record Pointers (GRPs). The Data Sets contain either data or logical organizational information. The Set Lists comprise Data Sets organized into a hierarchy by listing an identifier for each of the data sets with a corresponding identifier for the logical parent of that data set. These set lists are also data sets and can be identified as child or parent in a set list. The General Record Pointers identify information in terms of Data Sets and records within them. Using the principal idea that a Data Set is uniquely identifiable, the present invention eliminates problems normally associated with referencing the location of data after the data has been moved.
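Example: The Set List described above pairs each data set identifier with the identifier of its logical parent. A minimal sketch of that idea, assuming a plain in-memory mapping, might look like the following. The representation, the function names and the sample identifiers are all hypothetical and are not taken from the abstract; the point is only that lookups go through identifiers rather than storage locations, so moving the data does not invalidate them.

```python
from typing import Dict, List, Optional

# Hypothetical Set List: data set identifier -> identifier of its logical parent
# (None marks a root).
SetList = Dict[str, Optional[str]]


def ancestors(set_list: SetList, data_set_id: str) -> List[str]:
    """Walk parent identifiers from a data set up to the root."""
    chain = []
    seen = {data_set_id}
    parent = set_list[data_set_id]
    while parent is not None:
        if parent in seen:
            raise ValueError("cycle in set list")
        chain.append(parent)
        seen.add(parent)
        parent = set_list[parent]
    return chain


def children(set_list: SetList, parent_id: str) -> List[str]:
    """Identifiers of the data sets whose logical parent is parent_id."""
    return sorted(c for c, p in set_list.items() if p == parent_id)


if __name__ == "__main__":
    sl = {"root": None, "patients": "root", "visits": "patients", "labs": "patients"}
    print(ancestors(sl, "visits"))
    print(children(sl, "patients"))
```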