How to cite this WIREs title:
WIREs Comp Stat

Matching and record linkage


This overview gives background on a number of statistical methods that have proven effective for record linkage. To prepare data for the main computational algorithms, parsing and standardization are needed to structure free‐form names, addresses, and other fields into corresponding components. The main parameter‐estimation methods are unsupervised methods that yield ‘optimal’ record linkage parameters. Extended methods provide estimates of false match rates in unsupervised situations and, with greater accuracy, in semi‐supervised situations. Finally, the paper describes ongoing research on adjusting standard statistical analyses for linkage error.

WIREs Comput Stat 2014, 6:313–325. doi: 10.1002/wics.1317

This article is categorized under:
Statistical and Graphical Methods of Data Analysis > EM Algorithm
Algorithms and Computational Methods > Seminumerical and Nonnumerical Methods
Data: Types and Structure > Data Preparation and Processing
Statistical and Graphical Methods of Data Analysis > Markov Chain Monte Carlo (MCMC)
Plots of log frequencies versus matching weight of nonmatches and matches. Circle (○) = nonmatch, * = match; cutoff ‘L’ = 0 and cutoff ‘U’ = 6.
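The role of the two cutoffs in this caption can be illustrated with a minimal sketch. It assumes the usual three-way decision rule on a matching weight (above ‘U’ a pair is treated as a match, below ‘L’ as a nonmatch, and anything in between is held for clerical review) and a weight formed as a sum of log ratios under conditional independence. The function names, the base-2 logarithm, and the m/u probabilities are illustrative assumptions and are not taken from the article.

```python
import math


def match_weight(agreements, m_probs, u_probs):
    """Sum per-field log2 weights, assuming fields agree independently.

    agreements: True/False per field (does the pair agree on that field?)
    m_probs:    assumed P(field agrees | pair is a true match)
    u_probs:    assumed P(field agrees | pair is a true nonmatch)
    """
    total = 0.0
    for agrees, m, u in zip(agreements, m_probs, u_probs):
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1.0 - m) / (1.0 - u))
    return total


def classify(weight, lower=0.0, upper=6.0):
    """Three-way decision using cutoffs L (lower) and U (upper)."""
    if weight > upper:
        return "match"
    if weight < lower:
        return "nonmatch"
    return "clerical review"


if __name__ == "__main__":
    # Hypothetical probabilities for three fields (e.g. surname, first name, year of birth).
    m = [0.9, 0.8, 0.95]
    u = [0.1, 0.2, 0.05]
    for pattern in ([True, True, True], [True, True, False], [False, False, False]):
        w = match_weight(pattern, m, u)
        print(pattern, round(w, 2), classify(w))
```

The defaults `lower=0.0` and `upper=6.0` mirror the cutoffs quoted in the caption. In practice the m and u probabilities would be estimated from the data, for example with the EM-based methods the abstract mentions, rather than set by hand.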
False match rate estimates from three methods applied to three pairs of files.
(a) Estimates versus truth, file A cumulative false matches, unsupervised independent EM, λ = 0.2.
(b) Estimates versus truth, file B cumulative false matches, unsupervised independent EM, λ = 0.2.
(c) Estimates versus truth, file C cumulative false matches, unsupervised independent EM, λ = 0.2.
(d) Estimates versus truth, file A cumulative false match rates, unsupervised Belin‐Rubin procedure.
(e) Estimates versus truth, file B cumulative false match rates, unsupervised Belin‐Rubin procedure.
(f) Estimates versus truth, file C cumulative false match rates, unsupervised Belin‐Rubin procedure.
(g) Estimates versus truth, file A cumulative false matches, semi‐supervised small sample, independent EM, λ = 0.99.
(h) Estimates versus truth, file B cumulative false matches, semi‐supervised small sample, independent EM, λ = 0.99.
(i) Estimates versus truth, file C cumulative false matches, semi‐supervised small sample, independent EM, λ = 0.99.

