How to cite this WIREs title:
WIREs Comp Stat

Regression with linked datasets subject to linkage error


Abstract: Data are often collected from multiple heterogeneous sources and combined afterwards. In combining data, record linkage is an essential task for linking records in datasets that refer to the same entity. Record linkage is generally not error‐free: records belonging to different entities may be linked, and records belonging to the same entity may be missed. Such errors should not simply be ignored, because they can contaminate the data and introduce bias in sample selection or estimation, which in turn can lead to misleading statistical results and conclusions. For a long while this problem was not properly recognized, but in recent years a growing number of researchers have developed methodology for dealing with linkage errors in regression analysis with linked datasets. The main goal of this overview is to give an account of those developments, with an emphasis on recent approaches and their connection to the so‐called “Broken Sample” problem. We also provide a short empirical study that illustrates the efficacy of corrective methods in different scenarios.

This article is categorized under: Statistical Models > Model Selection; Statistical and Graphical Methods of Data Analysis > Robust Methods; Statistical and Graphical Methods of Data Analysis > Multivariate Analysis
The setting of regression based on two linked files A and B. The response variable weeks_unemployed (duration of unemployment in weeks) is contained in File B, while the two potential predictor variables Edu (education level) and Salary (past monthly salary in US$) are contained in File A. The variables Age and Sex are potential predictor variables contained in both files. In combination with ZIP (zip code of home address), these three variables can thus be used as matching variables in record linkage. Possible sources of mismatch error corresponding to records with non‐unique combinations of matching variables are highlighted via gray shading and framed boxes, respectively. Reproduced from Wang et al. (2020)
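The mismatch risk described in this caption can be made concrete with a minimal sketch. It flags records whose combination of matching variables (Age, Sex, ZIP) is not unique within a file, since such records cannot be linked unambiguously on those variables alone. The records, the `id` field, and the helper name `ambiguous_records` are hypothetical and invented for illustration; they are not taken from Wang et al. (2020).

```python
from collections import Counter

# Hypothetical toy records standing in for File A (values invented for illustration).
file_a = [
    {"id": "a1", "Age": 34, "Sex": "F", "ZIP": "22030"},
    {"id": "a2", "Age": 34, "Sex": "F", "ZIP": "22030"},
    {"id": "a3", "Age": 51, "Sex": "M", "ZIP": "22031"},
    {"id": "a4", "Age": 29, "Sex": "F", "ZIP": "22032"},
]


def ambiguous_records(records, keys=("Age", "Sex", "ZIP")):
    """Return ids of records whose matching-variable combination is shared."""
    # Count how often each combination of matching variables occurs.
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    # A record is ambiguous if at least one other record has the same combination.
    return [r["id"] for r in records if counts[tuple(r[k] for k in keys)] > 1]


ambiguous = ambiguous_records(file_a)
print(ambiguous)
```

Here the records `a1` and `a2` share the same Age, Sex, and ZIP, so a linkage procedure relying only on these variables could pair either of them with the same record in File B.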
Comparison of two‐stage approach for estimation of for corresponding data in Table 1. Top row: fitted values based on “mismatch‐free” data vs. fitted values based on data with mismatches . Middle row: fitted values based on “mismatch‐free” data vs. fitted values based on the corrected data with denoting the solution of optimization problem (19). Bottom row: QQ plots of the absolute differences between the true responses and their fitted values based on the oracle estimator vs. the absolute mismatch errors in the merged file (dots) and their counterparts after correction (triangles) based on (19). The idea underlying the bottom plots is that after correction, the remaining mismatch error is supposed to exhibit a similar distribution as the noise of the regression model, here approximated by the distribution of the residuals associated with the oracle estimator
Comparison of approaches without knowledge of blocks. Each marker corresponds to one of the approaches given above. Each point in the graphs represents an average of the given metric over independent replications. The number of replications is 100 for both ISD and CPS, and 20 for END because of the large size of that dataset
Fictitious example dataset based on the Italian household survey discussed in Tancredi and Liseo (2015). Here, the monthly household income (in 1000 Euros) in 2010 is regressed on the same quantity in 2008. Left: scatterplot and estimated regression line in the absence of mismatch error (“oracle”). Middle: scatterplot and regression line (gray) after linking two files containing the records for 2008 and 2010, respectively. Mismatches and correct matches are shown with distinct plotting symbols. Right: summary of the linear regression fits corresponding to the left and middle plots; $R^2$ denotes the coefficient of determination
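The effect shown in this figure can be reproduced in spirit with a small simulation. The sketch below uses synthetic data (not the household survey) and assumes mismatches arise by randomly shuffling the responses within a randomly chosen fraction `alpha` of the records, independently of the predictor. Under that assumption the naive least-squares slope is expected to shrink toward roughly `(1 - alpha) * beta`. All names and parameter values are illustrative assumptions.

```python
import random


def ols_slope(x, y):
    """Slope of the simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(0)   # fixed seed so the run is repeatable
n = 5000                 # number of linked record pairs (assumed)
beta = 1.0               # true slope (assumed)
alpha = 0.3              # fraction of mismatched records (assumed)

# Synthetic "oracle" data: correctly linked predictor/response pairs.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [beta * xi + rng.gauss(0.0, 0.5) for xi in x]

# Simulate linkage error: shuffle the responses among a random subset of records.
idx = rng.sample(range(n), int(alpha * n))
shuffled = idx[:]
rng.shuffle(shuffled)
y_linked = y[:]
for i, j in zip(idx, shuffled):
    y_linked[i] = y[j]

oracle_slope = ols_slope(x, y)
naive_slope = ols_slope(x, y_linked)
print(f"oracle slope: {oracle_slope:.3f}")
print(f"naive slope with mismatches: {naive_slope:.3f}")
```

The naive fit on the linked data is attenuated relative to the oracle fit, which is the kind of bias that the corrective methods surveyed in the article aim to remove.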

