
Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records

Since most social science research relies upon multiple data
sources, merging data sets is an essential part of the workflow for many
researchers. In many situations, however, a unique identifier that
unambiguously links data sets is unavailable and data sets may
contain missing and inaccurate information. As a result,
researchers can no longer combine data sets "by hand" without
sacrificing the quality of the resulting merged data set. This
problem is especially severe when merging large-scale administrative
records such as voter files. Existing algorithms that automate the
merging process do not scale, yield far fewer matches, and
require arbitrary decisions by researchers. To overcome this
challenge, we develop a fast algorithm to implement the canonical
probabilistic model of record linkage for merging large data sets.
Researchers can combine this model with a small amount of human
coding to produce a high-quality merged data set. The proposed
methodology can handle millions of observations and account for
missing data and auxiliary information. We conduct simulation
studies to show that our algorithm performs well in a variety of
practically relevant settings. Finally, we use our methodology to
merge the campaign contribution data (5 million records), the
Cooperative Congressional Election Study data (50 thousand records),
and the nationwide voter file (160 million records).
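
To make the idea of a probabilistic record linkage model concrete, the following is a minimal sketch of a two-class mixture model over field-agreement patterns, fitted by EM, in the style usually attributed to Fellegi and Sunter. It is not the fast algorithm developed in this work. It assumes binary agreement indicators, conditional independence across fields, and that a missing comparison (coded as None) can simply be dropped from the likelihood. All function and variable names are hypothetical.

```python
from math import prod


def _likelihood(pattern, probs):
    """P(pattern | class) under conditional independence.

    Fields whose comparison is missing (None) are skipped, which treats
    missingness as ignorable (an assumption of this sketch).
    """
    return prod(p if g else 1.0 - p
                for g, p in zip(pattern, probs) if g is not None)


def posterior_match(pattern, lam, m, u):
    """Posterior probability that a record pair is a true match."""
    num = lam * _likelihood(pattern, m)
    den = num + (1.0 - lam) * _likelihood(pattern, u)
    return num / den


def fit_em(patterns, n_iter=50, lam=0.1, m_init=0.9, u_init=0.1, eps=1e-6):
    """Estimate (lam, m, u) by EM.

    lam  : share of pairs that are matches
    m[k] : P(field k agrees | match)
    u[k] : P(field k agrees | non-match)
    """
    def clip(x):
        # Keep probabilities strictly inside (0, 1) to avoid 0/0 posteriors.
        return min(max(x, eps), 1.0 - eps)

    k = len(patterns[0])
    m, u = [m_init] * k, [u_init] * k
    for _ in range(n_iter):
        # E-step: match probability for every pair.
        xi = [posterior_match(g, lam, m, u) for g in patterns]
        # M-step: weighted agreement rates, using observed fields only.
        lam = clip(sum(xi) / len(xi))
        for j in range(k):
            obs = [(x, g[j]) for x, g in zip(xi, patterns) if g[j] is not None]
            m_den = sum(x for x, _ in obs)
            u_den = sum(1.0 - x for x, _ in obs)
            if m_den > 0:
                m[j] = clip(sum(x * a for x, a in obs) / m_den)
            if u_den > 0:
                u[j] = clip(sum((1.0 - x) * a for x, a in obs) / u_den)
    return lam, m, u


# Toy input: 100 pairs agreeing on all three fields, 900 agreeing on none.
pairs = [(1, 1, 1)] * 100 + [(0, 0, 0)] * 900
lam, m, u = fit_em(pairs)
```

Pairs whose posterior match probability falls in an intermediate range are natural candidates for the small amount of human coding mentioned above, while pairs near 0 or 1 can be classified automatically. At the scale of millions of records, a sketch like this would need to operate on counts of distinct agreement patterns rather than on individual pairs.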