One of the challenges in merging administrative datasets is that different datasets will often include records about the same entity (e.g., the same individual), but matching between these records and merging across disparate datasets, or even identifying linked records within the same dataset, can be challenging. This is because often the data may not include sufficient information to unambiguously identify individuals. Even when unambiguous data (i.e., uniquely idenfitying data, such as full name and date of birth) is recorded, often the records can contain small errors in the way in which the identity of the entity is stored, making automated linkage difficult.
recordlinkage
Python libraryThis library includes implementations of functions that can hep us to identify and link between records that pertain to the same individual, even in the face of the errors that typically hamper these linkages.
As it is, if you try to import the recordlinkage
library, you will run into problems:
import recordlinkage
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-93ec48a7ca10> in <module>()
----> 1 import recordlinkage
ModuleNotFoundError: No module named 'recordlinkage'
This means that the module is not installed on this machine.
To install external libraries, we can use the Python package manager pip
.
The pip
package manager goes and finds the software in the Python package index, downloads it, figures it what all its dependencies are, installs them and then finally installs the software and makes it available. In Azure, we can do this as follows:
!pip install recordlinkage
Now the import
statement works!
The process of record linkage includes several steps:
This means that we create suggested pairs that we will check. One way to do that is to create all-to-all pairings:
df = pd.read_csv('https://raw.githubusercontent.com/uwescience/ds4ad/master/data/synthetic_data.csv')
indexer = recordlinkage.FullIndex()
pairs = indexer.index(df)
But this creates a very large number of pairs to compare:
len(pairs)
12497500
So we would like to use some bit of information that we have to first create putative pairs. One way to do this is called blocking. This means that we create an index that “blocks on” one of the variables that we think would make a good initial guess for potential matches:
indexer = recordlinkage.BlockIndex(on='given_name')
pairs = indexer.index(df)
print (len(pairs))
48883
That’s a much more manageable number
To compare pairs, we will create a Compare
object:
compare_cl = recordlinkage.Compare()
We can add various properties to the Compare
object , requiring various kinds of matches:
compare_cl.exact('given_name', 'given_name', label='given_name')
compare_cl.string('surname', 'surname', method='jarowinkler', threshold=0.85, label='surname')
compare_cl.date('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('suburb', 'suburb', label='suburb')
compare_cl.exact('state', 'state', label='state')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')
From this we can calculate what we call features
:
features = compare_cl.compute(pairs, df)
This is a DataFrame
with two indices. For example:
features.loc[0, 13]
Refers to the potential matching of index 0
and index 13
from the original table (df
).
To match between records in the features
table, we need to somehow set a criterion for what
we will call a match. For example, we can say that we require more than three matches in the
criteria we defined above:
matches = features[features.sum(axis=1) > 3]
For example, this matching process thinks that index 113
and index 1231
are the same person. Let’s see if this makes sense:
df.loc[113]
rec_id rec-2254-dup-2
given_name sarah
surname webb
street_number 39
address_1 mcfarlanplace
address_2 t int-sec ngarden
suburb coomaleidgup
postcode 6023
state nsw
date_of_birth 1.99707e+07
soc_sec_id 8244083
gender female
salary 44817
Name: 113, dtype: object
df.loc[1231]
rec_id rec-2254-org
given_name sarah
surname webb
street_number 39
address_1 mcfarlan place
address_2 t int-sect garden
suburb coomalbidgup
postcode 6023
state nsw
date_of_birth 1.99707e+07
soc_sec_id 8244083
gender female
salary 44671
Name: 1231, dtype: object
This makes sense!
Can you find problems with this linkage? What would you do improve the linkage?