
How Can I Create An Artificial Key Column For Merging Two Datasets Using Difflib When The Column Of Interest Has Missing Cells?

Goal: If the name in df2 in row i is a sub-string or an exact match of a name in df1 in some row N and the state and district columns of row N in df1 are a match to the respective

Solution 1:

difflib.get_close_matches returns a list, and when nothing matches that list is empty, so it has no index 0. That's why you get this error. Second, these lists need to be converted to strings before the merge can work, as in the code below.

Note: you don't have to use df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY'), because an empty string simply produces no matches.
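To see both points in isolation, here is a minimal sketch. The candidates list is a made-up stand-in for df2['Name'], not your actual data.

```python
import difflib

# Hypothetical stand-in for df2['Name'].
candidates = ["Theodorick Bland", "Aedanus Burke", "Jason Lewis"]

# A close name returns a non-empty list, best match first.
hit = difflib.get_close_matches("Theodorick A. Bland", candidates)
print(hit)

# An empty cell returns an empty list, not an error.
miss = difflib.get_close_matches("", candidates)
print(miss)

# Indexing the empty list is what fails.
try:
    miss[0]
except IndexError as exc:
    print("IndexError:", exc)
```

So the failure comes from the [0], not from the empty cell itself.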

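import difflib

# n=1 keeps only the best match, so ''.join() gives that match, or '' when nothing is close enough
df1['Name'] = df1['CandidateName'].apply(lambda x: ''.join(difflib.get_close_matches(x, df2['Name'], n=1)))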

df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'], how='left')

print(df_merge)
              CandidateName State  District     Party              Name
0       Theodorick A. Bland    VA         9            Theodorick Bland
1  Aedanus Rutherford Burke    SC         2               Aedanus Burke
2               Jason Lewis    MN         2                 Jason Lewis
3         Barbara  Comstock    VA        10  Democrat  Barbara Comstock
4          Theodorick Bland    VA         9            Theodorick Bland
5             Aedanus Burke    SC         2               Aedanus Burke
6       Jason Initial Lewis    MN         2  Democrat       Jason Lewis
7                              NH         1      Whig                  
8                              NH         1      Whig                

Note that I added the how='left' argument to the merge, since you want to keep the shape of your original dataframe.
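If it helps, here is a small sketch of what how='left' changes compared with an inner merge. The two frames and the Year column are invented for illustration and are not your data.

```python
import pandas as pd

# Hypothetical miniature frames; the Year column is made up.
left = pd.DataFrame({
    "Name": ["Jason Lewis", "", "Aedanus Burke"],
    "State": ["MN", "NH", "SC"],
    "District": [2, 1, 2],
})
right = pd.DataFrame({
    "Name": ["Jason Lewis", "Aedanus Burke"],
    "State": ["MN", "SC"],
    "District": [2, 2],
    "Year": [2016, 1789],
})

keys = ["Name", "State", "District"]
left_merged = left.merge(right, on=keys, how="left")    # keeps the unmatched NH row
inner_merged = left.merge(right, on=keys, how="inner")  # drops the unmatched NH row

print(len(left), len(left_merged), len(inner_merged))
```

The left merge keeps the row with the empty Name and fills Year with NaN. The row count only stays the same as long as the key combinations in the right frame are unique; duplicates there would repeat rows from the left frame.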

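Explanation of ''.join(): we do this to convert the list to a string. The example below joins with a space (' ') to keep the words readable, whereas the solution joins with an empty string (''):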

lst = ['hello', 'world']

print(' '.join(lst))
hello world
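Applied to the lists that get_close_matches can return, the empty-string join behaves as sketched below. The names are made up for illustration.

```python
# No match: the empty list becomes an empty string, which is why blank names stay blank.
no_match = "".join([])

# One match: the string is just that match.
one_match = "".join(["Jason Lewis"])

# Several matches: they are glued together with no separator.
two_matches = "".join(["Jason Lewis", "Jason Lewes"])

print(repr(no_match))
print(one_match)
print(two_matches)
```

The last case is worth watching for: get_close_matches returns up to three matches by default, so if your data can produce more than one close match, passing n=1 to it is one way to avoid a glued-together key.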
