
How Can I Compare Different Rows Of One Column With Levenshtein Distance Metric In Pandas?

I have a table like this:

id name
1 gfh
2 bob
3 boby
4 hgf
etc.

I am wondering how I can use the Levenshtein metric to compare different rows of my 'name' column. I already know that

Solution 1:

Here is a way to do it with pandas and numpy:

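from io import StringIO

import pandas as pd
import Levenshtein as L  # third-party Levenshtein package, used later as L.distance
from numpy import triu, ones

t = """id name
1 gfh
2 bob
3 boby
4 hgf"""

# Parse the whitespace-separated sample table and index it by id
df = pd.read_csv(StringIO(t), sep=r'\s+').set_index('id')
print(df)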

    name
id
1    gfh
2    bob
3   boby
4    hgf

Create a dataframe in which every cell holds a one-element list with the string to compare:

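# Repeat the list of names once per row, so every row holds all names
dfs = pd.DataFrame([df.name.tolist()] * df.shape[0], index=df.index, columns=df.index)
# Wrap each string in a list (applymap was renamed to DataFrame.map in pandas 2.1)
dfs = dfs.applymap(lambda x: [x])
print(dfs)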

id      1      2       3      4
id
1   [gfh]  [bob]  [boby]  [hgf]
2   [gfh]  [bob]  [boby]  [hgf]
3   [gfh]  [bob]  [boby]  [hgf]
4   [gfh]  [bob]  [boby]  [hgf]

Add the dataframe to its transpose so that each cell holds a pair of strings, then mask the upper-right triangle (diagonal included) with NaN:

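# Element-wise list concatenation: cell (row, col) becomes [name_col, name_row]
dfd = dfs + dfs.T
# Hide the diagonal and the duplicate pairs above it
dfd = dfd.mask(triu(ones(dfd.shape)).astype(bool))
print(dfd)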

id            1            2            3    4
id
1           NaN          NaN          NaN  NaN
2    [gfh, bob]          NaN          NaN  NaN
3   [gfh, boby]  [bob, boby]          NaN  NaN
4    [gfh, hgf]   [bob, hgf]  [boby, hgf]  NaN

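Measure the distance for each pair with L.distance, leaving the masked NaN cells untouched:

# NaN cells are floats, not lists, so indexing them would raise a TypeError
dfd.applymap(lambda x: L.distance(x[0], x[1]) if isinstance(x, list) else x)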
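
The snippets above assume that `L` is the third-party Levenshtein module. If that package is not installed, a minimal pure-Python sketch of the edit distance could stand in for `L.distance`. The function name `levenshtein` and the `lower` dictionary are hypothetical names chosen for illustration, not part of any library:

```python
def levenshtein(a, b):
    """Edit distance where insertion, deletion and substitution each cost 1."""
    # Only the previous row of the dynamic-programming table is kept.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


names = ["gfh", "bob", "boby", "hgf"]

# Lower-triangle pairs only, the same cells the masked dataframe keeps.
lower = {
    (names[j], names[i]): levenshtein(names[j], names[i])
    for i in range(len(names))
    for j in range(i)
}
```

This is slower than a C-backed implementation, but it avoids the extra dependency for small columns.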

Solution 2:

Another option is to compare every value with every other value and store the result for each combination.

Naively coded, something like:

import Levenshtein as L  # third-party Levenshtein package

input_data = ["gfh", "bob", "boby", "hgf"]
data_len = len(input_data)
output_results = {}

for i in range(data_len):
    word_1 = input_data[i]
    for j in range(data_len):
        if j == i:  # skip self-comparison
            continue
        word_2 = input_data[j]
        # compute the distance for this ordered pair
        output_results[(word_1, word_2)] = L.distance(word_1, word_2)

Then do whatever you need with output_results.
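
Since the Levenshtein distance is symmetric, the nested loop computes every pair twice. A possible variation, sketched here with `itertools.combinations`, visits each unordered pair once. The names `edit_distance` and `pairwise_distances` are hypothetical, and `edit_distance` is only a memoized pure-Python stand-in so the sketch runs without third-party packages; `L.distance` could be passed in instead:

```python
from functools import lru_cache
from itertools import combinations


@lru_cache(maxsize=None)
def edit_distance(a, b):
    # Stand-in for L.distance: plain recursive Levenshtein, memoized.
    if not a:
        return len(b)
    if not b:
        return len(a)
    cost = 0 if a[-1] == b[-1] else 1
    return min(
        edit_distance(a[:-1], b) + 1,            # delete last char of a
        edit_distance(a, b[:-1]) + 1,            # insert last char of b
        edit_distance(a[:-1], b[:-1]) + cost,    # substitute or match
    )


def pairwise_distances(words, distance=edit_distance):
    # combinations() yields each unordered pair once, so self-pairs
    # and mirrored pairs are never computed.
    return {(w1, w2): distance(w1, w2) for w1, w2 in combinations(words, 2)}


results = pairwise_distances(["gfh", "bob", "boby", "hgf"])
closest = min(results, key=results.get)  # pair with the smallest distance
```

For n strings this stores n * (n - 1) / 2 entries instead of n * (n - 1), at the price of having to look up a pair in a fixed order.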
