Pairwise Euclidean Distance With Pandas Ignoring Nans

September 16, 2024

I start with a dictionary, which is the way my data was already formatted:

import pandas as pd
dict2 = {'A': {'a':1.0, 'b':2.0, 'd':4.0}, 'B':{'a':2.0, 'c':2.0, 'd':5.0}, 'C':{'b'

Solution 1:

You can use numpy broadcasting to compute vectorised Euclidean distance (L2-norm), ignoring NaNs using np.nansum.

import numpy as np
i = df.values.T  # one row per column of the original DataFrame
j = np.nansum((i - i[:, None]) ** 2, axis=2) ** .5  # NaN terms are summed as 0
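To make the broadcasting step easier to follow, here is a minimal sketch that splits the one-liner into named stages. It assumes the three-column data used in Solution 2; the intermediate names diff and sq are only illustrative.

```python
import numpy as np
import pandas as pd

# Same data as the DataFrame built in Solution 2 (assumed here for illustration)
df = pd.DataFrame({
    'A': {'a': 1.0, 'b': 2.0, 'd': 4.0},
    'B': {'a': 2.0, 'c': 2.0, 'd': 5.0},
    'C': {'b': 1.0, 'c': 2.0, 'd': 4.0},
})

i = df.values.T          # shape (3, 4): one row per original column
diff = i - i[:, None]    # shape (3, 3, 4): diff[p, q] = i[q] - i[p]
sq = diff ** 2           # NaN wherever either column lacks a value in that row
j = np.nansum(sq, axis=2) ** .5   # sum over rows, counting NaN terms as 0

print(diff.shape)        # (3, 3, 4)
print(np.round(j, 6))
```

Inserting the extra axis with i[:, None] is what pairs every column with every other column; the reduction over axis=2 then collapses the row dimension into a single distance per pair.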

If you want a DataFrame representing a distance matrix, here's what that would look like:

df = pd.DataFrame(j, index=df.columns, columns=df.columns)
df
          A         B    C
A  0.000000  1.414214  1.0
B  1.414214  0.000000  1.0
C  1.000000  1.000000  0.0

df.loc[x, y] is the distance between columns x and y of the original DataFrame.
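One thing worth keeping in mind: because np.nansum counts missing terms as 0, two columns with no rows in common would get a distance of 0. The sketch below wraps the approach above in a function and also counts the shared rows per pair, so such pairs can be masked. The name pairwise_nan_euclidean and the masking step are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

def pairwise_nan_euclidean(frame):
    """Hypothetical helper: column-wise Euclidean distances, NaN terms counted as 0.

    Returns (distances, shared), where shared[p, q] is the number of rows
    in which both column p and column q have a value.
    """
    v = frame.to_numpy(dtype=float).T
    dist = np.nansum((v - v[:, None]) ** 2, axis=2) ** .5
    present = frame.notna().to_numpy().T.astype(int)
    shared = present @ present.T
    cols = frame.columns
    return (pd.DataFrame(dist, index=cols, columns=cols),
            pd.DataFrame(shared, index=cols, columns=cols))

data = pd.DataFrame({
    'A': {'a': 1.0, 'b': 2.0, 'd': 4.0},
    'B': {'a': 2.0, 'c': 2.0, 'd': 5.0},
    'C': {'b': 1.0, 'c': 2.0, 'd': 4.0},
})

dist, shared = pairwise_nan_euclidean(data)
# Optional: blank out pairs that share no rows instead of reporting 0
dist = dist.where(shared > 0)
print(dist)
print(shared)
```

In this data every pair of columns shares two rows, so the mask changes nothing; it only matters for sparser inputs.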

Solution 2:

The code below iterates over pairs of columns and calculates the absolute difference between them, row by row.

# Import libraries
import pandas as pd
import numpy as np

# Create dataframe
df = pd.DataFrame({'A': {'a':1.0, 'b':2.0, 'd':4.0}, 'B':{'a':2.0, 'c':2.0, 'd':5.0},'C':{'b':1.0,'c':2.0, 'd':4.0}})
df2 = pd.DataFrame()

# Calculate difference
clist = df.columns
for i in range(len(clist) - 1):
    for j in range(i + 1, len(clist)):  # visit each unordered pair once
        var = clist[i] + '-' + clist[j]
        df[var] = abs(df[clist[i]] - df[clist[j]])  # optional: append to the original frame
        df2[var] = abs(df[clist[i]] - df[clist[j]])  # optional: collect in a separate frame
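The loop above stores row-wise absolute differences for each pair of columns, not the distance itself. A minimal sketch of the remaining step, assuming the same data: square the differences, sum each column (pandas skips NaN in sum() by default), and take the square root. The names diffs and distances, and the use of itertools.combinations in place of the nested loop, are choices made for this illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': {'a': 1.0, 'b': 2.0, 'd': 4.0},
    'B': {'a': 2.0, 'c': 2.0, 'd': 5.0},
    'C': {'b': 1.0, 'c': 2.0, 'd': 4.0},
})

# One column of absolute differences per unordered pair of original columns
diffs = pd.DataFrame({
    f'{x}-{y}': (df[x] - df[y]).abs()
    for x, y in combinations(df.columns, 2)
})

# Euclidean distance per pair; rows where either value is missing are skipped
distances = np.sqrt((diffs ** 2).sum())
print(distances)
```

The result is a Series indexed by pair label ('A-B', 'A-C', 'B-C'), and its values match the off-diagonal entries of the distance matrix from Solution 1.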