
Pandas: Apply Function To Each Pair Of Columns

I have a function f(x, y) that takes two pandas Series and returns a floating-point number. I would like to apply f to each pair of columns in a DataFrame D and construct another DataFrame E

Solution 1:

You can avoid explicit loops by using NumPy's broadcasting.

Combined with np.vectorize() and an explicit signature, that gives us the following:

import numpy as np

vf = np.vectorize(f, signature='(n),(n)->()')  # f receives one pair of 1-D arrays per call
result = vf(D.T.values, D.T.values[:, None])

Notes:

  1. You can add a print statement (e.g. print(f'x:\n{x}\ny:\n{y}\n')) to your function to convince yourself it is doing the right thing.
  2. Your function f() is symmetric; if it is not (e.g. def f(x, y): return np.linalg.norm(x - y**2)), it matters which argument is extended with an extra dimension for broadcasting. With the expression above, you'll get the same result as your E. If instead you use result = vf(D.T.values[:, None], D.T.values), then you'll get its transpose.
  3. The result is a NumPy array, of course; if you want it back as a DataFrame, add:
df = pd.DataFrame(result, index=D.columns, columns=D.columns)
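
To make the notes above concrete, here is a minimal end-to-end sketch. The small DataFrame D and the names cols, E, expected and swapped are made up for illustration, and f() is the asymmetric example from note 2 rather than the function from the question. With the argument order used in the answer, entry [i, j] of the result is f(column j, column i); swapping the arguments yields the transpose.

```python
import numpy as np
import pandas as pd


def f(x, y):
    # Asymmetric example from note 2 (not the function from the question).
    return np.linalg.norm(x - y**2)


# Hypothetical toy data: 3 rows, 3 columns.
D = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                  "b": [4.0, 5.0, 6.0],
                  "c": [7.0, 8.0, 10.0]})

vf = np.vectorize(f, signature="(n),(n)->()")
cols = D.T.values                      # shape (n_columns, n_rows)
result = vf(cols, cols[:, None])       # shape (n_columns, n_columns)
E = pd.DataFrame(result, index=D.columns, columns=D.columns)

# Reference computed with explicit loops: row i, column j holds f(col j, col i).
expected = pd.DataFrame(
    [[f(D[j].values, D[i].values) for j in D.columns] for i in D.columns],
    index=D.columns,
    columns=D.columns,
)

# Swapping which argument gets the extra axis gives the transpose.
swapped = vf(cols[:, None], cols)

print(np.allclose(E.values, expected.values))
print(np.allclose(swapped, result.T))
```

Note that np.vectorize hands f() plain NumPy arrays, not pandas Series, so a function that relies on Series methods or index alignment would need adapting.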

BTW, if f() is really the one from your toy example, as I'm sure you already know, you can directly write:

df = D.T.dot(D)
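
The toy function itself is not shown here, so the following check assumes it is the inner product of two columns, which is the case where D.T.dot(D) applies. Under that assumption, a quick sketch confirms that both routes agree; the random data and the names via_vectorize and via_dot are illustrative only.

```python
import numpy as np
import pandas as pd


def f(x, y):
    # Assumption: the toy function is the inner product of two columns.
    return (x * y).sum()


rng = np.random.default_rng(0)
D = pd.DataFrame(rng.normal(size=(5, 3)), columns=list("abc"))

vf = np.vectorize(f, signature="(n),(n)->()")
via_vectorize = vf(D.T.values, D.T.values[:, None])  # NumPy array
via_dot = D.T.dot(D)                                 # DataFrame, labels kept

print(np.allclose(via_vectorize, via_dot.values))
```

A side benefit of the dot() version is that the result is already a DataFrame indexed by the column labels of D.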

Performance:

Performance-wise, the speed-up from broadcasting and np.vectorize over explicit loops is roughly 10x (stable over various matrix sizes). By contrast, D.T.dot(D) is more than 700x faster for size (100, 100), and, critically, the relative speed-up seems to grow with larger sizes (up to 12,000x faster in my tests, for size (200, 1000), which results in 1M loops). So, as usual, there is a strong incentive to try and find a way to implement your function f() using existing NumPy function(s)!
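
If you want to run this kind of comparison yourself, a rough timing harness could look like the sketch below. It assumes the inner-product toy function, uses a deliberately small random DataFrame, and the helper names (with_loops, with_vectorize, with_dot) are hypothetical. Absolute numbers and ratios depend on the machine and library versions, so do not expect them to match the figures quoted above.

```python
import timeit

import numpy as np
import pandas as pd


def f(x, y):
    # Assumption: inner product as the toy function.
    return (x * y).sum()


rng = np.random.default_rng(0)
D = pd.DataFrame(rng.normal(size=(100, 30)))

vf = np.vectorize(f, signature="(n),(n)->()")


def with_loops():
    # Explicit Python loops over every pair of columns.
    return pd.DataFrame(
        {i: {j: f(D[i], D[j]) for j in D.columns} for i in D.columns}
    )


def with_vectorize():
    a = D.T.values
    return vf(a, a[:, None])


def with_dot():
    return D.T.dot(D)


# Best of three single runs for each approach, in seconds.
timings = {
    fn.__name__: min(timeit.repeat(fn, number=1, repeat=3))
    for fn in (with_loops, with_vectorize, with_dot)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.2f} ms")
```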
