Function For Matching Values In Multiple Columns
Solution 1:
You can apply the threshold to each pair of columns, then sum the resulting boolean columns to obtain the count you need. Note, however, that this count depends on the order in which you compare the columns. This ambiguity goes away if you use abs(df['A']-df['B']) etc., which may well be what you intend. Below I'll assume this is what you need.
Generally, you can use itertools.combinations to produce each pair of columns once:
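from itertools import combinations
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(12, 3), columns=['A', 'B', 'C'])
thresh = .3
# count, per row, the column pairs whose absolute difference is below thresh
df['matches'] = sum(abs(df[k1]-df[k2])<thresh for k1,k2 in combinations(df.keys(),2))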
The generator expression inside sum() loops over every column pair and constructs the corresponding boolean vector. These vectors are added together across all pairs, and the resulting column of counts is appended to the dataframe.
Example output for thresh = 0.3:
A B C matches
0 0.146360 -0.099707 0.633632 1
1 1.462810 -0.186317 -1.411988 0
2 0.358827 -0.758619 0.038329 0
3 0.077122 -0.213856 -0.619768 1
4 0.215555 1.930888 -0.488517 0
5 -0.946557 -0.904743 -0.004738 1
6 -0.080209 -0.850830 -0.866865 1
7 -0.997710 -0.580679 -2.231168 0
8 1.762313 -0.356464 -1.813028 0
9 1.151338 0.347636 -1.323791 0
10 0.248432 1.265484 0.048484 1
11 0.559934 -0.401059 0.863616 0
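Because the output above comes from random data, it is hard to verify by eye. The following is a minimal sketch of the same idea on a small fixed frame whose counts can be checked by hand. The helper name count_close_pairs and the frame small are made up for illustration and are not part of the original answer.

```python
from itertools import combinations

import pandas as pd


def count_close_pairs(frame, thresh):
    """Count, per row, the column pairs whose absolute difference is below thresh."""
    # Start from an all-zero Series so the result is defined even when the
    # frame has fewer than two columns (hypothetical helper, for illustration).
    counts = pd.Series(0, index=frame.index)
    for left, right in combinations(frame.columns, 2):
        close = (frame[left] - frame[right]).abs() < thresh
        counts = counts + close.astype(int)
    return counts


small = pd.DataFrame({
    'A': [0.0, 1.0, 5.0],
    'B': [0.1, 1.2, 9.0],
    'C': [0.2, 3.0, 7.0],
})
small_matches = count_close_pairs(small, thresh=0.3)
print(small_matches.tolist())  # [3, 1, 0]
```

In the first row all three pairs are within 0.3 of each other, in the second only A and B are, and in the third none are.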
Using itertools.combinations, the columns are compared as
>>> list(combinations(df.keys(), 2))
[('A', 'B'), ('A', 'C'), ('B', 'C')]
but this really doesn't matter if you're using the absolute value (since then the difference is symmetric with respect to columns).
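To make the point about order concrete, here is a small sketch with made-up values (the frame pair is hypothetical). It compares the signed difference in both column orders against the absolute difference in both orders.

```python
import pandas as pd

# Two illustrative rows: A is below B in the first, above B in the second.
pair = pd.DataFrame({'A': [0.0, 2.0], 'B': [1.0, 0.0]})
thresh = 0.3

# Signed difference: the result flips when the columns are swapped.
signed_ab = (pair['A'] - pair['B'] < thresh).tolist()
signed_ba = (pair['B'] - pair['A'] < thresh).tolist()

# Absolute difference: the result is the same in either order.
abs_ab = ((pair['A'] - pair['B']).abs() < thresh).tolist()
abs_ba = ((pair['B'] - pair['A']).abs() < thresh).tolist()

print(signed_ab, signed_ba)  # [True, False] [False, True]
print(abs_ab, abs_ba)        # [False, False] [False, False]
```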
Solution 2:
Try this guy:
df2['matches'] = df2.apply(lambda x: sum(x.iloc[i] - x.iloc[j] <= thresh for i, j in [(0, 1), (0, 2), (1, 2)]), axis=1)
It could be generalized to any number of columns if necessary.
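One way the generalization might look, as a sketch only: replace the hard-coded index pairs with itertools.combinations over the row positions. The function name count_pairs_rowwise and the frame wide are assumptions for illustration. The sketch keeps the signed comparison of the one-liner above, so the count still depends on column order; wrap the difference in abs() if a symmetric comparison is wanted.

```python
from itertools import combinations

import pandas as pd


def count_pairs_rowwise(row, thresh):
    """Count position pairs (i < j) in one row with row[i] - row[j] <= thresh."""
    values = row.to_numpy()
    return sum(values[i] - values[j] <= thresh
               for i, j in combinations(range(len(values)), 2))


# Four columns give six pairs per row.
wide = pd.DataFrame([[0.0, 0.1, 0.2, 5.0],
                     [3.0, 2.0, 1.0, 0.0]], columns=list('ABCD'))
wide_matches = wide.apply(lambda row: count_pairs_rowwise(row, 0.3), axis=1)
print(wide_matches.tolist())  # [6, 0]
```

The ascending row matches on all six pairs and the descending row on none, which again shows how much the signed comparison depends on column order.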
Solution 3:
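Here's a way to do it; it counts, for each row, how many values exceed the threshold:
import numpy as np
import pandas as pd
df2 = pd.DataFrame(np.random.randn(12, 3), columns=['A', 'B', 'C'])
thresh = 0.3
newcol = []
for _, row in df2.iterrows():
    # number of values in this row greater than thresh
    newcol.append(sum(v > thresh for v in row))
df2['matches'] = newcol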
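The same per-row count can probably be obtained without an explicit loop by comparing the whole frame against the threshold and summing along the rows. This is a sketch on made-up data (df_demo is hypothetical), shown next to the loop version so the two can be compared.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'A': [0.5, 0.1, -1.0],
    'B': [0.4, 0.2, 0.31],
    'C': [0.0, 0.9, 0.3],
})
thresh = 0.3

# Loop version, as in the answer above.
loop_counts = [sum(v > thresh for v in row) for _, row in df_demo.iterrows()]

# Vectorized version: boolean frame, then count the True values in each row.
vector_counts = (df_demo > thresh).sum(axis=1)

print(loop_counts)             # [2, 1, 1]
print(vector_counts.tolist())  # [2, 1, 1]
```

Note that the comparison is strict, so the value 0.3 in the last row is not counted.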