Find Indices Of Duplicate Rows In Pandas Dataframe
What is the pandas way of finding the indices of identical rows within a given DataFrame without iterating over individual rows? While it is possible to find all unique rows with u
Solution 1:
Use duplicated with keep=False to select all duplicated rows, then groupby all columns and convert each group's index values to a tuple. Finally, convert the resulting Series to a list:
df = df[df.duplicated(keep=False)]
out = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist()
print(out)
[(1, 6), (2, 4), (3, 5)]
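To make the role of keep=False concrete, here is a minimal self-contained sketch. It rebuilds the sample frame used in this thread and adds one unique row (label 7), which is an assumption made only so that there is something to filter out. Instead of apply, it reads the groupby's .groups mapping, a variation on the answer above rather than the answer's own code:

```python
import pandas as pd

# Sample frame from this thread, plus one unique row (label 7) that is
# added here only to show that non-duplicates get filtered out.
df = pd.DataFrame(
    {
        "param_a": [0, 0, 2, 0, 2, 0, 9],
        "param_b": [0, 2, 1, 2, 1, 0, 9],
        "param_c": [0, 1, 1, 1, 1, 0, 9],
    },
    index=[1, 2, 3, 4, 5, 6, 7],
)

# keep=False marks every member of a duplicate group, not only the repeats.
mask_all = df.duplicated(keep=False)
# keep="first" (the default) leaves the first occurrence unmarked.
mask_first = df.duplicated(keep="first")

dupes = df[mask_all]

# .groups maps each distinct row (as a tuple key) to the index labels holding it.
groups = dupes.groupby(list(dupes.columns)).groups
index_groups = sorted(tuple(int(i) for i in labels) for labels in groups.values())
print(index_groups)  # [(1, 6), (2, 4), (3, 5)]
```

With the default keep="first", only rows 4, 5 and 6 would be flagged, so the first member of each group would be lost before grouping.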
If you also want to see the duplicated values:
df1 = (df.groupby(df.columns.tolist())
         .apply(lambda x: tuple(x.index))
         .reset_index(name='idx'))
print(df1)
   param_a  param_b  param_c     idx
0        0        0        0  (1, 6)
1        0        2        1  (2, 4)
2        2        1        1  (3, 5)
Solution 2:
Approach #1
Here's one vectorized approach inspired by this post -
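import numpy as np

def group_duplicate_index(df):
    a = df.values
    sidx = np.lexsort(a.T)  # row order that places identical rows next to each other
    b = a[sidx]
    # m[k] is True where sorted row k equals sorted row k-1 (padded with False at both ends)
    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False]))
    # positions where m flips: alternating first/last positions of each run of equal rows
    idx = np.flatnonzero(m[1:] != m[:-1])
    I = df.index[sidx].tolist()
    return [I[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]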
Sample run -
In [42]: df
Out[42]:
   param_a  param_b  param_c
1        0        0        0
2        0        2        1
3        2        1        1
4        0        2        1
5        2        1        1
6        0        0        0

In [43]: group_duplicate_index(df)
Out[43]: [[1, 6], [3, 5], [2, 4]]
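The intermediate arrays are easier to follow when traced on the sample frame. The sketch below runs the same steps as Approach #1 outside of a function; the names eq, labels and groups are introduced here for illustration only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "param_a": [0, 0, 2, 0, 2, 0],
        "param_b": [0, 2, 1, 2, 1, 0],
        "param_c": [0, 1, 1, 1, 1, 0],
    },
    index=[1, 2, 3, 4, 5, 6],
)

a = df.values
sidx = np.lexsort(a.T)  # positions that put identical rows next to each other
b = a[sidx]             # rows in sorted order

# eq[k] is True when sorted row k+1 equals sorted row k.
eq = (b[1:] == b[:-1]).all(axis=1)
# Pad with False so every run of True values has a detectable start and end.
m = np.concatenate(([False], eq, [False]))
# Positions where m flips: even entries are run starts, odd entries are run ends.
idx = np.flatnonzero(m[1:] != m[:-1])

labels = df.index[sidx].tolist()
groups = [labels[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]

print(m.tolist())    # [False, True, False, True, False, True, False]
print(idx.tolist())  # [0, 1, 2, 3, 4, 5]
print(groups)
```

Because every sorted row here belongs to a pair, m alternates and idx contains every position; with unique rows mixed in, runs of False would appear in m and the corresponding positions would be skipped.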
Approach #2
For DataFrames of non-negative integers, we can reduce each row to a single scalar, which lets us sort a 1D array instead and gives a more performant version, like so -
def group_duplicate_index_v2(df):
    a = df.values
    s = (a.max()+1)**np.arange(df.shape[1])
    sidx = a.dot(s).argsort()
    b = a[sidx]
    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False] ))
    idx = np.flatnonzero(m[1:] != m[:-1])
    I = df.index[sidx].tolist()
    return [I[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]
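The row-to-scalar step treats each row as the digits of a number in base a.max()+1. The sketch below isolates that step in a hypothetical helper, row_keys (not part of the answer), and shows why the values are assumed to be non-negative: with a negative entry, two different rows can land on the same scalar. Very large values or many columns could also overflow the integer dtype, which this sketch does not guard against.

```python
import numpy as np

def row_keys(a):
    """Encode each row of a 2D integer array as one scalar.

    Hypothetical helper: reads a row as digits in base (a.max() + 1),
    the same reduction group_duplicate_index_v2 applies before argsort.
    """
    base = a.max() + 1
    weights = base ** np.arange(a.shape[1])
    return a.dot(weights)

# Sample rows from this thread: base is 3, so the weights are 1, 3, 9.
a = np.array([[0, 0, 0], [0, 2, 1], [2, 1, 1], [0, 2, 1], [2, 1, 1], [0, 0, 0]])
keys = row_keys(a)
print(keys.tolist())  # [0, 15, 14, 15, 14, 0]

# With a negative entry, distinct rows can collide: 2*1 + 0*3 == -1*1 + 1*3.
bad = np.array([[2, 0], [-1, 1]])
bad_keys = row_keys(bad)
print(bad_keys.tolist())  # [2, 2]
```

Identical rows always share a key, so sorting by key places them side by side as long as no other row shares that key, which is what the non-negative assumption provides.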
Runtime test
Other approach(es) -
def groupby_app(df): # @jezrael's soln
    df = df[df.duplicated(keep=False)]
    df = df.groupby(df.columns.tolist()).apply(lambda x: tuple(x.index)).tolist()
    return df
Timings -
In [274]: df = pd.DataFrame(np.random.randint(0,10,(100000,3)))
In [275]: %timeit group_duplicate_index(df)
10 loops, best of 3: 36.1 ms per loop
In [276]: %timeit group_duplicate_index_v2(df)
100 loops, best of 3: 15 ms per loop
In [277]: %timeit groupby_app(df) # @jezrael's soln
10 loops, best of 3: 25.9 ms per loop
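Outside IPython, a similar comparison can be sketched with the standard timeit module. The snippet below is an illustration rather than the benchmark used above: it restates the two NumPy functions so it runs on its own, uses a smaller frame, and adds a hypothetical normalize helper because the approaches return their groups in different orders. No timings are quoted, since they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

def group_duplicate_index(df):
    a = df.values
    sidx = np.lexsort(a.T)
    b = a[sidx]
    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])
    I = df.index[sidx].tolist()
    return [I[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]

def group_duplicate_index_v2(df):
    a = df.values
    s = (a.max() + 1) ** np.arange(df.shape[1])
    sidx = a.dot(s).argsort()
    b = a[sidx]
    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])
    I = df.index[sidx].tolist()
    return [I[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]

def normalize(groups):
    # Hypothetical helper: make results comparable regardless of ordering.
    return sorted(tuple(sorted(g)) for g in groups)

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(10_000, 3)))

same = normalize(group_duplicate_index(df)) == normalize(group_duplicate_index_v2(df))
print("same groups:", same)

for fn in (group_duplicate_index, group_duplicate_index_v2):
    seconds = timeit.timeit(lambda: fn(df), number=10) / 10
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```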