Grouping Dataframe Based On Consecutive Occurrence Of Values
I have a pandas array which has one column which is either true or false (titled 'condition' in the example below). I would like to group the array by consecutive true or false val
Solution 1:
Since you're dealing with 0/1s, here's another alternative using diff + cumsum:
df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
df
condition H t group
index
0 1 2.0 1.1 1
1 1 7.0 1.5 1
2 0 1.0 0.9 2
3 0 6.5 1.6 2
4 1 7.0 1.1 3
5 1 9.0 1.8 3
6 1 22.0 2.0 3
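To see what each link in the diff + cumsum chain contributes, here is a minimal sketch that rebuilds the example frame from the printed output (the values are copied from the table, so the frame is an assumption about the original data; the names `steps`, `changed`, and `group` are only illustrative). It assumes `condition` holds integers 0/1, and the `.astype(int)` is there as a guard in case the column is boolean.

```python
import pandas as pd

# Example frame rebuilt from the printed output (assumed to match the original data).
df = pd.DataFrame({
    "condition": [1, 1, 0, 0, 1, 1, 1],
    "H": [2.0, 7.0, 1.0, 6.5, 7.0, 9.0, 22.0],
    "t": [1.1, 1.5, 0.9, 1.6, 1.1, 1.8, 2.0],
})

# diff(): difference to the previous row. NaN for the first row,
# +1 or -1 where the value changes, 0 where it stays the same.
steps = df["condition"].astype(int).diff()

# abs(): a change in either direction counts as 1.
changed = steps.abs()

# cumsum(): running count of changes. The leading NaN stays NaN,
# so it is filled with 0 before converting back to integers.
group = changed.cumsum().fillna(0).astype(int) + 1

df["group"] = group
print(df)
```

The `+ 1` only shifts the labels so that the first run is numbered 1 instead of 0; any grouping done on the labels is unaffected by it.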
If you don't mind floats, this can be made a little faster.
df['group'] = df.condition.diff().abs().cumsum() + 1
df.loc[0, 'group'] = 1
df
   index  condition     H    t  group
0      0          1   2.0  1.1    1.0
1      1          1   7.0  1.5    1.0
2      2          0   1.0  0.9    2.0
3      3          0   6.5  1.6    2.0
4      4          1   7.0  1.1    3.0
5      5          1   9.0  1.8    3.0
6      6          1  22.0  2.0    3.0
Here's the version with numpy equivalents -
df['group'] = 1
df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
df
condition H t group
index
0 1 2.0 1.1 1
1 1 7.0 1.5 1
2 0 1.0 0.9 2
3 0 6.5 1.6 2
4 1 7.0 1.1 3
5 1 9.0 1.8 3
6 1 22.0 2.0 3
On my machine, here are the timings -
df = pd.concat([df] *100000, ignore_index=True)
%timeit df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
10 loops, best of 3: 25.1 ms per loop
%%timeit
df['group'] = df.condition.diff().abs().cumsum() + 1
df.loc[0, 'group'] = 1
10 loops, best of 3: 23.4 ms per loop
%%timeit
df['group'] = 1
df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
10 loops, best of 3: 21.4 ms per loop
%timeit df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
100 loops, best of 3: 15.8 ms per loop
Solution 2:
Compare the column with its shifted copy using ne (!=), then take the cumsum:
df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
print (df)
condition H t group
index
0 1 2.0 1.1 1
1 1 7.0 1.5 1
2 0 1.0 0.9 2
3 0 6.5 1.6 2
4 1 7.0 1.1 3
5 1 9.0 1.8 3
6 1 22.0 2.0 3
Detail:
print (df['condition'].ne(df['condition'].shift()))
index
0     True
1    False
2     True
3    False
4     True
5    False
6    False
Name: condition, dtype: bool
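Once the run labels exist, they can be passed to groupby like any other key. A minimal sketch, assuming the example data shown above; the aggregations chosen here (`first`, `count`, `sum`, `mean`) and the output column names `n_rows`, `H_sum`, and `t_mean` are only illustrative and not part of the original answer.

```python
import pandas as pd

# Example frame rebuilt from the printed output (assumed to match the original data).
df = pd.DataFrame({
    "condition": [1, 1, 0, 0, 1, 1, 1],
    "H": [2.0, 7.0, 1.0, 6.5, 7.0, 9.0, 22.0],
    "t": [1.1, 1.5, 0.9, 1.6, 1.1, 1.8, 2.0],
})

# True wherever the value differs from the previous row. The first row is
# compared against NaN, so it is always True and starts run number 1.
starts_new_run = df["condition"].ne(df["condition"].shift())

# Counting the True values cumulatively gives one label per run.
# The rename avoids a clash between the key name and the 'condition' column.
run_id = starts_new_run.cumsum().rename("group")

# Aggregate each run of consecutive equal values.
summary = df.groupby(run_id).agg(
    condition=("condition", "first"),
    n_rows=("condition", "count"),
    H_sum=("H", "sum"),
    t_mean=("t", "mean"),
)
print(summary)
```

Because ne + shift only tests for inequality, this variant should work for boolean or string columns as well, whereas the diff-based version relies on numeric 0/1 values.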
Timings:
df = pd.concat([df]*100000).reset_index(drop=True)
In [54]: %timeit df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
100 loops, best of 3: 12.2 ms per loop
In [55]: %timeit df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
10 loops, best of 3: 24.5 ms per loop
In [56]: %%timeit
...: df['group'] = 1
...: df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
...:
10 loops, best of 3: 26.6 ms per loop
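The timings above come from IPython's %timeit. For a rough check outside IPython, the sketch below wraps the three approaches in functions, confirms they produce the same labels, and times them with the standard-library `timeit` module. The function names `by_shift`, `by_diff`, and `by_numpy` are made up for this example, the frame is smaller than the one used above to keep the run short, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Only the 'condition' column matters for the labels (values assumed from the example).
base = pd.DataFrame({"condition": [1, 1, 0, 0, 1, 1, 1]})
# Repeat the example rows; 1000 copies here instead of 100000.
big = pd.concat([base] * 1000, ignore_index=True)
cond = big["condition"]

def by_shift(s):
    # Compare each value with the previous one, then count the changes.
    return s.ne(s.shift()).cumsum()

def by_diff(s):
    # Absolute row-to-row difference, accumulated; valid for 0/1 data only.
    return s.diff().abs().cumsum().fillna(0).astype(int) + 1

def by_numpy(s):
    # Same idea on the underlying array; the first row is label 1 by construction.
    out = np.ones(len(s), dtype=int)
    out[1:] = np.cumsum(np.abs(np.diff(s.to_numpy()))) + 1
    return pd.Series(out, index=s.index)

labels = by_shift(cond)
same = (np.array_equal(labels.to_numpy(), by_diff(cond).to_numpy())
        and np.array_equal(labels.to_numpy(), by_numpy(cond).to_numpy()))
print("all three agree:", same)

for fn in (by_shift, by_diff, by_numpy):
    seconds = timeit.timeit(lambda: fn(cond), number=20)
    print(f"{fn.__name__}: {seconds / 20 * 1e3:.3f} ms per call")
```

Note that stacking copies of the example merges the trailing run of 1s in one copy with the leading 1s of the next, so the repeated frame has fewer runs than three per copy.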