Grouping Dataframe Based On Consecutive Occurrence Of Values

I have a pandas array which has one column which is either true or false (titled 'condition' in the example below). I would like to group the array by consecutive true or false val

Solution 1:

Since you're dealing with 0/1s, here's another alternative using diff + cumsum -

df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
df

       condition     H    t  group
index                             
0              1   2.0  1.1      1
1              1   7.0  1.5      1
2              0   1.0  0.9      2
3              0   6.5  1.6      2
4              1   7.0  1.1      3
5              1   9.0  1.8      3
6              1  22.0  2.0      3
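For anyone who wants to run this end to end, here is a minimal sketch. The sample frame is rebuilt from the output shown above (the original construction code is not part of this answer, so treat the constructor call as an assumption).

```python
import pandas as pd

# Sample data reconstructed from the printed output; the way the
# original frame was built is an assumption.
df = pd.DataFrame(
    {
        "condition": [1, 1, 0, 0, 1, 1, 1],
        "H": [2.0, 7.0, 1.0, 6.5, 7.0, 9.0, 22.0],
        "t": [1.1, 1.5, 0.9, 1.6, 1.1, 1.8, 2.0],
    }
)
df.index.name = "index"

# diff() is non-zero wherever the value changes, abs() turns -1 into 1,
# cumsum() counts the changes seen so far, and fillna(0) covers the
# first row, which has no predecessor.
df["group"] = df["condition"].diff().abs().cumsum().fillna(0).astype(int) + 1
print(df)
```

The trailing `+ 1` only makes the labels start at 1 instead of 0; the grouping itself does not depend on it.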

If you don't mind floats, this can be made a little faster.

df['group'] = df.condition.diff().abs().cumsum() + 1
df.loc[0, 'group'] = 1
df

   index  condition     H    t  group
0      0          1   2.0  1.1    1.0
1      1          1   7.0  1.5    1.0
2      2          0   1.0  0.9    2.0
3      3          0   6.5  1.6    2.0
4      4          1   7.0  1.1    3.0
5      5          1   9.0  1.8    3.0
6      6          1  22.0  2.0    3.0

Here's the version with numpy equivalents -

df['group'] = 1
df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
df


       condition     H    t  group
index                             
0              1   2.0  1.1      1
1              1   7.0  1.5      1
2              0   1.0  0.9      2
3              0   6.5  1.6      2
4              1   7.0  1.1      3
5              1   9.0  1.8      3
6              1  22.0  2.0      3
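The question describes the column as true/false, while the examples here use 0/1. As a hedged variant of the numpy version (not the code from this answer), the sketch below casts booleans to integers first and builds the whole column in one assignment instead of patching rows through `.loc`. The sample values are assumed to mirror the frame above.

```python
import numpy as np
import pandas as pd

# Boolean version of the sample column (an assumption for illustration).
df = pd.DataFrame({"condition": [True, True, False, False, True, True, True]})

# Cast to integers so the subtraction inside np.diff is unambiguous.
values = df["condition"].astype(int).to_numpy()

# np.diff returns len(values) - 1 elements, so a leading 0 is prepended
# to give the first row a label as well.
changes = np.abs(np.diff(values))
df["group"] = np.concatenate(([0], np.cumsum(changes))) + 1
print(df)
```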

On my machine, here are the timings -

df = pd.concat([df] * 100000, ignore_index=True)

%timeit df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
10 loops, best of 3: 25.1 ms per loop

%%timeit
df['group'] = df.condition.diff().abs().cumsum() + 1
df.loc[0, 'group'] = 1
10 loops, best of 3: 23.4 ms per loop

%%timeit
df['group'] = 1
df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
10 loops, best of 3: 21.4 ms per loop

%timeit df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
100 loops, best of 3: 15.8 ms per loop

Solution 2:

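Compare the column with a shifted copy of itself using ne (!=), then take the cumsum. Each True marks the start of a new run, so the running total gives the group number:

df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
print(df)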
       condition     H    t  group
index                             
0              1   2.0  1.1      1
1              1   7.0  1.5      1
2              0   1.0  0.9      2
3              0   6.5  1.6      2
4              1   7.0  1.1      3
5              1   9.0  1.8      3
6              1  22.0  2.0      3

Detail:

print(df['condition'].ne(df['condition'].shift()))
index
0     True
1    False
2     True
3    False
4     True
5    False
6    False
Name: condition, dtype: bool
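Since the goal in the question is to group by consecutive values, the cumulative key can also be passed straight to `groupby` without storing it as a column first. The following is a minimal sketch under that assumption; the aggregations and the output column names (`rows`, `H_sum`) are purely illustrative and not from the original question.

```python
import pandas as pd

# Sample data reconstructed from the printed output above.
df = pd.DataFrame(
    {
        "condition": [1, 1, 0, 0, 1, 1, 1],
        "H": [2.0, 7.0, 1.0, 6.5, 7.0, 9.0, 22.0],
        "t": [1.1, 1.5, 0.9, 1.6, 1.1, 1.8, 2.0],
    }
)

# True at the start of each run; the running sum labels the runs 1, 2, 3, ...
key = df["condition"].ne(df["condition"].shift()).cumsum().rename("group")

# Illustrative per-run summary: the run's value, its length, and the sum of H.
summary = df.groupby(key).agg(
    condition=("condition", "first"),
    rows=("condition", "size"),
    H_sum=("H", "sum"),
)
print(summary)
```

Renaming the key to `group` keeps the result's index name from clashing with the `condition` column.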

Timings:

df = pd.concat([df]*100000).reset_index(drop=True)


In [54]: %timeit df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
100 loops, best of 3: 12.2 ms per loop

In [55]: %timeit df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
10 loops, best of 3: 24.5 ms per loop

In [56]: %%timeit
    ...: df['group'] = 1
    ...: df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
    ...: 
10 loops, best of 3: 26.6 ms per loop
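The `%timeit` lines above only work inside IPython. A rough way to repeat the comparison in a plain script is sketched below with the standard `timeit` module. The function names are hypothetical, the numpy variant builds the column in one step rather than via `.loc`, and a smaller multiplier than the 100000 used above keeps the run short. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({"condition": [1, 1, 0, 0, 1, 1, 1]})
big = pd.concat([base] * 10_000, ignore_index=True)
cond = big["condition"]


def with_diff(s):
    # diff + cumsum approach from Solution 1
    return s.diff().abs().cumsum().fillna(0).astype(int) + 1


def with_shift(s):
    # ne + shift + cumsum approach from Solution 2
    return s.ne(s.shift()).cumsum()


def with_numpy(s):
    # numpy variant; prepends 0 so the first row gets a label too
    changes = np.abs(np.diff(s.to_numpy()))
    return np.concatenate(([0], np.cumsum(changes))) + 1


# Sanity check: all three approaches should label the runs identically.
same_labels = (
    with_diff(cond).tolist()
    == with_shift(cond).tolist()
    == with_numpy(cond).tolist()
)

timings = {}
for fn in (with_diff, with_shift, with_numpy):
    # Best of 3 repeats of 10 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(lambda: fn(cond), number=10, repeat=3))
    timings[fn.__name__] = best / 10 * 1000
    print(f"{fn.__name__}: {timings[fn.__name__]:.2f} ms per call")
```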