Select Sample Random Groups After Groupby In Pandas?

December 27, 2023

I have a very large DataFrame that looks like this example df:

df =
col1    col2    col3
apple   red     2.99
apple   red     2.99
apple   red     1.99
apple   pink

Solution 1:

You can do this by shuffling the group numbers and filtering on ngroup:

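import numpy as np

g = df.groupby(['col1', 'col2'])

# Shuffle the group numbers 0..ngroups-1
a = np.arange(g.ngroups)
np.random.shuffle(a)

# Keep the rows whose group number is among the first 2 shuffled ones
df[g.ngroup().isin(a[:2])]  # change 2 to the number of groups you need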
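
To try this out on a small frame, here is a minimal self-contained sketch of the same idea. The sample data and the helper name sample_groups are made up for illustration, and it uses a seeded NumPy Generator (rng.permutation) in place of np.random.shuffle only so that the run is reproducible.

```python
import numpy as np
import pandas as pd


def sample_groups(df, by, n, seed=None):
    """Return all rows of n randomly chosen groups (hypothetical helper)."""
    g = df.groupby(by)
    rng = np.random.default_rng(seed)
    # Randomly order the group numbers 0..ngroups-1 and keep the first n
    chosen = rng.permutation(g.ngroups)[:n]
    # ngroup() labels every row with the number of the group it belongs to
    return df[g.ngroup().isin(chosen)]


# Illustrative data, not the original poster's real DataFrame
df = pd.DataFrame({
    'col1': ['apple'] * 6 + ['pear'] * 3,
    'col2': ['red'] * 3 + ['pink'] * 3 + ['green'] * 3,
    'col3': [2.99, 2.99, 1.99, 1.99, 1.99, 2.99, 0.99, 0.99, 1.29],
})

result = sample_groups(df, ['col1', 'col2'], n=2, seed=0)
print(result)
```

Every group is equally likely to be picked here, regardless of how many rows it has, and the rows keep their original index.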

Solution 2:

Shuffle your dataframe using sample, and then perform a non-sorting groupby:

df = df.sample(frac=1)
df2 = pd.concat(
    [g for _, g in df.groupby(['col1', 'col2'], sort=False, as_index=False)][:3],
    ignore_index=True 
)

If you only need the first 3 rows of each selected group, call head(3) on each group:

df2 = pd.concat(
    [g.head(3) for _, g in df.groupby(['col1', 'col2'], sort=False, as_index=False)][:3],
    ignore_index=True 
)
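
Here is a small runnable sketch of the shuffle-then-groupby approach, assuming made-up data and a fixed random_state (added only for reproducibility). With sort=False the groups come out in the order they first appear in the shuffled frame, which is what makes taking the first few of them a random pick.

```python
import pandas as pd

# Illustrative data, not the original poster's real DataFrame
df = pd.DataFrame({
    'col1': ['apple'] * 6 + ['pear'] * 3,
    'col2': ['red'] * 3 + ['pink'] * 3 + ['green'] * 3,
    'col3': [2.99, 2.99, 1.99, 1.99, 1.99, 2.99, 0.99, 0.99, 1.29],
})

# Shuffle all rows
shuffled = df.sample(frac=1, random_state=0)

# sort=False yields groups in order of first appearance in the shuffled frame
groups = [g for _, g in shuffled.groupby(['col1', 'col2'], sort=False)]

# Keep the first 2 groups and stack them into one frame
picked = pd.concat(groups[:2], ignore_index=True)
print(picked)
```

One thing to keep in mind: a group with more rows is more likely to show up early in the shuffled frame, so the groups are only picked with equal probability when they all have the same size.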

Solution 3:

If the groups are defined by a single column, this is also an alternative:

df.loc[df['col1'].isin(pd.Series(df['col1'].unique()).sample(2))]

A longer example:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
...                    'col2': np.random.randint(5, size=9),
...                    'col3': np.random.randint(5, size=9)
...                   })
>>> df
  col1  col2  col3
0    a     4     3
1    a     3     0
2    a     4     0
3    b     4     4
4    b     4     1
5    b     1     3
6    c     4     4
7    c     3     2
8    c     3     1
>>> sample = pd.Series(df['col1'].unique()).sample(2)
>>> sample
1    b
2    c
dtype: object
>>> df.loc[df['col1'].isin(sample)]
  col1  col2  col3
3    b     4     4
4    b     4     1
5    b     1     3
6    c     4     4
7    c     3     2
8    c     3     1

Solution 4:

This is one way:

from io import StringIO
import pandas as pd
import numpy as np

np.random.seed(100)

data = """
col1    col2     col3
apple   red      2.99
apple   red      2.99
apple   red      1.99
apple   pink     1.99
apple   pink     1.99
apple   pink     2.99
pear    green     .99
pear    green     .99
pear    green    1.29
"""

# Number of groups
K = 2

df = pd.read_table(StringIO(data), sep=' ', skip_blank_lines=True, skipinitialspace=True)
# Use columns as indices
df2 = df.set_index(['col1', 'col2'])
# Choose random sample of indices
idx = np.random.choice(df2.index.unique(), K, replace=False)
# Select
selection = df2.loc[idx].reset_index(drop=False)
print(selection)

Output:

    col1   col2  col3
0  apple   pink  1.99
1  apple   pink  1.99
2  apple   pink  2.99
3   pear  green  0.99
4   pear  green  0.99
5   pear  green  1.29
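
A variant of the same idea, sketched here as an alternative rather than as this answer's method: sample the distinct key pairs with drop_duplicates and sample, then merge them back onto the frame. The data is re-typed from the example above and random_state is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['apple'] * 6 + ['pear'] * 3,
    'col2': ['red'] * 3 + ['pink'] * 3 + ['green'] * 3,
    'col3': [2.99, 2.99, 1.99, 1.99, 1.99, 2.99, 0.99, 0.99, 1.29],
})

# Number of groups
K = 2

# One row per distinct (col1, col2) pair, then pick K of those pairs at random
keys = df[['col1', 'col2']].drop_duplicates().sample(n=K, random_state=100)

# An inner merge keeps only the rows whose key pair was sampled
selection = df.merge(keys, on=['col1', 'col2'], how='inner')
print(selection)
```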

Solution 5:

I turned @Arvid Baarnhielm's answer into a simple function:

def sampleCluster(df:pd.DataFrame, columnCluster:str, fraction) -> pd.DataFrame:
    return df.loc[df[columnCluster].isin(pd.Series(df[columnCluster].unique()).sample(frac=fraction))]
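
A quick usage sketch with made-up data. The random_state parameter is not part of the function above; it is added here only so the run is reproducible.

```python
import pandas as pd


def sampleCluster(df: pd.DataFrame, columnCluster: str, fraction: float,
                  random_state=None) -> pd.DataFrame:
    # Sample a fraction of the distinct values in the column ...
    keys = pd.Series(df[columnCluster].unique()).sample(
        frac=fraction, random_state=random_state
    )
    # ... and keep every row that carries one of the sampled values
    return df.loc[df[columnCluster].isin(keys)]


# Illustrative frame: 4 clusters of 3 rows each
df = pd.DataFrame({'col1': list('aaabbbcccddd'), 'col2': range(12)})

half = sampleCluster(df, 'col1', 0.5, random_state=1)
print(half)
```

With 4 distinct values and fraction=0.5, two clusters are kept, each with all of its rows.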
