Select Sample Random Groups After Groupby In Pandas?
I have a very large DataFrame that looks like this example df:
df =
col1 col2 col3
apple red 2.99
apple red 2.99
apple red 1.99
apple pink
Solution 1:
You can do this with np.random.shuffle and ngroup:
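import numpy as np
g = df.groupby(['col1', 'col2'])
a = np.arange(g.ngroups)  # one integer label per group
np.random.shuffle(a)  # shuffle the labels in place
df[g.ngroup().isin(a[:2])]  # keep rows from 2 random groups; change 2 to what you need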
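The same idea as a self-contained, repeatable sketch. The toy frame and the helper name `sample_groups` are made up for illustration, and a seeded `np.random.default_rng` is swapped in for `np.random.shuffle` so the draw can be reproduced:

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration (not the asker's real frame).
df = pd.DataFrame({
    "col1": ["apple"] * 6 + ["pear"] * 3,
    "col2": ["red"] * 3 + ["pink"] * 3 + ["green"] * 3,
    "col3": [2.99, 2.99, 1.99, 1.99, 1.99, 2.99, 0.99, 0.99, 1.29],
})


def sample_groups(frame, keys, n, seed=None):
    """Return all rows of n randomly chosen groups (hypothetical helper)."""
    g = frame.groupby(keys)
    rng = np.random.default_rng(seed)
    # Draw n distinct group numbers from 0 .. ngroups - 1.
    chosen = rng.choice(g.ngroups, size=n, replace=False)
    # ngroup() labels every row with its group number, aligned to frame's index.
    return frame[g.ngroup().isin(chosen)]


picked = sample_groups(df, ["col1", "col2"], n=2, seed=0)
print(picked)
```

Every row of a selected group is kept, so the result holds whole groups rather than a row-level sample.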
Solution 2:
Shuffle your dataframe using sample, and then perform a non-sorting groupby:
df = df.sample(frac=1)
df2 = pd.concat(
    [g for _, g in df.groupby(['col1', 'col2'], sort=False, as_index=False)][:3],
    ignore_index=True
)
If you need only the first 3 rows of each selected group, use head(3):
df2 = pd.concat(
    [g.head(3) for _, g in df.groupby(['col1', 'col2'], sort=False, as_index=False)][:3],
    ignore_index=True
)
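Both snippets above depend on sort=False. With the default sort=True, groupby yields groups in sorted key order, so slicing [:3] would pick the same groups no matter how the rows were shuffled. A small sketch with a made-up frame shows the difference in iteration order:

```python
import pandas as pd

# Made-up frame; the keys first appear in the order b, a, c.
df = pd.DataFrame({"key": ["b", "a", "c", "a", "b"], "val": range(5)})

# sort=False: groups come out in order of first appearance.
first_seen = [k for k, _ in df.groupby("key", sort=False)]

# Default sort=True: groups come out in sorted key order.
sorted_keys = [k for k, _ in df.groupby("key")]

print(first_seen)   # ['b', 'a', 'c']
print(sorted_keys)  # ['a', 'b', 'c']
```

After a shuffle with df.sample(frac=1), the order of first appearance is itself random, which is what makes the slice a random choice of groups.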
Solution 3:
In cases where you need to do this type of sampling in only one column, this is also an alternative:
df.loc[df['col1'].isin(pd.Series(df['col1'].unique()).sample(2))]
A longer example:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
...                    'col2': np.random.randint(5, size=9),
...                    'col3': np.random.randint(5, size=9)
...                    })
>>> df
col1 col2 col3
0 a 4 3
1 a 3 0
2 a 4 0
3 b 4 4
4 b 4 1
5 b 1 3
6 c 4 4
7 c 3 2
8 c 3 1
>>> sample = pd.Series(df['col1'].unique()).sample(2)
>>> sample
1 b
2 c
dtype: object
>>> df.loc[df['col1'].isin(sample)]
col1 col2 col3
3 b 4 4
4 b 4 1
5 b 1 3
6 c 4 4
7 c 3 2
8 c 3 1
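The unique-then-sample idea can probably be stretched to several key columns as well. The following is only a sketch and not part of the answer above: it swaps `isin` for `drop_duplicates` plus an inner `merge`, and the toy frame and the `random_state` value are assumptions made for illustration.

```python
import pandas as pd

# Toy data invented for illustration.
df = pd.DataFrame({
    "col1": ["apple"] * 6 + ["pear"] * 3,
    "col2": ["red"] * 3 + ["pink"] * 3 + ["green"] * 3,
    "col3": [2.99, 2.99, 1.99, 1.99, 1.99, 2.99, 0.99, 0.99, 1.29],
})
keys = ["col1", "col2"]

# One row per distinct key combination, then sample 2 of those combinations.
chosen = df[keys].drop_duplicates().sample(n=2, random_state=0)

# The inner merge keeps only rows whose key combination was sampled.
# Note that merge returns a fresh 0..n-1 index rather than df's original one.
result = df.merge(chosen, on=keys, how="inner")
print(result)
```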
Solution 4:
This is one way:
from io import StringIO
import pandas as pd
import numpy as np
np.random.seed(100)
data = """
col1 col2 col3
apple red 2.99
apple red 2.99
apple red 1.99
apple pink 1.99
apple pink 1.99
apple pink 2.99
pear green .99
pear green .99
pear green 1.29
"""
# Number of groups
K = 2
df = pd.read_table(StringIO(data), sep=' ', skip_blank_lines=True, skipinitialspace=True)
# Use columns as indices
df2 = df.set_index(['col1', 'col2'])
# Choose random sample of indices
idx = np.random.choice(df2.index.unique(), K, replace=False)
# Select
selection = df2.loc[idx].reset_index(drop=False)
print(selection)
Output:
col1 col2 col3
0 apple pink 1.99
1 apple pink 1.99
2 apple pink 2.99
3 pear green 0.99
4 pear green 0.99
5 pear green 1.29
Solution 5:
I turned @Arvid Baarnhielm's answer into a simple function:
def sampleCluster(df: pd.DataFrame, columnCluster: str, fraction) -> pd.DataFrame:
    return df.loc[df[columnCluster].isin(pd.Series(df[columnCluster].unique()).sample(frac=fraction))]
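A quick usage sketch, with the function restated so the snippet runs on its own. The frame below is made up for illustration; with 4 distinct values in col1 and fraction=0.5, two values are drawn and all of their rows are returned:

```python
import pandas as pd


def sampleCluster(df: pd.DataFrame, columnCluster: str, fraction) -> pd.DataFrame:
    # Sample a fraction of the distinct values, then keep every row carrying one of them.
    unique_values = pd.Series(df[columnCluster].unique())
    return df.loc[df[columnCluster].isin(unique_values.sample(frac=fraction))]


# Made-up frame: 4 clusters (a, b, c, d) with 3 rows each.
df = pd.DataFrame({"col1": list("aaabbbcccddd"), "col2": range(12)})

half = sampleCluster(df, "col1", 0.5)
print(half)
```

Because the fraction applies to the number of distinct values and not to the number of rows, uneven cluster sizes would give a result whose row count is not exactly that fraction of the frame.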