
How To Remove, Randomly, Rows From A Dataframe But From Each Label?

This is for a machine learning project. I have a dataframe with 5 feature columns and 1 label column (Figure A). I want to randomly remove 2 rows from each label. So, a

Solution 1:

With groupby.apply:

df.groupby('label', as_index=False).apply(lambda x: x.sample(2)) \
                                   .reset_index(level=0, drop=True)
Out:
           0         1         2         3         4  label
s1  0.433731  0.886622  0.683993  0.125918  0.398787      1
s1  0.719834  0.435971  0.935742  0.885779  0.460693      1
s2  0.324877  0.962413  0.366274  0.980935  0.487806      2
s2  0.600318  0.633574  0.453003  0.291159  0.223662      2
s3  0.741116  0.167992  0.513374  0.485132  0.550467      3
s3  0.301959  0.843531  0.654343  0.726779  0.594402      3

A cleaner way in my opinion would be with a comprehension:

pd.concat(g.sample(2) for idx, g in df.groupby('label'))

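which would yield the same kind of result: two rows per label, though the sampling is random, so the rows picked differ from run to run: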

           0         1         2         3         4  label
s1  0.442293  0.470318  0.559764  0.829743  0.146971      1
s1  0.603235  0.218269  0.516422  0.295342  0.466475      1
s2  0.569428  0.109494  0.035729  0.548579  0.760698      2
s2  0.600318  0.633574  0.453003  0.291159  0.223662      2
s3  0.412750  0.079504  0.433272  0.136108  0.740311      3
s3  0.462627  0.025328  0.245863  0.931857  0.576927      3
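Since the question asks to remove rows rather than keep them, here is a minimal sketch of how the sampled rows could be dropped afterwards. The toy dataframe, the seed, and the names picked and remaining are assumptions for illustration, and the drop step only works as intended when the index is unique (unlike the s1/s2/s3 index shown above).

```python
import numpy as np
import pandas as pd

# Toy data (an assumption): 12 rows, 5 feature columns, 3 labels, unique integer index.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((12, 5)))
df["label"] = np.repeat([1, 2, 3], 4)

# Pick 2 random rows from each label group.
picked = pd.concat(g.sample(2, random_state=0) for _, g in df.groupby("label"))

# Drop the picked rows by index label; this relies on the index being unique.
remaining = df.drop(picked.index)

print(picked.groupby("label").size())
print(remaining.groupby("label").size())
```

With a non-unique index, calling reset_index(drop=True) first would be one way to make the drop safe.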

Solution 2:

Here is a pretty straightforward way. Shuffle all the rows with sample(frac=1), then compute the cumulative count within each label. cumcount starts at 0, so selecting the rows with a value of 1 or less keeps the first two shuffled rows of each label.

df.loc[df.sample(frac=1).groupby('label').cumcount() <= 1]
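To make the steps easier to follow, here is the same idea spelled out on toy data. The dataframe, the seed, and the variable names are assumptions for illustration. The one-liner above relies on pandas aligning the boolean mask back to df by index, which presumably needs a unique index; the sketch sidesteps that by filtering the shuffled frame directly.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption): 12 rows, 5 feature columns, 3 labels, unique integer index.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((12, 5)))
df["label"] = np.repeat([1, 2, 3], 4)

# Step 1: shuffle all rows.
shuffled = df.sample(frac=1, random_state=0)

# Step 2: number the rows within each label in shuffled order (0, 1, 2, ...).
position = shuffled.groupby("label").cumcount()

# Step 3: keep the first two rows of each label, then restore the original order.
picked = shuffled[position <= 1].sort_index()

print(picked)
```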

And here it is with sklearn's stratified kfold. Example taken from here

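from sklearn.model_selection import StratifiedKFold

X = df[[0, 1, 2, 3, 4]]  # feature columns
y = df.label             # label column
skf = StratifiedKFold(n_splits=2)

# split() yields positional indices, so select with .iloc rather than .loc
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

# after the loop, these variables hold the split from the last fold
print(X_train)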

          0         1         2         3         4
0  0.656240  0.904032  0.256067  0.916293  0.262773
1  0.526509  0.555683  0.667756  0.208831  0.699438
4  0.096499  0.688737  0.328670  0.260733  0.834091
5  0.320150  0.602197  0.793404  0.911291  0.269915
8  0.913669  0.171831  0.534418  0.862583  0.994561
9  0.718337  0.256351  0.348813  0.420952  0.622890

print(y_train)

0    1
1    1
4    2
5    2
8    3
9    3
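For a version that runs on its own, the sketch below builds toy data and records the label counts of every fold rather than only the last one. The data, the shuffle=True and random_state=0 settings, and the folds list are assumptions added for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy data (an assumption): 12 rows, 5 feature columns, 4 rows per label.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((12, 5)))
y = pd.Series(np.repeat([1, 2, 3], 4), name="label")

# shuffle=True randomizes which rows land in each fold; the seed makes it repeatable.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

folds = []
for train_index, test_index in skf.split(X, y):
    # split() returns positional indices, hence .iloc
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    folds.append((y_train.value_counts().to_dict(), y_test.value_counts().to_dict()))

for i, (train_counts, test_counts) in enumerate(folds):
    print(i, train_counts, test_counts)
```

With 4 rows per label and 2 splits, each fold should hold 2 rows of every label on both the train and the test side.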
