How To Remove, Randomly, Rows From A Dataframe But From Each Label?
This is for a machine learning project. I have a dataframe with 5 columns as features and 1 column as label (Figure A). I want to randomly remove 2 rows but from each label. So, a
Solution 1:
With groupby.apply:
df.groupby('label', as_index=False).apply(lambda x: x.sample(2)) \
.reset_index(level=0, drop=True)
Out:
           0         1         2         3         4  label
s1  0.433731  0.886622  0.683993  0.125918  0.398787      1
s1  0.719834  0.435971  0.935742  0.885779  0.460693      1
s2  0.324877  0.962413  0.366274  0.980935  0.487806      2
s2  0.600318  0.633574  0.453003  0.291159  0.223662      2
s3  0.741116  0.167992  0.513374  0.485132  0.550467      3
s3  0.301959  0.843531  0.654343  0.726779  0.594402      3
A cleaner way in my opinion would be with a comprehension:
pd.concat(g.sample(2) for idx, g in df.groupby('label'))
which would yield a result of the same shape (the sampled rows differ from run to run because the sampling is random):
           0         1         2         3         4  label
s1  0.442293  0.470318  0.559764  0.829743  0.146971      1
s1  0.603235  0.218269  0.516422  0.295342  0.466475      1
s2  0.569428  0.109494  0.035729  0.548579  0.760698      2
s2  0.600318  0.633574  0.453003  0.291159  0.223662      2
s3  0.412750  0.079504  0.433272  0.136108  0.740311      3
s3  0.462627  0.025328  0.245863  0.931857  0.576927      3
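Both snippets above keep 2 random rows per label. If the goal is to remove those rows from the dataframe instead, one possible sketch is to sample first and then drop the sampled rows by index. The dataframe below is a made-up stand-in for the question's data (12 rows, feature columns 0 to 4, labels 1 to 3, default integer index), and the names picked and remaining are only illustrative. It uses GroupBy.sample, which is available in pandas 1.1 and later.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data: 12 rows, 5 feature columns
# named 0..4, and a 'label' column with 4 rows for each of the labels 1, 2, 3.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((12, 5)))
df["label"] = np.repeat([1, 2, 3], 4)

# Pick 2 random rows per label (GroupBy.sample needs pandas >= 1.1).
picked = df.groupby("label").sample(n=2, random_state=0)

# "Removing" those rows means dropping them by index label. This is only
# safe when the index is unique (true for the default RangeIndex used here).
remaining = df.drop(picked.index)

print(picked["label"].value_counts().sort_index())
print(remaining["label"].value_counts().sort_index())
```

If the index contains repeated labels, dropping by index would remove more rows than intended, so calling reset_index() first is probably the safer route in that case.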
Solution 2:
Here is a pretty straightforward way. Shuffle all the rows with sample(frac=1),
then compute the cumulative count within each label and keep the rows whose count is 1 or less, i.e. the first two rows of each label in the shuffled order.
df.loc[df.sample(frac=1).groupby('label').cumcount() <= 1]
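To make the shuffle-and-count idea easier to follow, here is a step-by-step sketch on a made-up dataframe (12 rows, labels 1 to 3, default unique integer index; these are assumptions, not the question's actual data). The intermediate names shuffled, rank, kept and rest are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 5 feature columns and 4 rows per label.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((12, 5)))
df["label"] = np.repeat([1, 2, 3], 4)

# Step 1: shuffle every row (random_state only makes the run repeatable).
shuffled = df.sample(frac=1, random_state=0)

# Step 2: number the rows inside each label as 0, 1, 2, ... in shuffled order.
rank = shuffled.groupby("label").cumcount()

# Step 3: rows ranked 0 or 1 are the two random picks for each label,
# everything else is what is left over.
kept = shuffled[rank <= 1].sort_index()
rest = shuffled[rank > 1].sort_index()

print(kept)
print(rest)
```

Splitting on rank this way gives both halves at once, so the same shuffle can serve either reading of the question: keeping 2 rows per label or removing them.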
And here it is with sklearn's StratifiedKFold (the example is taken from a linked page):
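from sklearn.model_selection import StratifiedKFold

X = df[[0, 1, 2, 3, 4]]
y = df.label
skf = StratifiedKFold(n_splits=2)
# split() yields positional indices, so select rows with iloc
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

# printed after the loop, so this shows the training rows of the last split
print(X_train)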
          0         1         2         3         4
0  0.656240  0.904032  0.256067  0.916293  0.262773
1  0.526509  0.555683  0.667756  0.208831  0.699438
4  0.096499  0.688737  0.328670  0.260733  0.834091
5  0.320150  0.602197  0.793404  0.911291  0.269915
8  0.913669  0.171831  0.534418  0.862583  0.994561
9  0.718337  0.256351  0.348813  0.420952  0.622890
print(y_train)
0    1
1    1
4    2
5    2
8    3
9    3
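Without shuffling, StratifiedKFold takes the rows of each label in their original order, so the split is stratified but not random. A possible variation, sketched here on made-up data (12 rows, 4 per label, which is an assumption), passes shuffle=True and a random_state, and checks that every test fold holds the same number of rows per label.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical data: 5 feature columns and 4 rows for each of 3 labels.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((12, 5)))
df["label"] = np.repeat([1, 2, 3], 4)

X = df[[0, 1, 2, 3, 4]]
y = df["label"]

# shuffle=True randomizes which rows of each label land in which fold.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

fold_label_counts = []
for train_pos, test_pos in skf.split(X, y):
    # The positions returned by split() are positional, hence iloc.
    y_test = y.iloc[test_pos]
    fold_label_counts.append(y_test.value_counts().sort_index().tolist())

# One list per fold: rows per label (1, 2, 3) in that fold's test set.
print(fold_label_counts)
```

With 4 rows per label and 2 splits, each test fold should contain 2 rows of every label, which matches the "2 rows from each label" requirement.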