How To Obtain Reproducible But Distinct Instances Of GroupKFold
Solution 1:
KFold is only randomized if shuffle=True. Some datasets should not be shuffled. GroupKFold is not randomized at all. Hence the random_state=None. GroupShuffleSplit may be closer to what you're looking for.
A comparison of the group-based splitters:
- In GroupKFold, the test sets form a complete partition of all the data.
- LeavePGroupsOut leaves all possible subsets of P groups out, combinatorially; test sets will overlap for P > 1. Since this means "n_groups choose P" splits altogether, often you want a small P, and most often want LeaveOneGroupOut, which is basically the same as GroupKFold with n_splits equal to the number of groups.
- GroupShuffleSplit makes no statement about the relationship between successive test sets; each train/test split is performed independently.
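The properties in this list can be checked on a small example. The following is a minimal sketch on made-up toy data, assuming a scikit-learn version that provides these three splitters in sklearn.model_selection.

```python
from math import comb

import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, LeavePGroupsOut

# Toy data, invented for this example: 10 samples in 8 groups.
X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])
n_groups = len(np.unique(groups))

# GroupKFold: the test sets are disjoint and together cover every sample.
kfold_tests = [test for _, test in GroupKFold(n_splits=3).split(X, y, groups)]
covered = np.sort(np.concatenate(kfold_tests))
print("GroupKFold test sets partition the data:",
      np.array_equal(covered, np.arange(len(y))))

# LeavePGroupsOut: one split per subset of P groups, i.e. comb(n_groups, P).
n_lpgo = LeavePGroupsOut(n_groups=2).get_n_splits(X, y, groups)
print("LeavePGroupsOut splits:", n_lpgo, "expected:", comb(n_groups, 2))

# LeaveOneGroupOut and GroupKFold(n_splits=n_groups) yield the same test sets,
# possibly in a different order.
logo_tests = {tuple(t.tolist()) for _, t in LeaveOneGroupOut().split(X, y, groups)}
gkf_tests = {tuple(t.tolist())
             for _, t in GroupKFold(n_splits=n_groups).split(X, y, groups)}
print("Same test sets:", logo_tests == gkf_tests)
```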
As an aside, Dmytro Lituiev has proposed an alternative GroupShuffleSplit algorithm which is better at getting the right number of samples (not merely the right number of groups) in the test set for a specified test_size.
Solution 2:
Inspired by user0's answer (can't comment) but faster:
import numpy as np
import pandas as pd


def RandomGroupKFold_split(groups, n, seed=None):  # noqa: N802
    """
    Random analogue of sklearn.model_selection.GroupKFold.split.

    :return: list of (train, test) indices
    """
    groups = pd.Series(groups)
    ix = np.arange(len(groups))
    unique = np.unique(groups)
    # Shuffle the unique group labels with a local, seeded RNG
    np.random.RandomState(seed).shuffle(unique)
    result = []
    # Each chunk of shuffled groups becomes the test set of one fold
    for split in np.array_split(unique, n):
        mask = groups.isin(split).to_numpy()
        train, test = ix[~mask], ix[mask]
        result.append((train, test))
    return result
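A list of (train, test) index pairs like the one returned above can be passed directly as the cv argument of scikit-learn's model-evaluation helpers. The sketch below shows this with cross_val_score; the helper random_group_splits is a hypothetical NumPy-only variant of the same idea, and the data and estimator are placeholders chosen for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def random_group_splits(groups, n_splits, seed=None):
    """Hypothetical helper: shuffle the groups, then cut them into n_splits folds."""
    groups = np.asarray(groups)
    unique = np.unique(groups)
    np.random.RandomState(seed).shuffle(unique)
    ix = np.arange(len(groups))
    splits = []
    for fold_groups in np.array_split(unique, n_splits):
        mask = np.isin(groups, fold_groups)
        splits.append((ix[~mask], ix[mask]))
    return splits


# Placeholder data: 10 groups of 4 samples, both classes present in every group.
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 3))
y = np.tile([0, 1], 20)
groups = np.repeat(np.arange(10), 4)

# The same seed always gives the same folds; a different seed reshuffles the groups.
splits = random_group_splits(groups, n_splits=3, seed=4)
scores = cross_val_score(LogisticRegression(), X, y, cv=splits)
print(scores)
```

Because the splits are materialised as a list, the same folds can be reused across several estimators for a like-for-like comparison.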
Solution 3:
My solution so far has been to simply split the groups at random. This could lead to very unbalanced folds (which I think GroupKFold was designed to ward off), but the hope is that the number of observations per group is small.
import sys

import numpy as np
from numpy.random import RandomState

# The seed is read from the command line, e.g. "python script.py 4"
random_state = int(sys.argv[1])
X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

for el in zip(range(len(y)), X, y, groups):
    print("ix, X, y, groups", el)


def RandGroupKfold(groups, n_splits, random_state=None, shuffle_groups=False):
    ix = np.arange(len(groups))
    unique_groups = np.unique(groups)
    if shuffle_groups:
        # Shuffle the group labels with a local, seeded RNG
        prng = RandomState(random_state)
        prng.shuffle(unique_groups)
    # Cut the (shuffled) groups into n_splits chunks, one per test fold
    splits = np.array_split(unique_groups, n_splits)
    train_test_indices = []
    for split in splits:
        mask = np.isin(groups, split)
        train = ix[~mask]
        test = ix[mask]
        train_test_indices.append((train, test))
    return train_test_indices


splits = RandGroupKfold(groups, n_splits=3, random_state=random_state, shuffle_groups=True)
for train, test in splits:
    print("---")
    for el in zip(train, X[train], y[train], groups[train]):
        print("train ix, X, y, groups", el)
    for el in zip(test, X[test], y[test], groups[test]):
        print("test ix, X, y, groups", el)
Data:
ix, X, y, groups (0, array([0, 1]), 0, 0)
ix, X, y, groups (1, array([2, 3]), 1, 0)
ix, X, y, groups (2, array([4, 5]), 2, 0)
ix, X, y, groups (3, array([6, 7]), 3, 1)
ix, X, y, groups (4, array([8, 9]), 4, 2)
ix, X, y, groups (5, array([10, 11]), 5, 3)
ix, X, y, groups (6, array([12, 13]), 6, 4)
ix, X, y, groups (7, array([14, 15]), 7, 5)
ix, X, y, groups (8, array([16, 17]), 8, 6)
ix, X, y, groups (9, array([18, 19]), 9, 7)
Random state as 4
---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
test ix, X, y, groups (5, array([10, 11]), 5, 3)
test ix, X, y, groups (6, array([12, 13]), 6, 4)
test ix, X, y, groups (9, array([18, 19]), 9, 7)
---
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (0, array([0, 1]), 0, 0)
test ix, X, y, groups (1, array([2, 3]), 1, 0)
test ix, X, y, groups (2, array([4, 5]), 2, 0)
test ix, X, y, groups (3, array([6, 7]), 3, 1)
test ix, X, y, groups (7, array([14, 15]), 7, 5)
---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (4, array([8, 9]), 4, 2)
test ix, X, y, groups (8, array([16, 17]), 8, 6)
Random state as 5
---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
test ix, X, y, groups (4, array([8, 9]), 4, 2)
test ix, X, y, groups (6, array([12, 13]), 6, 4)
test ix, X, y, groups (9, array([18, 19]), 9, 7)
---
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (0, array([0, 1]), 0, 0)
test ix, X, y, groups (1, array([2, 3]), 1, 0)
test ix, X, y, groups (2, array([4, 5]), 2, 0)
test ix, X, y, groups (3, array([6, 7]), 3, 1)
test ix, X, y, groups (7, array([14, 15]), 7, 5)
---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (5, array([10, 11]), 5, 3)
test ix, X, y, groups (8, array([16, 17]), 8, 6)
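The runs above give test folds of 3, 5 and 2 samples, which illustrates the imbalance this answer warns about. One possible way to reduce it, shown here only as a sketch and not as part of the original answer, is to keep the random shuffle but assign groups greedily to the currently smallest fold, largest groups first. The function name balanced_random_group_splits is hypothetical.

```python
import numpy as np


def balanced_random_group_splits(groups, n_splits, seed=None):
    """Hypothetical sketch: seeded, size-balanced assignment of groups to folds."""
    rng = np.random.RandomState(seed)
    groups = np.asarray(groups)
    unique, inverse, counts = np.unique(
        groups, return_inverse=True, return_counts=True
    )
    # Shuffle first so that ties between equally sized groups break randomly,
    # then stable-sort the shuffled groups by decreasing size.
    perm = rng.permutation(len(unique))
    order = perm[np.argsort(-counts[perm], kind="stable")]
    fold_sizes = np.zeros(n_splits, dtype=int)
    fold_of_group = np.empty(len(unique), dtype=int)
    for i in order:
        lightest = int(np.argmin(fold_sizes))  # fold with the fewest samples so far
        fold_sizes[lightest] += counts[i]
        fold_of_group[i] = lightest
    fold_of_sample = fold_of_group[inverse]
    ix = np.arange(len(groups))
    return [(ix[fold_of_sample != f], ix[fold_of_sample == f])
            for f in range(n_splits)]


groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])
for seed in (4, 5):
    splits = balanced_random_group_splits(groups, n_splits=3, seed=seed)
    print(seed, [len(test) for _, test in splits],
          [groups[test].tolist() for _, test in splits])
```

The trade-off is that large groups are always placed first, so the randomness mostly affects how groups of equal size are distributed.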
Solution 4:
Subclass the splitter and implement a random_state-dependent _iter_test_masks( ... ) method, as documented in the source of the scikit-learn base class. The random_state parameter passed at instantiation ( .__init__() ) is "just" stored; whether and how it is used in generating the test masks is left to the user's creativity ( as literally expressed in the scikit-learn source comments ):
(cit.:)
# Since subclasses must implement either _iter_test_masks or
# _iter_test_indices, neither can be abstract.
def _iter_test_masks(self, X=None, y=None, groups=None):
    """Generates boolean masks corresponding to test sets.

    By default, delegates to _iter_test_indices(X, y, groups)
    """
    for test_index in self._iter_test_indices(X, y, groups):
        test_mask = np.zeros(_num_samples(X), dtype=np.bool)
        test_mask[test_index] = True
        yield test_mask
A process that depends on an externally provided random_state != None should also, as a matter of fair practice, protect the global RNG: save its current state ( RNG_stateTUPLE = numpy.random.get_state() ), seed it with the value provided through the .__init__() calling interface and, once finished, restore the RNG state from the saved one ( numpy.random.set_state( RNG_stateTUPLE ) ).
This way such a custom process gets both the required dependence on the random_state value and reproducibility.
Q.E.D.
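A minimal sketch of this idea follows. It assumes that BaseCrossValidator from sklearn.model_selection calls the private hook _iter_test_indices from its split() method, as the quoted source suggests; since that hook is not public API, it may change between releases. The class name SeededGroupKFold is hypothetical, and the folds are formed by simple chunking of the shuffled groups rather than by GroupKFold's size balancing.

```python
import numpy as np
from sklearn.model_selection import BaseCrossValidator


class SeededGroupKFold(BaseCrossValidator):  # hypothetical name
    """Group-wise K-fold whose fold assignment depends on random_state."""

    def __init__(self, n_splits=3, random_state=None):
        self.n_splits = n_splits
        self.random_state = random_state

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def _iter_test_indices(self, X=None, y=None, groups=None):
        groups = np.asarray(groups)
        unique_groups = np.unique(groups)
        if self.random_state is not None:
            saved_state = np.random.get_state()  # save the global RNG state
            try:
                np.random.seed(self.random_state)  # apply the stored seed
                np.random.shuffle(unique_groups)
            finally:
                np.random.set_state(saved_state)  # restore the saved state
        # Each chunk of (shuffled) groups forms the test set of one fold.
        for fold_groups in np.array_split(unique_groups, self.n_splits):
            yield np.flatnonzero(np.isin(groups, fold_groups))


X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

for seed in (4, 5):
    cv = SeededGroupKFold(n_splits=3, random_state=seed)
    print(seed, [groups[test].tolist() for _, test in cv.split(X, y, groups)])
```

Using a local numpy.random.RandomState(seed) instance instead of the global RNG would make the save/restore step unnecessary; the version above follows the approach described in this answer.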
Solution 5:
I wanted to combine group k-fold with stratification, i.e. keep the same proportion of classes in the train and test sets. So I ran stratified k-fold over the groups, so that the same ratio of classes is maintained in the folds, and then used the groups to place the samples into the folds. I also passed the random seed to the stratified splitter to solve the different-splits issue.
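import numpy as np
from sklearn.model_selection import StratifiedKFold


def Stratified_Group_KFold(Y, groups, n, seed=None):
    Y = np.asarray(Y)
    groups = np.asarray(groups)
    # first_index[i] is the position of the first sample of group unique[i]
    unique, first_index = np.unique(groups, return_index=True)
    # Assumes one class per group: take the label of each group's first sample
    group_Y = Y[first_index]
    group_X = np.zeros(len(unique))  # placeholder; features are not used for splitting
    skf_group = StratifiedKFold(n_splits=n, random_state=seed, shuffle=True)
    result = []
    for train_index, test_index in skf_group.split(group_X, group_Y):
        train_groups_in_fold = unique[train_index]
        test_groups_in_fold = unique[test_index]
        # Map the selected groups back to sample indices
        train = np.isin(groups, train_groups_in_fold).nonzero()[0]
        test = np.isin(groups, test_groups_in_fold).nonzero()[0]
        result.append((train, test))
    return result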