How To Obtain Reproducible But Distinct Instances Of GroupKFold
Solution 1:
KFold is only randomized if shuffle=True; some datasets should not be shuffled. GroupKFold is not randomized at all, hence the random_state=None. GroupShuffleSplit may be closer to what you're looking for.
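A minimal sketch of that option, using a made-up toy dataset: passing a different random_state to GroupShuffleSplit yields reproducible but distinct group-wise splits.

# Sketch: GroupShuffleSplit keeps whole groups together and is seeded via
# random_state, so every seed gives a reproducible but distinct split.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(20).reshape(10, 2)                     # toy features
y = np.arange(10)                                    # toy targets
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])    # toy group labels

for seed in (4, 5):
    gss = GroupShuffleSplit(n_splits=3, test_size=0.3, random_state=seed)
    for train, test in gss.split(X, y, groups):
        print("seed", seed, "test groups:", np.unique(groups[test]))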
A comparison of the group-based splitters:
- In GroupKFold, the test sets form a complete partition of all the data.
- LeavePGroupsOut leaves out all possible subsets of P groups, combinatorially; test sets will overlap for P > 1. Since this means (n_groups choose P) splits altogether, you often want a small P, and most often want LeaveOneGroupOut, which is equivalent to LeavePGroupsOut with P = 1 (and to GroupKFold with n_splits equal to the number of groups).
- GroupShuffleSplit makes no statement about the relationship between successive test sets; each train/test split is performed independently.
As an aside, Dmytro Lituiev has proposed an alternative GroupShuffleSplit algorithm which is better at getting the right number of samples (not merely the right number of groups) in the test set for a specified test_size.
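For illustration only, here is a rough sketch of that general idea (this is not his actual algorithm, just the principle): keep adding shuffled groups to the test set until the sample count, rather than the group count, reaches the requested test_size fraction.

import numpy as np

def group_shuffle_split_by_samples(groups, test_size=0.2, seed=None):
    """Sketch: pick whole groups for the test set until roughly test_size
    of the samples (not of the groups) ends up there."""
    groups = np.asarray(groups)
    rng = np.random.RandomState(seed)
    unique_groups = np.unique(groups)
    rng.shuffle(unique_groups)
    n_target = int(round(test_size * len(groups)))
    test_groups, n_test = [], 0
    for g in unique_groups:
        if n_test >= n_target:
            break
        test_groups.append(g)
        n_test += int(np.sum(groups == g))
    test_mask = np.isin(groups, test_groups)
    ix = np.arange(len(groups))
    return ix[~test_mask], ix[test_mask]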
Solution 2:
Inspired by user0's answer (can't comment) but faster:
import numpy as np
import pandas as pd


def RandomGroupKFold_split(groups, n, seed=None):  # noqa: N802
    """
    Random analogue of sklearn.model_selection.GroupKFold.split.

    :return: list of (train, test) indices
    """
    groups = pd.Series(groups)
    ix = np.arange(len(groups))
    unique = np.unique(groups)
    np.random.RandomState(seed).shuffle(unique)
    result = []
    # Each chunk of shuffled groups becomes one fold's test set.
    for split in np.array_split(unique, n):
        mask = groups.isin(split)
        train, test = ix[~mask], ix[mask]
        result.append((train, test))
    return result
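A quick usage sketch with a made-up groups list: the same seed reproduces the same folds, while a different seed gives distinct ones.

groups = [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]   # hypothetical group labels

folds_a = RandomGroupKFold_split(groups, n=3, seed=42)
folds_b = RandomGroupKFold_split(groups, n=3, seed=42)
folds_c = RandomGroupKFold_split(groups, n=3, seed=7)

# Same seed -> identical folds; a different seed -> (usually) different folds.
assert all(np.array_equal(a[1], b[1]) for a, b in zip(folds_a, folds_b))
for train, test in folds_c:
    print("test groups:", np.unique(np.asarray(groups)[test]))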
Solution 3:
My solution so far has been simply to split the groups at random. This can lead to very unbalanced fold sizes (which I think GroupKFold was designed to guard against), but the hope is that the number of observations per group is small.
from sklearn.utils import shuffle
from sklearn.model_selection import GroupKFold
from numpy.random import RandomState
import numpy as np
import sys

random_state = int(sys.argv[1])

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

for el in zip(range(len(y)), X, y, groups):
    print("ix, X, y, groups", el)


def RandGroupKfold(groups, n_splits, random_state=None, shuffle_groups=False):
    ix = np.arange(len(groups))
    unique_groups = np.unique(groups)
    if shuffle_groups:
        prng = RandomState(random_state)
        prng.shuffle(unique_groups)
    # Split the (optionally shuffled) groups into n_splits test chunks.
    splits = np.array_split(unique_groups, n_splits)
    train_test_indices = []
    for split in splits:
        mask = np.isin(groups, split)
        train = ix[~mask]
        test = ix[mask]
        train_test_indices.append((train, test))
    return train_test_indices


splits = RandGroupKfold(groups, n_splits=3, random_state=random_state, shuffle_groups=True)

for train, test in splits:
    print("---")
    for el in zip(train, X[train], y[train], groups[train]):
        print("train ix, X, y, groups", el)
    for el in zip(test, X[test], y[test], groups[test]):
        print("test ix, X, y, groups", el)
Data:
ix, X, y, groups (0, array([0, 1]), 0, 0)
ix, X, y, groups (1, array([2, 3]), 1, 0)
ix, X, y, groups (2, array([4, 5]), 2, 0)
ix, X, y, groups (3, array([6, 7]), 3, 1)
ix, X, y, groups (4, array([8, 9]), 4, 2)
ix, X, y, groups (5, array([10, 11]), 5, 3)
ix, X, y, groups (6, array([12, 13]), 6, 4)
ix, X, y, groups (7, array([14, 15]), 7, 5)
ix, X, y, groups (8, array([16, 17]), 8, 6)
ix, X, y, groups (9, array([18, 19]), 9, 7)
Random state as 4
---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
test ix, X, y, groups (5, array([10, 11]), 5, 3)
test ix, X, y, groups (6, array([12, 13]), 6, 4)
test ix, X, y, groups (9, array([18, 19]), 9, 7)
---
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (0, array([0, 1]), 0, 0)
test ix, X, y, groups (1, array([2, 3]), 1, 0)
test ix, X, y, groups (2, array([4, 5]), 2, 0)
test ix, X, y, groups (3, array([6, 7]), 3, 1)
test ix, X, y, groups (7, array([14, 15]), 7, 5)
---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (4, array([8, 9]), 4, 2)
test ix, X, y, groups (8, array([16, 17]), 8, 6)
Random state as 5
---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
test ix, X, y, groups (4, array([8, 9]), 4, 2)
test ix, X, y, groups (6, array([12, 13]), 6, 4)
test ix, X, y, groups (9, array([18, 19]), 9, 7)
---
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (0, array([0, 1]), 0, 0)
test ix, X, y, groups (1, array([2, 3]), 1, 0)
test ix, X, y, groups (2, array([4, 5]), 2, 0)
test ix, X, y, groups (3, array([6, 7]), 3, 1)
test ix, X, y, groups (7, array([14, 15]), 7, 5)
---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (5, array([10, 11]), 5, 3)
test ix, X, y, groups (8, array([16, 17]), 8, 6)
Solution 4:
Subclass and implement a random_state-dependent _iter_test_masks( ... random_state=None ) method, as self-documented in the scikit-learn super(...)'s source. The random_state parameter, supplied at instantiation ( .__init__() ), is "just" stored and left to the user's creativity as to whether or not it is used in some customised manner for test_mask generation ( as literally expressed in the scikit-learn source comments ):
(cit.:)
# Since subclasses must implement either _iter_test_masks or
# _iter_test_indices, neither can be abstract.

def _iter_test_masks(self, X=None, y=None, groups=None):
    """Generates boolean masks corresponding to test sets.

    By default, delegates to _iter_test_indices(X, y, groups)
    """
    for test_index in self._iter_test_indices(X, y, groups):
        test_mask = np.zeros(_num_samples(X), dtype=np.bool)
        test_mask[test_index] = True
        yield test_mask
A process that depends on an externally supplied random_state != None ought also, as a matter of fair practice, to protect the global RNG: save its current state ( RNG_stateTUPLE = numpy.random.get_state() ), set the state provided through the .__init__() calling interface, and, once finished, restore the saved state ( numpy.random.set_state( RNG_stateTUPLE ) ).
This way such a custom process gets both the required dependence on a random_state value and reproducibility.
Q.E.D.
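For concreteness, a minimal sketch of such a subclass (an illustration under assumptions, not scikit-learn's own implementation): it shuffles whole groups with the stored random_state and deliberately does not reproduce GroupKFold's fold-size balancing. Using a local RandomState via check_random_state also sidesteps the global-RNG save/restore bookkeeping described above.

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.utils import check_random_state


class RandomGroupKFold(GroupKFold):
    """Sketch: a GroupKFold whose folds depend on random_state."""

    def __init__(self, n_splits=5, random_state=None):
        super().__init__(n_splits=n_splits)
        self.random_state = random_state  # "just" stored, as discussed above

    def _iter_test_masks(self, X=None, y=None, groups=None):
        rng = check_random_state(self.random_state)  # leaves the global RNG untouched
        unique_groups = np.unique(groups)
        rng.shuffle(unique_groups)
        # One chunk of shuffled groups per fold; no fold-size balancing attempted.
        for chunk in np.array_split(unique_groups, self.n_splits):
            yield np.isin(groups, chunk)


# Usage sketch: reproducible for a fixed random_state, distinct across seeds.
# cv = RandomGroupKFold(n_splits=3, random_state=4)
# folds = list(cv.split(X, y, groups))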
Solution 5:
I wanted to combine group k-fold with keeping the same proportion of classes in the train and test sets. So I ran stratified k-fold over the groups, so that the same ratio of classes is maintained across folds, and then used the groups to place samples in the folds. I also pass the random seed to the stratified splitter to solve the different-splits issue.
import numpy as np
from sklearn.model_selection import StratifiedKFold


def Stratified_Group_KFold(Y, groups, n, seed=None):
    groups = list(groups)
    unique = np.unique(groups)
    # One label per group (assumes all samples in a group share the same class).
    group_Y = []
    for group in unique:
        y = Y[groups.index(group)]
        group_Y.append(y)
    group_X = np.zeros_like(unique)
    skf_group = StratifiedKFold(n_splits=n, random_state=seed, shuffle=True)
    result = []
    # Stratify over the groups, then map group folds back to sample indices.
    for train_index, test_index in skf_group.split(group_X, group_Y):
        train_groups_in_fold = unique[train_index]
        test_groups_in_fold = unique[test_index]
        train = np.in1d(groups, train_groups_in_fold).nonzero()[0]
        test = np.in1d(groups, test_groups_in_fold).nonzero()[0]
        result.append((train, test))
    return result
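A quick usage sketch with made-up toy data (labels are constant within each group, which this helper implicitly assumes). Note also that recent scikit-learn releases ship a built-in StratifiedGroupKFold splitter that covers much of the same ground.

# Hypothetical toy data: 10 samples, 8 groups, binary labels constant per group.
Y = np.array([0, 0, 0, 1, 1, 0, 1, 0, 1, 1])
groups = [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]

for train, test in Stratified_Group_KFold(Y, groups, n=3, seed=0):
    print("test groups:", np.unique(np.asarray(groups)[test]),
          "| test class ratio:", Y[test].mean())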