
How To Obtain Reproducible But Distinct Instances Of GroupKFold

In the GroupKFold source, the random_state is set to None:

def __init__(self, n_splits=3):
    super(GroupKFold, self).__init__(n_splits, shuffle=False,
                                     random_state=None)

So how can one obtain reproducible but distinct instances of GroupKFold?

Solution 1:

  • KFold is only randomized if shuffle=True. Some datasets should not be shuffled.
  • GroupKFold is not randomized at all. Hence the random_state=None.
  • GroupShuffleSplit may be closer to what you're looking for.

A comparison of the group-based splitters:

  • In GroupKFold, the test sets form a complete partition of all the data.
  • LeavePGroupsOut leaves all possible subsets of P groups out, combinatorially; test sets will overlap for P > 1. Since this means "n_groups choose P" splits altogether, you often want a small P, and most often want LeaveOneGroupOut, which is basically the same as GroupKFold with one fold per group (k equal to the number of groups).
  • GroupShuffleSplit makes no statement about the relationship between successive test sets; each train/test split is performed independently.

As an aside, Dmytro Lituiev has proposed an alternative GroupShuffleSplit algorithm which is better at getting the right number of samples (not merely the right number of groups) in the test set for a specified test_size.
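
For reference, here is a minimal sketch (not from the original answer) of how varying random_state in GroupShuffleSplit gives splits that are reproducible but distinct; the toy X, y and groups are illustrative only.

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

# Each random_state gives a different but reproducible split; samples sharing
# a group never end up in both train and test.
for state in (0, 1, 2):
    gss = GroupShuffleSplit(n_splits=2, test_size=0.3, random_state=state)
    for train_idx, test_idx in gss.split(X, y, groups):
        print("state", state, "test groups:", np.unique(groups[test_idx]))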

Solution 2:

Inspired by user0's answer (can't comment) but faster:

import numpy as np
import pandas as pd


def RandomGroupKFold_split(groups, n, seed=None):  # noqa: N802
    """
    Random analogue of sklearn.model_selection.GroupKFold.split.

    :return: list of (train, test) indices
    """
    groups = pd.Series(groups)
    ix = np.arange(len(groups))
    unique = np.unique(groups)
    np.random.RandomState(seed).shuffle(unique)
    result = []
    for split in np.array_split(unique, n):
        mask = groups.isin(split)
        train, test = ix[~mask], ix[mask]
        result.append((train, test))

    return result
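
A short usage sketch (the sample data here is illustrative, not from the answer): the same seed reproduces the same folds, while a different seed gives different ones.

groups = [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]

splits_a = RandomGroupKFold_split(groups, n=3, seed=42)
splits_b = RandomGroupKFold_split(groups, n=3, seed=42)
splits_c = RandomGroupKFold_split(groups, n=3, seed=7)

# Identical seeds -> identical folds; a different seed -> (usually) different folds.
assert all(np.array_equal(a[1], b[1]) for a, b in zip(splits_a, splits_b))
for train, test in splits_c:
    print("train:", train, "test:", test)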

Solution 3:

My solution so far has been to simply split the groups at random. This could lead to very unbalanced folds (which I think GroupKFold was designed to guard against), but the hope is that the number of observations per group is small.

from numpy.random import RandomState
import numpy as np
import sys

random_state = int(sys.argv[1])

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

for el in zip(range(len(y)), X, y, groups):
    print("ix, X, y, groups", el)


def RandGroupKfold(groups, n_splits, random_state=None, shuffle_groups=False):
    ix = np.array(range(len(groups)))
    unique_groups = np.unique(groups)
    if shuffle_groups:
        prng = RandomState(random_state)
        prng.shuffle(unique_groups)
    splits = np.array_split(unique_groups, n_splits)
    train_test_indices = []

    for split in splits:
        mask = [el in split for el in groups]
        train = ix[np.invert(mask)]
        test = ix[mask]
        train_test_indices.append((train, test))
    return train_test_indices


splits = RandGroupKfold(groups, n_splits=3, random_state=random_state, shuffle_groups=True)

for train, test in splits:
    print("---")
    for el in zip(train, X[train], y[train], groups[train]):
        print("train ix, X, y, groups", el)
    for el in zip(test, X[test], y[test], groups[test]):
        print("test ix, X, y, groups", el)

Data:

ix, X, y, groups (0, array([0, 1]), 0, 0)
ix, X, y, groups (1, array([2, 3]), 1, 0)
ix, X, y, groups (2, array([4, 5]), 2, 0)
ix, X, y, groups (3, array([6, 7]), 3, 1)
ix, X, y, groups (4, array([8, 9]), 4, 2)
ix, X, y, groups (5, array([10, 11]), 5, 3)
ix, X, y, groups (6, array([12, 13]), 6, 4)
ix, X, y, groups (7, array([14, 15]), 7, 5)
ix, X, y, groups (8, array([16, 17]), 8, 6)
ix, X, y, groups (9, array([18, 19]), 9, 7)

Random state = 4:

---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
test ix, X, y, groups (5, array([10, 11]), 5, 3)
test ix, X, y, groups (6, array([12, 13]), 6, 4)
test ix, X, y, groups (9, array([18, 19]), 9, 7)
---
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (0, array([0, 1]), 0, 0)
test ix, X, y, groups (1, array([2, 3]), 1, 0)
test ix, X, y, groups (2, array([4, 5]), 2, 0)
test ix, X, y, groups (3, array([6, 7]), 3, 1)
test ix, X, y, groups (7, array([14, 15]), 7, 5)
---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (4, array([8, 9]), 4, 2)
test ix, X, y, groups (8, array([16, 17]), 8, 6)

Random state = 5:

---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
test ix, X, y, groups (4, array([8, 9]), 4, 2)
test ix, X, y, groups (6, array([12, 13]), 6, 4)
test ix, X, y, groups (9, array([18, 19]), 9, 7)
---
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (0, array([0, 1]), 0, 0)
test ix, X, y, groups (1, array([2, 3]), 1, 0)
test ix, X, y, groups (2, array([4, 5]), 2, 0)
test ix, X, y, groups (3, array([6, 7]), 3, 1)
test ix, X, y, groups (7, array([14, 15]), 7, 5)
---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (5, array([10, 11]), 5, 3)
test ix, X, y, groups (8, array([16, 17]), 8, 6)
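
As a quick sanity check (not part of the original answer), the same random_state reproduces identical folds, and the test folds together cover every index exactly once:

splits_a = RandGroupKfold(groups, n_splits=3, random_state=4, shuffle_groups=True)
splits_b = RandGroupKfold(groups, n_splits=3, random_state=4, shuffle_groups=True)

# Reproducible: the same random_state yields the same test folds.
assert all(np.array_equal(a[1], b[1]) for a, b in zip(splits_a, splits_b))

# The test folds form a partition of all sample indices.
all_test = np.sort(np.concatenate([test for _, test in splits_a]))
assert np.array_equal(all_test, np.arange(len(groups)))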

Solution 4:

Subclass and implement

a random_state-dependent _iter_test_masks( ... random_state=None ) method, as self-documented in the scikit-learn base-class source. The random_state parameter passed at instantiation ( .__init__() ) is "just" stored and left to the user's creativity: it may or may not be used, in any customised manner, for test-mask generation (as the scikit-learn source comments literally state):

(cit.:)

# Since subclasses must implement either _iter_test_masks or
# _iter_test_indices, neither can be abstract.
def _iter_test_masks(self, X=None, y=None, groups=None):
    """Generates boolean masks corresponding to test sets.

    By default, delegates to _iter_test_indices(X, y, groups)
    """
    for test_index in self._iter_test_indices(X, y, groups):
        test_mask = np.zeros(_num_samples(X), dtype=np.bool)
        test_mask[test_index] = True
        yield test_mask

A process that depends on an externally provided random_state != None ought, as a matter of good practice, also protect the global RNG: save its current state ( RNG_stateTUPLE = numpy.random.get_state() ), set the state derived from the value passed to .__init__(), and, once finished, restore the saved state ( numpy.random.set_state( RNG_stateTUPLE ) ).
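
A minimal sketch of that save / seed / restore pattern (the helper name seeded_call is hypothetical, not from the answer):

import numpy as np

def seeded_call(seed, fn, *args, **kwargs):
    # Save the global RNG state, seed it for the reproducible step,
    # then restore the previous state so callers are unaffected.
    saved_state = np.random.get_state()
    try:
        np.random.seed(seed)
        return fn(*args, **kwargs)
    finally:
        np.random.set_state(saved_state)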

This way such a custom process gets both the required dependence on a random_state value and reproducibility. Q.E.D.
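
Putting the pieces together, here is a sketch of such a subclass (the class name RandomizedGroupKFold is made up). It hooks into scikit-learn's private _iter_test_indices machinery, an internal API that may change between versions, so treat it as illustrative rather than definitive:

import numpy as np
from sklearn.model_selection import GroupKFold


class RandomizedGroupKFold(GroupKFold):
    """GroupKFold-like splitter whose group-to-fold assignment depends on random_state."""

    def __init__(self, n_splits=5, random_state=None):
        super().__init__(n_splits=n_splits)
        self.random_state = random_state  # stored and used for test-index generation

    def _iter_test_indices(self, X=None, y=None, groups=None):
        if groups is None:
            raise ValueError("The 'groups' parameter should not be None.")
        # A local RandomState keeps the global RNG state untouched.
        rng = np.random.RandomState(self.random_state)
        unique_groups = np.unique(groups)
        rng.shuffle(unique_groups)
        # Assign the shuffled groups to n_splits folds and yield sample indices.
        for fold_groups in np.array_split(unique_groups, self.n_splits):
            yield np.flatnonzero(np.isin(groups, fold_groups))

With a fixed random_state the splits are reproducible; different values give different, yet still group-disjoint, folds.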

Solution 5:

I wanted to combine group k-fold with keeping the same proportion of classes in the train and test sets. So I ran stratified k-fold over the groups, so that the class ratio is maintained across folds, and then used the group assignments to place the samples into folds. I also pass the random seed to the stratified splitter to address the reproducible-but-distinct-splits issue.

import numpy as np
from sklearn.model_selection import StratifiedKFold


def Stratified_Group_KFold(Y, groups, n, seed=None):
    # One class label per group (assumes all samples of a group share a class;
    # groups is expected to be a list, since list.index is used below).
    unique = np.unique(groups)
    group_Y = []
    for group in unique:
        y = Y[groups.index(group)]
        group_Y.append(y)

    # Dummy features: StratifiedKFold only needs one row per group.
    group_X = np.zeros_like(unique)
    skf_group = StratifiedKFold(n_splits=n, random_state=seed, shuffle=True)

    result = []
    for train_index, test_index in skf_group.split(group_X, group_Y):
        train_groups_in_fold = unique[train_index]
        test_groups_in_fold = unique[test_index]

        # Map the group-level folds back to sample indices.
        train = np.in1d(groups, train_groups_in_fold).nonzero()[0]
        test = np.in1d(groups, test_groups_in_fold).nonzero()[0]

        result.append((train, test))

    return result
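
A brief usage sketch (the sample Y and groups are illustrative; groups is passed as a list because the function relies on list.index):

Y = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 1])
groups = [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]

for train, test in Stratified_Group_KFold(Y, groups, n=3, seed=0):
    print("train groups:", np.unique(np.asarray(groups)[train]),
          "test groups:", np.unique(np.asarray(groups)[test]))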
