The Python Oracle

Why does calling the KFold generator with shuffle give the same indices?

Full question
https://stackoverflow.com/questions/3494...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #scikitlearn #crossvalidation




ACCEPTED ANSWER

Score 6


Iterating again over the same KFold object does not reshuffle the indices; shuffling happens only once, when the object is instantiated. KFold() never sees the data, but it knows the number of samples, so it shuffles an array of that many indices. From the KFold constructor:

if shuffle:
    rng = check_random_state(self.random_state)
    rng.shuffle(self.idxs)

Each time you iterate over the object to get the indices for each fold, it reuses those same shuffled indices and divides them the same way.

Take a look at the code for KFold's base class, _PartitionIterator(with_metaclass(ABCMeta)), where __iter__ is defined. The base-class __iter__ relies on KFold's _iter_test_indices to divide the indices and yield the train and test indices for each fold.
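
The behaviour described above can be imitated with a small standalone sketch. This is not scikit-learn's code: the class name ShuffleOnceKFold, its seed parameter, and the way folds are sized are assumptions made for illustration. It only shows why shuffling in the constructor makes every later iteration yield the same folds.

```python
import random


class ShuffleOnceKFold:
    """Toy splitter (hypothetical, not scikit-learn code) that shuffles in __init__."""

    def __init__(self, n, n_folds=3, shuffle=False, seed=None):
        self.n = n
        self.n_folds = n_folds
        self.idxs = list(range(n))
        if shuffle:
            # The shuffle happens here, once, when the object is created.
            random.Random(seed).shuffle(self.idxs)

    def _iter_test_indices(self):
        # Cut the stored order into n_folds nearly equal chunks.
        base, extra = divmod(self.n, self.n_folds)
        start = 0
        for fold in range(self.n_folds):
            stop = start + base + (1 if fold < extra else 0)
            yield self.idxs[start:stop]
            start = stop

    def __iter__(self):
        # Each iteration reuses self.idxs, so the folds never change.
        for test in self._iter_test_indices():
            held_out = set(test)
            train = [i for i in range(self.n) if i not in held_out]
            yield train, sorted(test)


kf = ShuffleOnceKFold(4, n_folds=2, shuffle=True)
first_round = list(kf)
second_round = list(kf)
print(first_round == second_round)
```

Creating a second ShuffleOnceKFold object is the only way to get a different shuffle in this sketch, which mirrors the point of the answer.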




ANSWER 2

Score 0


With newer versions of sklearn, where KFold is imported with from sklearn.model_selection import KFold, a KFold created with shuffle=True and no random_state gives different indices on each call to split():

import numpy as np
from sklearn.model_selection import KFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=3, shuffle=True)

print('---first round----')
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    
print('---second round----')
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

Out:

---first round----
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1 3] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
---second round----
TRAIN: [0 1] TEST: [2 3]
TRAIN: [1 2 3] TEST: [0]
TRAIN: [0 2 3] TEST: [1]

Alternatively, passing a fixed random_state makes every call to split() generate the same result:

import numpy as np
from sklearn.model_selection import KFold
np.random.seed(42)
data = np.random.choice([0, 1], 10, p=[0.5, 0.5])
# An integer random_state fixes the shuffle used by split().
kf = KFold(2, shuffle=True, random_state=2022)
list(kf.split(data))

Out:

[(array([0, 1, 6, 8, 9]), array([2, 3, 4, 5, 7])),
 (array([2, 3, 4, 5, 7]), array([0, 1, 6, 8, 9]))]
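
As a quick check of that claim, the sketch below calls split() twice on one KFold that has a fixed random_state and compares the folds. The helper folds_match and the np.arange(10) data are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data; only its length matters to KFold.
data = np.arange(10)


def folds_match(a, b):
    """Return True if two lists of (train, test) index pairs are identical."""
    return len(a) == len(b) and all(
        np.array_equal(tr_a, tr_b) and np.array_equal(te_a, te_b)
        for (tr_a, te_a), (tr_b, te_b) in zip(a, b)
    )


kf_fixed = KFold(n_splits=2, shuffle=True, random_state=2022)
first = list(kf_fixed.split(data))
second = list(kf_fixed.split(data))

# With an integer random_state, both calls should produce the same folds.
same_with_seed = folds_match(first, second)
print(same_with_seed)
```

Running the same comparison on a KFold built without random_state is a simple way to see the contrast with the first example, although two unseeded shuffles can occasionally coincide on very small inputs.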