When should I shuffle in StratifiedKFold?


Full question
https://stackoverflow.com/questions/5961...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #machinelearning #scikitlearn #neuralnetwork #crossvalidation

ANSWER 1

Score 2

You are correct that shuffling time series data will inflate the accuracy. Shuffling puts samples into the training set that are very similar to samples in the test set.

For example, if you train a model on 2010-2019 and then predict on 2020, every test sample is separated in time from the training period, so no information leaks. Now let's say there was an extreme event in 2020 and you shuffle the data. The training set will now contain samples of that extreme event from some sensors, so the model learns it and then predicts a similar label for the other sensors during that period in the test set. This is a leakage of information between the training and test sets.
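The leakage described above can be made visible with a small sketch. The data here is hypothetical (one sample per year and sensor, 5 sensors, years 2010 to 2020) and is only meant to illustrate the idea, not to reproduce the asker's dataset:

```python
import random

# Hypothetical data: one sample per (year, sensor) pair.
samples = [(year, sensor) for year in range(2010, 2021) for sensor in range(5)]

# Chronological split: train on 2010-2019, test on 2020.
train_chrono = [s for s in samples if s[0] < 2020]
test_chrono = [s for s in samples if s[0] == 2020]

# Shuffled split with the same test size.
rng = random.Random(0)
shuffled = samples[:]
rng.shuffle(shuffled)
test_shuf = shuffled[:len(test_chrono)]
train_shuf = shuffled[len(test_chrono):]

def shared_years(train, test):
    """Years that appear in both the training set and the test set."""
    return {year for year, _ in train} & {year for year, _ in test}

# The chronological split shares no years between train and test.
print("chronological:", shared_years(train_chrono, test_chrono))
# After shuffling, test years also show up in the training set.
print("shuffled:", sorted(shared_years(train_shuf, test_shuf)))
```

Any year listed for the shuffled split is a period the model has already seen through other sensors, which is exactly where an extreme event could leak into training.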

ACCEPTED ANSWER

Score 0

Your question is quite tricky and is probably better placed here.

"In my time series dataset of size 921 * 10080 where each row is a time series of water temperature of a particular location in an area and the last column being the label with 2 groups"

Aren't you treating this as a classification problem with time series features? You are using dependent variables (the time series of the water temperature) to predict a label. To me this sounds risky, and I would assume there is not a good chance of predicting the label. Here is one scenario to think about:

Location  Time1  Time2  Time3  Label
A         3      2      1      1
B         100    99     98     1
C         98     99     100    0

In this example, label 1 is a time series that goes down and label 0 is a time series that goes up. I would bet that every classifier has a problem learning this unless something connects the columns through their trend component.
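As a rough illustration of that point, the toy table can be turned into a sketch that compares the raw level of each series with a simple trend feature. The `trend` helper (last value minus first value) and the decision rule are assumptions made for this example, not something from the question:

```python
# Toy rows from the table above: name -> (time series, label).
rows = {
    "A": ([3, 2, 1], 1),
    "B": ([100, 99, 98], 1),
    "C": ([98, 99, 100], 0),
}

def mean_level(series):
    """Average value of the series, ignoring the order of the columns."""
    return sum(series) / len(series)

def trend(series):
    """Hypothetical trend feature: last value minus first value."""
    return series[-1] - series[0]

def predict_from_trend(series):
    # Falling series -> label 1, rising series -> label 0 (rule read off the toy table).
    return 1 if trend(series) < 0 else 0

for name, (series, label) in rows.items():
    print(name, mean_level(series), trend(series), predict_from_trend(series), label)
```

B and C have the same mean level but different labels, so the level alone cannot separate them, while the sign of the trend does. A classifier that sees each time step as an unrelated column has to discover that relationship on its own.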

To come back to your question, this can help you understand shuffling: "difference between StratifiedKFold and StratifiedShuffleSplit in sklearn".
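A minimal sketch of that difference, assuming scikit-learn and NumPy are installed and using a made-up dataset of 20 samples with two balanced classes:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

# Hypothetical data: 20 samples, 10 per class.
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 10 + [1] * 10)

# StratifiedKFold with shuffle=True: rows are shuffled within each class
# before folding, but the test folds still partition the data.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
kfold_tests = [test for _, test in skf.split(X, y)]

# StratifiedShuffleSplit: every split is an independent random draw,
# so the same sample may land in several test sets.
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
shuffle_tests = [test for _, test in sss.split(X, y)]

kfold_seen = np.concatenate(kfold_tests)
shuffle_seen = np.concatenate(shuffle_tests)

print("KFold test indices cover each sample once:",
      sorted(kfold_seen.tolist()) == list(range(20)))
print("distinct samples tested by ShuffleSplit:", len(set(shuffle_seen.tolist())))
```

Both splitters keep the class ratio in every test set. Neither respects time order once shuffling is involved, so for time-ordered samples the leakage concern from the other answer applies to both.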