When should I shuffle in StratifiedKFold

Full question
https://stackoverflow.com/questions/5961...

Accepted answer links:
[here]: https://stats.stackexchange.com
[difference between StratifiedKFold and StratifiedShuffleSplit in sklearn]: https://stackoverflow.com/questions/4596...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #machinelearning #scikitlearn #neuralnetwork #crossvalidation

ANSWER 1

Score 2

When working with time series data, you are correct that shuffling will inflate the accuracy. The reason is that shuffling before the split puts samples into the training set that are very similar to samples in the test set.

For example, if you trained a model on 2010-2019 and then predicted on 2020, every test sample would be separated in time from the training period, so no information would leak. Now let's say there was an extreme event in 2020 and you shuffle the data. The training set will now contain samples of that extreme event from some sensors, so the model learns the label that goes with it and then predicts a similar label for the other sensors during that period in the test set. This is a leakage of information between the training and test sets.
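The leakage described above can be made visible with a small sketch. The sensor names, the year range, and the helper `years_shared` are hypothetical and only stand in for the kind of data discussed here; the point is just to compare which years end up on both sides of the split.

```python
import random

# Hypothetical toy data: one sample per (sensor, year) pair.
sensors = ["s1", "s2", "s3", "s4", "s5"]
years = list(range(2010, 2021))
samples = [(sensor, year) for year in years for sensor in sensors]

def years_shared(train, test):
    """Return the years that occur in both the training and the test set."""
    return sorted({year for _, year in train} & {year for _, year in test})

# Chronological split: train on 2010-2019, test on 2020.
chrono_train = [s for s in samples if s[1] < 2020]
chrono_test = [s for s in samples if s[1] == 2020]

# Shuffled split: hold out a random fifth of the rows, ignoring time.
rng = random.Random(0)
shuffled = samples[:]
rng.shuffle(shuffled)
n_test = len(samples) // 5
shuffled_test, shuffled_train = shuffled[:n_test], shuffled[n_test:]

print("chronological split, shared years:", years_shared(chrono_train, chrono_test))
print("shuffled split, shared years:     ", years_shared(shuffled_train, shuffled_test))
```

With the chronological split no year appears on both sides. With the shuffled split, the same year shows up in training (for some sensors) and in testing (for others), which is the situation where an event in that year can leak.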

ACCEPTED ANSWER

Score 0

Your question is quite tricky, and it is probably better placed here (https://stats.stackexchange.com).

You wrote: "In my times series dataset of size 921 * 10080 where each row is a time series of water temperature of a particular location in an area and the last column being the label with 2 groups"

Aren't you solving a classification problem with time series features? You are using dependent variables (the time series of the water temperature) to predict a label. To me this sounds risky, and I would assume that the chances of predicting the label well are not good. Just one scenario to think about:

Location  Time1  Time2  Time3  Label
A         3      2      1      1
B         100    99     98     1
C         98     99     100    0

So in this example label 1 is a time series that goes down and label 0 is a time series that goes up, but I would bet that any classifier will struggle to learn this unless it can connect the columns through their trend.
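One way to picture "connecting the trending component" is to compute the trend explicitly as a feature. The sketch below is only an illustration on the three rows from the table, not something the question or this answer prescribes: the `slope` helper and the rule "negative slope means label 1" are assumptions made for this toy example.

```python
# The three example rows from the table: (Time1, Time2, Time3) and the label.
rows = {
    "A": ([3, 2, 1], 1),
    "B": ([100, 99, 98], 1),
    "C": ([98, 99, 100], 0),
}

def slope(values):
    """Least-squares slope of `values` against the time index 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

# Assumed toy rule: a falling series gets label 1, a rising one label 0.
predictions = {}
for name, (series, label) in rows.items():
    predictions[name] = 1 if slope(series) < 0 else 0
    print(name, "slope:", slope(series), "predicted:", predictions[name], "actual:", label)
```

The raw columns of B and C are nearly identical in value, yet their labels differ. The slope separates them immediately, which is why a classifier that sees each column in isolation has a hard time here.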

To come back to your question, this can help you to understand shuffling: difference between StratifiedKFold and StratifiedShuffleSplit in sklearn