Is it wise to split training and test datasets based on time / year?

data_wrangling

#1

Hi,

I have a dataset for 6 years, i.e. 2009-2015. Would it be wise to partition the dataset by year, so that years before 2013 form the training dataset and the rest form the test dataset? My model's accuracy on the test dataset is much lower when I partition by year. Please advise.

Regards
Balaji SR


#2

Hi @BALAJI_SR,

While creating train and test data, the best option is to take a random sample from the original data without any conditions.
For example, if you are using R, you can call set.seed(n) before sampling so that the random selection is reproducible.
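
For anyone not using R, the same idea can be sketched in Python with only the standard library. This is a minimal sketch, not a definitive recipe: the function name random_split, the 30% test fraction, the seed value and the toy (year, id) rows are all assumptions made for illustration. The fixed seed plays the same role as set.seed(n) in R, making the split reproducible.

```python
import random


def random_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows with a fixed seed, then cut off a test set.

    test_fraction and seed are illustrative defaults, not recommendations.
    """
    rng = random.Random(seed)          # private generator, so the split is reproducible
    shuffled = list(rows)              # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)


# Hypothetical data: ten records per year, 2009 through 2015.
rows = [(year, i) for year in range(2009, 2016) for i in range(10)]
train, test = random_split(rows)
```

Calling random_split(rows) again with the same seed returns the same partition, which makes results comparable between runs.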


#3

@BALAJI_SR, it depends on what you’re trying to do with the data. If you want to predict outcomes from 2013 and later based on “past” data (pre-2013), then that would be the way to go. (However, it is advisable to also split the pre-2013 data into train and cross-validation (CV) sets.) If the time of the outcome is not important, then you could do a (pseudo-)random split on the response variable, to ensure a good mix in both the train and test sets.
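
A rough sketch of the time-based variant, in plain Python: everything before 2013 is used for fitting, with the last pre-2013 year held back as a CV set, and 2013 onwards is kept for testing. The function name time_split, the choice of 2012 as the CV year and the toy rows are assumptions for illustration only; the right CV cutoff depends on your data.

```python
def time_split(rows, cv_start=2012, test_start=2013):
    """Partition rows by year into train, CV and test sets.

    Years before cv_start go to train, years from cv_start up to
    (but not including) test_start go to CV, and the rest go to test.
    The default cutoffs are assumptions chosen to match a 2009-2015 dataset.
    """
    train = [r for r in rows if r["year"] < cv_start]
    cv = [r for r in rows if cv_start <= r["year"] < test_start]
    test = [r for r in rows if r["year"] >= test_start]
    return train, cv, test


# Hypothetical data: three records per year, 2009 through 2015.
rows = [{"year": y, "id": i} for y in range(2009, 2016) for i in range(3)]
train, cv, test = time_split(rows)
```

Because the CV set is also later in time than the training set, the CV score gives some indication of how the model copes with unseen years before the test set is touched.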


#4

@BALAJI_SR

The first question you need to answer is whether you expect any time-based variation in what you are trying to predict.

Case 1: Time based output

In most cases, the answer would be yes for such a long duration. It is hard to think of anything you might be interested in predicting that would not change over time. If the answer to the above question is yes, then it would always be a bad idea to split test and train using a time-based (month / year) variable. You would unnecessarily make your model blind to an important factor.

Let us understand this with an example:

  • You are trying to predict demand / supply of a product based on several variables - Almost all products go through a cycle, from early adoption through gaining momentum to dying off. If you keep your model blind to any of these periods, you are making a gross error.

So, it is pretty clear that if what you are predicting is time dependent in any form, a time-based split is bad practice.

Case 2: Seemingly time independent output

Now, let us take another example where you may not see significant time-based changes in the outcome variable. In this case too, dividing test and train on time is a bad idea. Why? Because some of your inputs might still depend on time, and your model will not be able to capture the right relationships if it is entirely blind to a time period. For example, let us say you are trying to estimate unemployment in a country, which has remained stable over the years. But that stability might be driven by rising employment in one sector and declining employment in another. So, if you keep your model blind to a time period, it would result in a bad model.

So, the only scenario where you can use a time-based split is when none of your inputs or outputs are related to time - which is hardly ever the case.
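
To see which periods a given split leaves the model blind to, you can simply list the years that never appear in the training data. Below is a small Python sketch of that check; the helper name missing_years, the 70/30 proportion, the seed and the toy (year, id) rows are all assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical data: twenty records per year, 2009 through 2015.
years = list(range(2009, 2016))
rows = [(y, i) for y in years for i in range(20)]

# Time-based split: train only on years before 2013.
time_train = [r for r in rows if r[0] < 2013]

# Random split: shuffle with a fixed seed and keep 70% for training.
rng = random.Random(0)
shuffled = rows[:]
rng.shuffle(shuffled)
random_train = shuffled[: len(shuffled) * 7 // 10]


def missing_years(train_rows, all_years):
    """Return the years that have no rows at all in the training data."""
    seen = Counter(year for year, _ in train_rows)
    return [y for y in all_years if seen[y] == 0]


blind_time = missing_years(time_train, years)      # [2013, 2014, 2015]
blind_random = missing_years(random_train, years)  # years missed by the random split, if any
```

With the time-based split, the training data contains nothing from 2013 onwards, whereas a random split of this size is very unlikely to miss a whole year.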

Also, if you are forced to make a time-based split, I would use the recent data for training and the older data for testing, so that the model captures recent trends better.

Hope this helps.

Regards,
Kunal