Should the train and test data be combined before running an algorithm?

data_wrangling

#1

Hello,

While working on a classification problem, should we append the test data to the train data to get more records on which to train the model?
The output of the model trained on the combined data could then be used for prediction.
I am not sure whether this is a good idea, so I would appreciate any help on this!


#2

@shuvayan Wouldn’t that defeat the purpose? You would be using the test set to calibrate the model, which would then be used to predict the test set values again. I think it would be better to split the training set into a smaller train set and a test/CV set. To get a variety of data, one way is to split the train set into smaller pieces, keep one of them as the test set, and merge the rest into the new training set. Repeat this, keeping a different piece as the test/CV set each time, until every piece has been held out once. The exact proportions of the pieces will, of course, depend on the original size of the training set. If you measure the AUC of each of these models, the average AUC, for example, gives you an idea of how well the modelling approach works.

This is something I learnt while participating in an edX/Kaggle contest. Please let me know if I’m mistaken about this process.
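
The splitting scheme described in this post is usually called k-fold cross-validation. Below is a minimal sketch of it using only the Python standard library, so that the mechanics are visible. The synthetic one-feature data set, the helper names (`k_fold_splits`, `auc`, `fit_toy_model`) and the toy model itself are assumptions made for illustration, not anything from the contest mentioned above. In practice scikit-learn's `KFold` and `cross_val_score` cover the same ground.

```python
import random
from statistics import mean


def k_fold_splits(n_rows, k, seed=0):
    """Yield (train_idx, val_idx) pairs so that every row is held out exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal pieces
    for held_out in range(k):
        val_idx = folds[held_out]
        train_idx = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train_idx, val_idx


def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_toy_model(xs, ys):
    """Hypothetical stand-in for a real classifier: score each row along
    the direction in which the positive class mean lies."""
    pos_mean = mean([x for x, y in zip(xs, ys) if y == 1])
    neg_mean = mean([x for x, y in zip(xs, ys) if y == 0])
    direction = 1.0 if pos_mean >= neg_mean else -1.0
    return lambda x: direction * x


# Synthetic training set (assumption): one feature, two balanced classes.
rng = random.Random(42)
y = [i % 2 for i in range(100)]
X = [rng.gauss(2.0 * label, 1.0) for label in y]

fold_aucs = []
for train_idx, val_idx in k_fold_splits(len(X), k=5):
    # Fit only on the merged training pieces, score only on the held-out piece.
    model = fit_toy_model([X[i] for i in train_idx], [y[i] for i in train_idx])
    scores = [model(X[i]) for i in val_idx]
    fold_aucs.append(auc([y[i] for i in val_idx], scores))

print("AUC per fold:", [round(a, 3) for a in fold_aucs])
print("Mean AUC:", round(mean(fold_aucs), 3))
```

The real test set is never touched here, so it stays available as an independent check once a model has been chosen.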


#3

Hi @anon,

Actually, I realised the futility of the question 10 minutes after posting it, but could not remove the question. :stuck_out_tongue:
You are right about splitting the train data itself for better model performance. I am on it now. Let's see what comes out.
Thanks. :smile:


#4

To build a model, you should use the training data to train and the test data to test.
But after you have tested and chosen the best-performing model, you should usually combine the training and test data and train your model again on all the data.

You have already tested and concluded that your chosen way of building the model is correct. Now you just train it in the same way but with more data, which should lead to a slightly better model.
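
A rough sketch of that workflow, assuming a synthetic one-feature data set and two hypothetical candidate models (`fit_majority` and `fit_midpoint` are made-up names for illustration): each candidate is fit on the training rows, compared on the test rows, and the winning recipe is then refit on all rows.

```python
import random


def fit_majority(xs, ys):
    """Baseline: always predict the most common training label."""
    majority = int(sum(ys) * 2 >= len(ys))
    return lambda x: majority


def fit_midpoint(xs, ys):
    """Predict 1 when x lies above the midpoint of the two class means
    (assumes the positive class has the larger mean, as in the data below)."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: int(x > cut)


def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(ys)


# Synthetic data (assumption): 150 training rows and 50 test rows.
rng = random.Random(1)
labels = [i % 2 for i in range(200)]
feats = [rng.gauss(2.0 * label, 1.0) for label in labels]
train_x, train_y = feats[:150], labels[:150]
test_x, test_y = feats[150:], labels[150:]

# Step 1: fit each candidate on the training rows, compare on the test rows.
candidates = {"majority": fit_majority, "midpoint": fit_midpoint}
test_acc = {
    name: accuracy(fit(train_x, train_y), test_x, test_y)
    for name, fit in candidates.items()
}
best = max(test_acc, key=test_acc.get)
print("Test accuracy:", test_acc, "-> chosen:", best)

# Step 2: refit the chosen recipe on train + test combined.
final_model = candidates[best](train_x + test_x, train_y + test_y)
```

One thing to keep in mind: the test accuracy printed in step 1 belongs to the model fit on the training rows only. After the refit there is no untouched data left to score `final_model` on.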


#5

Hi Shuvayan,
I would say it's a smart move to do so. I generally create a flag for train and test and then append the two before making transformations. This makes the coding easier and generally faster. Once everything is done, I separate the two samples and work on each individually.
Regards,
Tavish
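
The flag-and-append pattern described above might look roughly like this. It is a sketch using plain lists of dicts. The column names (`city`, `income`, `label`, `is_train`) and the two transformations are assumptions chosen for illustration. With pandas the same idea is typically written with `pd.concat` and a boolean flag column.

```python
import math

# Hypothetical raw data: the test rows have no label.
train = [
    {"city": "Pune", "income": 40.0, "label": 1},
    {"city": "Delhi", "income": 55.0, "label": 0},
    {"city": "Pune", "income": 32.0, "label": 0},
]
test = [
    {"city": "Delhi", "income": 61.0},
    {"city": "Mumbai", "income": 48.0},
]

# 1. Flag each row with its origin, then append the two sets.
combined = [dict(row, is_train=True) for row in train] + [
    dict(row, is_train=False) for row in test
]

# 2. Apply every transformation once, on the combined rows.
#    Building the category codes from both sets keeps them consistent,
#    even for a city that appears only in the test data.
city_codes = {c: i for i, c in enumerate(sorted({r["city"] for r in combined}))}
for row in combined:
    row["city_code"] = city_codes[row["city"]]
    row["log_income"] = math.log(row["income"])

# 3. Separate the two samples again using the flag.
train_out = [r for r in combined if r["is_train"]]
test_out = [r for r in combined if not r["is_train"]]

print(len(train_out), "train rows,", len(test_out), "test rows")
```

Only the feature preparation is shared here. The labels are never used in step 2, and any model would still be fit on `train_out` alone.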