Why do we need a validation dataset along with train and test?


#1

Hello All,

I find it confusing how datasets are divided into partitions, for example in some competitions.

In some places I saw:

training set
validation set
test set

And in some places there is only train and test data (competitions).

Where do we use the validation dataset? Is it needed only for specific techniques?

I am aware of how to divide the sets, but I am confused about why we have to do it. Can anyone please clarify this for me?

Regards,
Raghavendra


#2

Hi Raghavendra
When you only train and test, you are essentially iterating between those two datasets: build the model using train, see how it performs on test, go back and tweak, check the test result again, and so on. In doing so you are indirectly ‘using up’ the test dataset in your model. If you submit this model, it is still untested on completely fresh, unseen data, so we don’t know how it would perform.

To get over this, when the dataset is large, OR when there are multiple competing models that appear similar in predictive performance, OR when the client wants to check your model on their own undisclosed dataset, it is advised to create three partitions instead of two. The three partitions are called Train, Validate and Test. You iterate between Train and Validate while keeping Test locked away somewhere else. Once you are satisfied that you have the best model, go to the Test sample and check its performance. It is also usually advised to test only once, as that gives the best indication of how your model will perform on unseen data.
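As a rough illustration of the Train / Validate / Test workflow described above, here is a minimal sketch in plain Python. The toy data, the 60/20/20 proportions, the helper names (`three_way_split`, `knn_predict`, `accuracy`) and the tiny nearest-neighbour model are all assumptions made up for this example; they are not from any particular competition or library.

```python
import random

def three_way_split(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of rows and cut it into train, validate and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left over
    return train, validate, test

def knn_predict(train, x, k):
    """Majority vote among the k training rows whose x is closest."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return votes * 2 > k

def accuracy(train, rows, k):
    """Fraction of rows whose predicted label matches the known label."""
    correct = sum(knn_predict(train, x, k) == label for x, label in rows)
    return correct / len(rows)

# Hypothetical toy data: the label is True when x >= 50.
data = [(x, x >= 50) for x in range(100)]
train, validate, test = three_way_split(data)

# Iterate between Train and Validate: try several settings of k and
# keep the one that scores best on the validation partition.
candidates = [1, 3, 5]
best_k = max(candidates, key=lambda k: accuracy(train, validate, k))

# Only now unlock the Test partition, and evaluate exactly once.
test_accuracy = accuracy(train, test, best_k)
print(best_k, test_accuracy)
```

The point of the sketch is the order of operations: the test partition is never consulted while choosing `best_k`, so `test_accuracy` is measured on data the model selection step has not seen.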

Hope this helps
Regards


#3

@raghava_r4u,

In some competitions, you will find only the test and training datasets. These competitions require you to submit your predictions for the test dataset; they then calculate your accuracy and give you the result. There is no way for you to check your accuracy on the test dataset before submitting.

In other competitions you may be provided with a validation dataset, or be required to create one from the training dataset. Its purpose is to let you check the accuracy of your model before predicting for the test dataset: you already have the output (what you want to predict) for the validation set, so you just need to compare it with the output your model predicts for that set.

So you use validation data to estimate how well your model has been trained (which depends on the size of your data, the value you would like to predict, the inputs, etc.) and to estimate model properties (mean error for numeric predictors, classification error for classifiers, recall and precision for IR models, etc.).
This intermediate step on the validation set helps to avoid problems such as overfitting on the training data.
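To make the comparison concrete, here is a small sketch of how the known validation outputs can be compared with a model's predictions. The label and prediction lists are invented for illustration, and the helper names `classification_report` and `mean_absolute_error` are hypothetical, not taken from a library.

```python
def classification_report(actual, predicted):
    """Accuracy, precision and recall for boolean labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)        # true positives
    fp = sum((not a) and p for a, p in pairs)  # false positives
    fn = sum(a and (not p) for a, p in pairs)  # false negatives
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

def mean_absolute_error(actual, predicted):
    """Average absolute difference, for numeric predictions."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up validation labels and model predictions (classifier case).
val_actual = [True, True, False, False, True, False, True, False]
val_predicted = [True, False, False, True, True, False, True, False]
report = classification_report(val_actual, val_predicted)
print(report)  # accuracy, precision and recall are all 0.75 here

# Made-up validation targets and predictions (numeric case).
mae = mean_absolute_error([3.0, 5.0, 2.0, 8.0], [2.5, 5.0, 3.0, 7.0])
print(mae)  # 0.625
```

If the same metric is much better on the training data than on the validation data, that gap is a common sign of overfitting.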

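After checking your accuracy on the validation dataset, you can reasonably expect a similar accuracy on your test dataset (unseen data) as well, as long as both come from the same kind of data and you have not tuned the model so heavily that it overfits the validation set itself.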

Hope this helps!


#4

Thanks a lot for the elaborate replies. I got the point. Nice interacting with you all. Thanks once again to @manizoya_1 & @Aditya_Sharma.


#5

Thanks for sharing this information. I have one doubt here: as it is essential to divide the data into test, train and validate, could you let us know what percentage of the data should be given to test, train and validate respectively?

Thanks
Shouib M


#6

@ShouibM It depends on the person and the task at hand. But many practitioners prefer a 70-30 or 80-20 split of the training data into a new training set and a validation set.
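For what it is worth, a 70-30 or 80-20 split like this can be sketched in a few lines of plain Python. The function name `holdout_split`, its parameters and the 1000 placeholder rows are assumptions for illustration only.

```python
import random

def holdout_split(rows, validation_fraction=0.2, seed=0):
    """Shuffle a copy of rows, then hold out a fraction for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    n_val = round(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]  # (new training set, validation set)

# Placeholder for the labelled training data supplied to you.
labelled_rows = list(range(1000))

train_80, val_20 = holdout_split(labelled_rows, validation_fraction=0.2)
train_70, val_30 = holdout_split(labelled_rows, validation_fraction=0.3)
print(len(train_80), len(val_20))  # 800 200
print(len(train_70), len(val_30))  # 700 300
```

Shuffling before cutting matters when the rows are stored in some order (by date or by class, say), although for time-ordered problems a chronological cut may be the more honest choice.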

As for the test data: in real-world problems, test data is only anticipated; you don't get it in advance. In competitions, however, you are generally provided with a separate test set.