In which situations should we decide to use regularization in our model building?

over_fitting
regularization
machine_learning

#1

Hi,

Regularization helps reduce the problem of overfitting the model to our data, but how do we know that our model will overfit? In other words, how do we know when to apply regularization methods to our data? What indications can we get from our data that would suggest using regularization methods?

Thanks.


#2

Hi Mukesh,

To know if an algorithm is overfitting the data, we split the data into training and testing sets. If the model's accuracy differs by more than about 4% (not a strict threshold) between the training and testing sets, then it's a problem of overfitting. In this case you have to use regularization.
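
To make that check concrete, here is a minimal sketch in R, assuming a hypothetical data frame df with a factor outcome column target (both names are placeholders, not from this thread):

library(rpart)

set.seed(1)
idx   <- sample(nrow(df), floor(0.7 * nrow(df)))   # random 70:30 split
train <- df[idx, ]
test  <- df[-idx, ]

fit <- rpart(target ~ ., data = train)

# accuracy on both sets; a large gap between them suggests overfitting
acc <- function(d) mean(predict(fit, d, type = "class") == d$target)
cat("train:", acc(train), " test:", acc(test), "\n")

If the training accuracy is much higher than the test accuracy, regularization (or pruning, for trees) is worth trying.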

Regards,
Aayush


#3

Hi @mukesh,

Whenever you are facing one of these situations: a large number of variables or a low ratio of the number of observations to the number of variables (including the n ≪ p case), high collinearity, looking for a sparse solution (i.e., embedding feature selection in the estimation of the model parameters), or accounting for variable grouping in a high-dimensional data set.
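
For instance, the sparse-solution / n ≪ p case is what the lasso handles: the L1 penalty shrinks most coefficients to exactly zero, so feature selection happens as part of the fit. A rough sketch with glmnet on simulated data (nothing here refers to any particular dataset):

library(glmnet)

set.seed(42)
n <- 50; p <- 200                      # far more variables than observations
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + rnorm(n)    # only two variables truly matter

# alpha = 1 is the lasso; cv.glmnet chooses lambda by cross-validation
cvfit <- cv.glmnet(x, y, alpha = 1)
coef(cvfit, s = "lambda.min")          # most coefficients come out exactly 0

Setting alpha between 0 and 1 gives the elastic net, which tends to behave better when groups of predictors are highly correlated.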


#4

@aayushmnit - I have a question here. When you say the testing data-set, are you referring to new data, or to splitting the data-set into training and testing (say 70:30) by means of sampling?

I ask this question because when I split the data-set into training and testing by sampling, there is no overfitting and the variation is well within 3%. Whereas a completely new data-set behaves strangely, with a lot of variation. I tried cross-validation as well, but there was no improvement.


#5

Hi @karthe1 ,

You are right; in the first one I was talking about splitting the training data 70:30.

For your query -
In the first observation, it seems like you did a 70:30 split and trained on the 70% dataset, but then kept tuning the model to fit the testing dataset, which led to indirect usage of the 30% part as well.

But in the second observation, I think when you say cross-validation you mean a 60-20-20 split of the training dataset (training / testing / cross-validation). If everything is still within the 3% range there, then your new test data probably contains observations with large variations. In this case your best bet is to go with the cross-validated model.
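
To make the cross-validated option concrete, here is a rough sketch with caret (the data frame my_train and the outcome target are placeholders, and method = "glmnet" assumes the glmnet package is installed):

library(caret)

set.seed(10)
ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
cv_model <- train(target ~ ., data = my_train,
                  method = "glmnet",              # penalized regression
                  trControl = ctrl)
cv_model$bestTune                                 # alpha / lambda picked by CV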

Regards,
Aayush Agrawal


#6

@aayushmnit, thanks for the clarifications.

I was trying in both the observations that you mentioned.
i.e., first - I split the dataset into training and testing (split by sampling), which produced quite good results for both test and training. Now I understand that I was indirectly using the same data in both places, which resulted in the good results.

Second - the test data set is new data, which gives lower accuracy compared to the training data. Doing cross-validation didn't reduce the variance much.

And when you say you split the training data 70:30, how do you do it? Do you just partition the data randomly? I believe you don't split by the sampling method.

Thanks again for the help


#7

Hi Karthe,

I use the caret package to do the splitting for me. I think it creates a stratified sample. Please find below the code to do so -

library(caret)

set.seed(25)
# stratified 75:25 split on the outcome variable geniq
idx <- createDataPartition(y = train_raw_split_final_1$geniq, p = 0.75, list = FALSE)
train_model <- train_raw_split_final_1[idx, ]
test_model  <- train_raw_split_final_1[-idx, ]

p = 0.75 means a 75% / 25% split.

Hope this helps.

Regards,
Aayush Agrawal


#8

@aayushmnit, thanks again for the code. I was using some other simple sampling code taken from somewhere :wink:
I will try yours and also post the code I used soon.

Thanks again.