Overfitting Problem in Random Forest

r
random_forest

#1

Hi Guys,

I am trying to solve a regression problem using the R caret package. I have tried multiple algorithms and found that most of them overfit the data. Can somebody suggest ways to overcome this, apart from k-fold cross-validation and tuning the mtry parameter in random forest?

I have seen there is something called Regularized Random Forest, but I couldn’t find any example of it. Also, how can I do pruning and significance testing with random forest in the caret package so that the model becomes more robust?

Any help with sample code will be appreciated!

Regards,
Aayush


#2

Have you tried the RRF package from CRAN?

It looks like exactly what you need. Here is the link to the manual:


#3

Hi @kunal,

Thanks for replying. The Regularized Random Forest in the R caret package uses the same “RRF” package you suggested. My only question is about the three tuning parameters it exposes:

  1. mtry - how many predictors are sampled at random as candidates at each split
  2. coefReg - regularization value
  3. coefImp - importance coefficient

For the last two tuning parameters, coefReg and coefImp, I have no clue what they mean or which values I should test to reach an optimized solution. For example, coefReg is always between 0 and 1, but what does coefImp actually do, and what are its ideal values?

Please help me in understanding these parameters.

Regards,
Aayush Agrawal


#4

@aayushmnit.

Did you manage to solve this problem?
I also face an overfitting problem in random forest. I tried k-fold CV, classwt (increasing the cost of misclassification), mtry, etc. Nothing seems to improve, and I have not been able to overcome the problem.


#5

@karthe1 Actually, I found that Regularized Random Forest is computationally very expensive, and I was not able to run it. So I went ahead and reduced the granularity of the model by increasing nodesize to 60.


#6

@aayushmnit,

Thanks for the reply. I also wanted to ask how you choose nodesize, maxnodes, mtry, etc.
I learned from one of your other posts how to choose the number of trees (the ntree parameter) using the plot command. Is there anything similar available for the other three parameters as well?
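For readers who have not seen the ntree trick mentioned above, here is a minimal sketch of the idea, assuming the randomForest package and synthetic data invented purely for illustration. For a regression forest, the fitted object stores the out-of-bag MSE after each tree, so you can see where the error flattens out.

```r
library(randomForest)

set.seed(7)
# Synthetic regression data (an assumption, not the poster's data)
x <- data.frame(matrix(rnorm(300 * 6), ncol = 6))   # columns X1..X6
y <- x$X1 - x$X3 + rnorm(300)

fit <- randomForest(x = x, y = y, ntree = 500)

# For regression forests, fit$mse holds the OOB mean squared error
# after 1, 2, ..., ntree trees.
oob_mse <- fit$mse
checkpoints <- c(10, 50, 100, 250, 500)
print(data.frame(ntree = checkpoints, oob_mse = oob_mse[checkpoints]))
```

Calling plot(fit) draws the same curve; a reasonable ntree is roughly where it stops decreasing.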

Thank you.


#7

@aayushmnit - I used tuneRF to find the optimum mtry, but that alone is not improving the model :(
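For reference, a tuneRF call of the kind described might look like the sketch below. The data is synthetic and the argument values (mtryStart, ntreeTry, stepFactor, improve) are illustrative assumptions, not recommendations.

```r
library(randomForest)

set.seed(1)
# Synthetic regression data (an assumption for illustration)
x <- data.frame(matrix(rnorm(300 * 10), ncol = 10))   # columns X1..X10
y <- x$X1 + 0.5 * x$X2 + rnorm(300)

# tuneRF steps mtry up and down from mtryStart by stepFactor and keeps
# going while the OOB error improves by at least `improve` (relative).
tuned <- tuneRF(x, y, mtryStart = 3, ntreeTry = 200, stepFactor = 1.5,
                improve = 0.01, trace = FALSE, plot = FALSE)
print(tuned)

# Pick the mtry with the lowest OOB error among the values tried
best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]
print(best_mtry)
```

Note that this search only minimizes OOB error over mtry; it does not change tree depth, which is why it may not help much when the trees themselves are too deep.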


#8

Hi @karthe1,

If it’s an overfitting problem, tuning mtry is not the best approach. The way I optimize nodesize is to first build 5 models with nodesize 0, 25, 50, 75, and 100, measure accuracy on the training and testing datasets, and see where the accuracy is roughly the same on both. Say 50 and 75 are the values where that happens; then I build models with nodesize 55, 60, 65, and 70 to home in on the optimum. One more thing: in most cases you will find the solution in the 0 - 100 range, but if the dataset is large, the best nodesize can sometimes be greater than 100.
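The coarse-then-fine search described above can be sketched roughly as follows. This is only an illustration under assumptions: the data is synthetic, RMSE stands in for "accuracy" because the original problem is regression, compare_nodesizes is a hypothetical helper name, and the grid starts at 1 rather than 0 because a terminal node needs at least one observation.

```r
library(randomForest)

set.seed(42)
# Synthetic regression data (an assumption for illustration)
n <- 600
x <- data.frame(matrix(rnorm(n * 8), ncol = 8))   # columns X1..X8
y <- 2 * x$X1 - x$X2 + rnorm(n)

train_idx <- sample(n, 400)
x_train <- x[train_idx, ]
y_train <- y[train_idx]
x_test  <- x[-train_idx, ]
y_test  <- y[-train_idx]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Hypothetical helper: fit one forest per nodesize and report train/test RMSE
compare_nodesizes <- function(sizes) {
  rows <- lapply(sizes, function(ns) {
    fit <- randomForest(x = x_train, y = y_train, ntree = 200, nodesize = ns)
    data.frame(
      nodesize   = ns,
      train_rmse = rmse(y_train, predict(fit, x_train)),
      test_rmse  = rmse(y_test,  predict(fit, x_test))
    )
  })
  do.call(rbind, rows)
}

# Coarse pass: look for the nodesize where train and test errors are closest
coarse <- compare_nodesizes(c(1, 25, 50, 75, 100))
coarse$gap <- coarse$test_rmse - coarse$train_rmse
print(coarse)

# Fine pass: if, say, 50 and 75 looked best, search between them
fine <- compare_nodesizes(c(55, 60, 65, 70))
fine$gap <- fine$test_rmse - fine$train_rmse
print(fine)
```

A small gap on its own is not enough; also check that test_rmse has not become much worse, since a very large nodesize closes the gap simply by underfitting.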

Hope this helps.

Regards,
Aayush


#9

@aayushmnit, thank you very much for this suggestion. I will try the same and let you know the outcome.

I am interested in knowing where you use mtry. What can be achieved by tuning it?

I was also trying the “classwt” parameter, the misclassification cost in random forest. It did work, but it was just moving the numbers between class 0 and class 1 of the classification problem.
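A small sketch of what trying classwt can look like, assuming the randomForest package and synthetic two-class data made up for illustration. The weights c(1, 5) are arbitrary; they are matched to the classes in the order of levels(y).

```r
library(randomForest)

set.seed(3)
# Synthetic, imbalanced two-class data (an assumption for illustration)
n <- 500
x <- data.frame(matrix(rnorm(n * 4), ncol = 4))   # columns X1..X4
y <- factor(ifelse(x$X1 + rnorm(n) > 1.2, "1", "0"))
print(table(y))

# Baseline forest, then one that up-weights class "1"
fit_plain    <- randomForest(x = x, y = y, ntree = 200)
fit_weighted <- randomForest(x = x, y = y, ntree = 200, classwt = c(1, 5))

# OOB confusion matrices: rows are true classes, the last column is
# the per-class error rate
print(fit_plain$confusion)
print(fit_weighted$confusion)
```

Comparing the two confusion matrices shows whether the weights only shift errors from one class to the other, as described above, or actually reduce the total.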

Thanks for the suggestion again.


#10

Hi @karthe1,

We can tune mtry in cases where overfitting is not the problem.

And please let me know whether changing the nodesize and classwt parameters helps you solve your problem.

Good Luck!

Regards,
Aayush


#11

@aayushmnit, Thanks for the suggestions.

Yes, both the nodesize and classwt parameters are working, but the problem is not solved yet. I am still working on improving it and will keep you updated on how things progress. Thank you again, especially for your awesome explanation of nodesize.