Why can't random forest handle a large number of categories in a categorical predictor?

r
random_forest

#1

I am working on a classification problem using random forest. When I try to fit the model, I get the following error:

m1<-randomForest(as.factor(Survived)~.,data=new_train,ntree=2000)
Error in randomForest.default(m, y, …) :
Can not handle categorical predictors with more than 53 categories.

str(train)
'data.frame': 891 obs. of 12 variables:
PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
Survived : int 0 1 1 1 0 0 0 0 1 1 ...
Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
Age : num 22 38 26 35 35 NA 54 2 27 14 ...
SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
Parch : int 0 0 0 0 0 0 0 1 2 0 ...
Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
Fare : num 7.25 71.28 7.92 53.1 8.05 ...
Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...


#2

@hinduja1234

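This is most probably a built-in limit of the randomForest package rather than a sign that your data is too big: it refuses any factor with more than 53 levels, because the number of candidate splits on a categorical predictor grows exponentially with the number of levels. In the str output you posted, Name (891 levels), Ticket (681 levels) and Cabin (148 levels) all exceed that limit.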
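
To see which columns trigger the error, you can count the levels of each factor. A minimal sketch on made-up data (the data frame `df` and its columns are hypothetical stand-ins for your `new_train`):

```r
# Toy data frame standing in for the training set (hypothetical values)
df <- data.frame(
  id   = 1:100,
  name = factor(paste0("person_", 1:100)),      # 100 levels: too many
  sex  = factor(rep(c("female", "male"), 50))   # 2 levels: fine
)

# Identify the factor columns and count the levels of each
is_fac   <- vapply(df, is.factor, logical(1))
n_levels <- vapply(df[is_fac], nlevels, integer(1))

# Columns above the 53-category limit reported in the error message
too_many <- names(n_levels)[n_levels > 53]
print(too_many)
```

Any column listed in `too_many` has to be dropped, recoded or regrouped before calling randomForest.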

Possible solutions are:

  1. Convert the categorical variable into binary dummy variables.
  2. Use the bigrf package in R, which is meant for running random forests on datasets that are too large to fit in memory.

https://cran.r-project.org/web/packages/bigrf/bigrf.pdf

  3. Try the approach in this thread, which is an implementation of the first option, although I’m not sure if it works.

http://stackoverflow.com/questions/17027675/random-forest-does-not-seem-to-handle-more-than-32-categories-of-factors-what-d
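
The dummy-variable option can be sketched with base R's `model.matrix`. This is only an illustration on made-up data (the data frame `df` and its values are hypothetical), not a tested fix for your dataset:

```r
# Toy data (hypothetical values) with one categorical predictor
df <- data.frame(
  survived = c(0, 1, 1, 0, 1),
  embarked = factor(c("S", "C", "S", "Q", "C"))
)

# One 0/1 column per level; "- 1" drops the intercept so no level is omitted
dummies <- model.matrix(~ embarked - 1, data = df)

# Replace the factor with its dummy columns
df_dummy <- cbind(df["survived"], dummies)
print(colnames(df_dummy))
```

Note that this creates one column per level, so a column like Name with 891 levels would turn into 891 columns. It is only practical for factors with a moderate number of levels.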

Hope this helps!