Running randomForest on subset data


Hi All,

I am trying to take a subset B of a larger data set A based on a condition, and then predict the bucket (Y) into which the sales in B fall, using randomForest.

Here is the code I used:
sample.ind <- sample(2, nrow(mydata), replace = TRUE, prob = c(0.70, 0.30))
crosssell_dev <- mydata[sample.ind == 1, ]
crosssell_val <- mydata[sample.ind == 2, ]
df <- subset(crosssell_dev, crosssell_dev$Account == "10185")
mydata.rf <- randomForest(formula = df$Bucket ~ ., data = df, ntree = 100, mtry = 5, importance = TRUE)

But this gives the following error: "Error in randomForest.default(m, y, ...) : Can't have empty classes in y".

Can anyone please suggest what is wrong here?


Probably your Y variable is no longer a factor variable. Try resetting levels for this variable.


Vivek, how would I do that?
Also, when I run the model on the complete data, it runs and shows the output.
I also checked the structure of the new data, and my Y is still of type factor.


You are probably losing some values of 'Y' when you subset the data. Keep in mind that the set of factor levels is an attribute of the variable, and it stays the same no matter which subset of the data you take. randomForest complains because some levels of Y probably have no observations left in your new subset, so those classes are empty.
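A quick way to check whether this is what is happening is to compare the levels of the factor with the counts per level after subsetting. The sketch below uses a small made-up data frame (toy, with hypothetical Account and Bucket values) in place of your crosssell_dev, so the names and numbers are assumptions for illustration only:

```r
# Hypothetical toy data standing in for crosssell_dev; all values are made up.
toy <- data.frame(
  Account = c("10185", "10185", "20001", "20001", "30002"),
  Bucket  = factor(c("Low", "Medium", "High", "Low", "High")),
  stringsAsFactors = FALSE
)

# Same kind of subsetting as in the question.
sub <- toy[toy$Account == "10185", ]

# The subset still carries every level of the original factor...
levels(sub$Bucket)

# ...but at least one of those levels now has a count of zero.
tab <- table(sub$Bucket)
tab

# Names of the levels that have no observations in the subset.
empty_levels <- names(tab)[tab == 0]
empty_levels
```

If empty_levels is not empty for your own df$Bucket, those are the "empty classes" the error message refers to.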

To avoid this, you can use the droplevels function:

df <- droplevels(crosssell_dev[crosssell_dev$Account=='10185',])
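To see the effect of dropping the unused levels, here is a minimal sketch on made-up data (toy and its Account and Bucket values are hypothetical, not your real data). It also shows an alternative that resets the levels of the response only, by passing it through factor() again:

```r
# Hypothetical toy data standing in for crosssell_dev; all values are made up.
toy <- data.frame(
  Account = c("10185", "10185", "20001", "20001", "30002"),
  Bucket  = factor(c("Low", "Medium", "High", "Low", "High")),
  stringsAsFactors = FALSE
)

sub <- toy[toy$Account == "10185", ]

# Option 1: drop unused levels from every factor column of the subset.
fixed_df <- droplevels(sub)

# Option 2: rebuild only the response; factor() keeps just the levels that occur.
fixed_y <- factor(sub$Bucket)

# Both now list only the levels that are present in the subset.
levels(fixed_df$Bucket)
levels(fixed_y)

# Every remaining level has at least one observation.
no_empty_classes <- all(table(fixed_df$Bucket) > 0)
no_empty_classes
```

Running the same check on your own df before calling randomForest should tell you whether any empty classes are left.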

Hope this helps!