How to Improve the Accuracy of a Decision Tree / Random Forest



Hi Team,

I am trying to solve a classification problem using a decision tree or random forest.

I have a good amount of data. The attributes are mostly categorical, with one continuous attribute, and most of the categorical attributes have too many categories. I am unable to get more than 60% accuracy, which seems low. I am stuck on how to increase the accuracy without overfitting.

If anybody has any suggestions, kindly help.

Also, I would like to know how creating dummy variables for each category of every attribute increases the accuracy (I read this on the internet).


hello @k_saurabh86,

If there are too many categories in a variable, you can combine the categories that each have less than 5% frequency, for example:

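# Convert to character first; assigning a label that is not an existing factor level would give NA
train$Style <- as.character(train$Style)
train$Style[train$Style == 'vintage'] <- 'work'
train$Style[train$Style %in% c('Sexy','bohemian')] <- 'sexy'
train$Style[train$Style %in% c('Flare','Novelty','OL','work')] <- 'Brief'
train$Style <- factor(train$Style)  # back to a factor with the merged levels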
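
If you do not want to pick the rare categories by hand, the 5% rule can be sketched as a small helper. This is only an illustration: `lump_rare_levels`, the `"Other"` label, and the toy `style` vector are made up here and are not part of your data, so adapt them to your own columns.

```r
# Hypothetical helper: merge every level whose share of rows is below
# `threshold` into a single `other_label` level.
lump_rare_levels <- function(x, threshold = 0.05, other_label = "Other") {
  x <- as.character(x)
  freq <- prop.table(table(x))            # share of rows per level
  rare <- names(freq)[freq < threshold]   # levels under the cutoff
  x[x %in% rare] <- other_label
  factor(x)
}

# Toy data standing in for train$Style (values are made up for illustration)
style <- c(rep("Casual", 50), rep("Brief", 30), rep("Sexy", 17),
           rep("vintage", 2), rep("OL", 1))
lumped <- lump_rare_levels(style)
print(table(lumped))
```

Note that this lumps all rare levels into one bucket, whereas the manual version above merges them into related categories. Which one works better depends on whether the rare categories are actually similar to each other.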

Now you can create dummy variables for the new categories. Creating dummies converts the categorical data into numeric 0/1 columns, which can help the model quantify the relationships better and may increase the accuracy.
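
As a rough sketch of the dummy-variable step, base R's `model.matrix` can build the 0/1 columns. The small `train` data frame below, including the `Price` column, is made up for illustration and only stands in for your real data.

```r
# Toy data frame; column names and values are made up for illustration
train <- data.frame(
  Style = factor(c("Brief", "sexy", "Casual", "Brief")),
  Price = c(10.5, 20.0, 15.0, 12.0)
)

# "~ Style - 1" drops the intercept so every level gets its own 0/1 column
dummies <- model.matrix(~ Style - 1, data = train)

# Combine the dummy columns with the remaining continuous attribute
train_encoded <- cbind(as.data.frame(dummies), Price = train$Price)
print(train_encoded)
```

Each row has exactly one dummy column set to 1. Whether this encoding actually improves accuracy over passing the factor directly is something to check on a held-out set.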
you can go through;
for a detailed understanding of how to deal with continuous variables.