How to avoid overfitting with classification techniques?



Hi friends,

I use classification algorithms a lot, mostly decision trees, as they are easy to deploy and my customers can understand them quickly. Another reason for using decision trees and other classification techniques is that I usually have a mix of categorical and continuous variables.

One of the major concerns I always read about and run into is overfitting with classification algorithms. Can you suggest ways to reduce the chances of overfitting and improve the performance of the model?




Which decision tree algorithm do you use?
Have you tried pruning the tree?


Hi Mark,
Many techniques are prone to overfitting, for instance gradient boosting and sometimes even SVM. Random Forest is generally expected to be among the more stable ones, since it is an ensemble of many decision trees and captures variance from a large number of variables. But when it comes to overfitting, it's not the modelling technique you should be most concerned about; it's how you validate the model. The main challenge in cross-validation is managing the trade-off between minimizing overfit and minimizing selection bias. The usual solution is k-fold cross-validation: split the data into k folds, train on k - 1 of them, test on the remaining fold, and repeat so that each fold is held out exactly once. There are plenty of resources on k-fold cross-validation. Generally, k = 10 is the suggested value to minimize both overfit and bias in the selection of the training population.
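
To make the k-fold idea concrete, here is a minimal sketch in plain Python with no libraries. Everything in it is my own assumption for illustration: the synthetic two-class data, the toy 1-nearest-neighbour classifier (picked only because it memorizes the training set, so the overfit is easy to see), and the helper names `k_fold_indices`, `predict_1nn`, `accuracy` and `cross_val_accuracy`. In practice you would more likely use a library routine such as scikit-learn's `cross_val_score` with your own tree model.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def predict_1nn(train_X, train_y, x):
    """Toy classifier: return the label of the closest training point."""
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(train_X[i], x))
    nearest = min(range(len(train_X)), key=sq_dist)
    return train_y[nearest]


def accuracy(train_X, train_y, test_X, test_y):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(predict_1nn(train_X, train_y, x) == y
               for x, y in zip(test_X, test_y))
    return hits / len(test_y)


def cross_val_accuracy(X, y, k=10, seed=0):
    """Average held-out accuracy over k folds."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        scores.append(accuracy(
            [X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in held_out], [y[i] for i in held_out],
        ))
    return sum(scores) / len(scores)


# Hypothetical data: two overlapping Gaussian classes in two dimensions.
rng = random.Random(42)
X, y = [], []
for label, center in ((0, 0.0), (1, 1.0)):
    for _ in range(100):
        X.append((rng.gauss(center, 1.0), rng.gauss(center, 1.0)))
        y.append(label)

# Scoring on the training data itself: each point is its own nearest
# neighbour, so this number says nothing about performance on new data.
train_acc = accuracy(X, y, X, y)

# 10-fold cross-validation: every point is scored by a model that never saw it.
cv_acc = cross_val_accuracy(X, y, k=10)

print(f"accuracy on training data: {train_acc:.3f}")
print(f"10-fold CV accuracy:       {cv_acc:.3f}")
```

The gap between the two printed numbers is the overfit. The same comparison works for a decision tree: if the accuracy on the training data is much higher than the cross-validated accuracy, the tree is probably too deep and needs pruning or a stricter stopping rule.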