Data Pre-processing for Decision Trees



I read somewhere that decision trees don't require data cleaning, outlier removal, normality checks, etc. Is that true? Does it apply to random forests as well?



Decision trees are generally more robust to outliers and missing values than regression techniques. This is because they work by segmenting the population: a split depends only on the ordering of a variable's values, not on how extreme they are, and missing values can be treated as a separate class of their own.

Since random forests are in turn based on decision trees, this holds true for them as well.
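
To make the outlier point concrete, here is a minimal sketch of a single split (a "stump") in plain Python. It is a toy illustration, not how any particular library implements trees; the function names, the Gini criterion and the income figures are all assumptions made up for this example. It shows that stretching the largest value into an extreme outlier leaves the chosen segmentation unchanged, because only the ordering of the values matters.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split(xs, ys):
    """Return the set of row indices sent to the left branch by the
    single-feature split with the lowest weighted Gini impurity."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_score, best_left = None, None
    for k in range(1, len(order)):
        # A threshold can only sit between two distinct values.
        if xs[order[k - 1]] == xs[order[k]]:
            continue
        left, right = order[:k], order[k:]
        score = (
            len(left) * gini([ys[i] for i in left])
            + len(right) * gini([ys[i] for i in right])
        ) / len(xs)
        if best_score is None or score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left


# Hypothetical data: income (in thousands) and a 0/1 target.
labels = [0, 0, 0, 1, 1, 1]
income = [20, 25, 30, 60, 65, 70]
income_with_outlier = [20, 25, 30, 60, 65, 70000]  # last value is now extreme

split_clean = best_split(income, labels)
split_outlier = best_split(income_with_outlier, labels)

# Same rows go left in both cases; the mean, by contrast, moves a lot.
print(split_clean == split_outlier)
print(sum(income) / len(income), sum(income_with_outlier) / len(income_with_outlier))
```

Note that this only covers outliers in the input variable. An extreme value in the target of a regression tree can still pull the leaf averages around.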

Having said that, this does not mean you can skip data pre-processing when using these algorithms. If you can combine classes of a variable (e.g. spelling variations of Gender), you should still do so.
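
As a rough sketch of what combining spelling variations might look like, the snippet below maps raw Gender entries onto a small set of canonical classes before the data goes to a tree. The alias table, the `normalize_gender` helper and the sample values are hypothetical; in practice you would build the mapping from the variants actually present in your data.

```python
# Hypothetical alias table: every known spelling maps to one canonical class.
GENDER_ALIASES = {
    "m": "male",
    "male": "male",
    "man": "male",
    "f": "female",
    "female": "female",
    "woman": "female",
}


def normalize_gender(raw):
    """Map a raw Gender entry to a canonical class.

    Missing or unrecognised entries fall into "unknown" so they stay
    visible as their own class instead of being silently merged.
    """
    if raw is None:
        return "unknown"
    key = raw.strip().lower().rstrip(".")
    return GENDER_ALIASES.get(key, "unknown")


raw_values = ["Male", "male ", "M", "F", "female", "FEMALE", "Femal", None]
cleaned = [normalize_gender(v) for v in raw_values]
print(cleaned)
```

Without this step a tree would see "Male", "male " and "M" as three separate classes and could split on differences that are purely data-entry noise. Sending unrecognised spellings such as "Femal" to "unknown" is a deliberate choice here, so they can be reviewed and added to the table rather than guessed at.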



Hi Kunal sir, thanks for the clarification.