Why do we want to remove the outliers of a feature?




I have read multiple articles and GitHub tutorials on outlier treatment. In most cases, they suggest deleting the observations that contain outlier values. Could you suggest other ways of treating outliers? These may be natural outliers rather than the result of a data entry or data collection error.


Handling Outliers


There are multiple ways to deal with outliers, but before treating them you must understand why they occur. An outlier could be due to a data processing error, data capture error, sampling error, or measurement error, or it could be a natural outlier.

If it is due to an error, we should either delete the affected observations or impute the outlier values with a relevant statistic such as the mean, median, or mode. Alternatively, we can train a model on the non-outlier observations and use it to impute the outlier values.
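As a minimal sketch of the imputation option, the snippet below replaces flagged values with the median of the remaining observations. The 1.5 × IQR rule used to flag outliers, the function name `impute_outliers_with_median`, and the sample `ages` data are assumptions made for illustration; any other detection rule could be substituted.

```python
import numpy as np

def impute_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the inlier median.

    The IQR fence is an assumed detection rule, not the only possible one.
    """
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    # Flag anything outside the fences as an outlier.
    is_outlier = (x < lower) | (x > upper)

    # Impute using the median of the non-outlier observations only.
    result = x.copy()
    result[is_outlier] = np.median(x[~is_outlier])
    return result, is_outlier

# Hypothetical data: 999 looks like a data entry error in an age column.
ages = [23, 25, 27, 29, 31, 33, 35, 999]
cleaned, mask = impute_outliers_with_median(ages)
print(cleaned)
print(mask)
```

The median is used here because, unlike the mean, it is not pulled toward the erroneous value.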

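If it is a natural outlier, we can do one of the following:

  • Develop two separate models, one for the outlier observations and one for the non-outlier observations.
  • Transform the feature (log, square root, or square, depending on the direction of the skew) to reduce the influence of extreme values.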
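The effect of the transformation option can be sketched as follows. The `incomes` array and the helper `max_to_median_ratio` are hypothetical, chosen only to show how log and square-root transforms compress a large natural outlier while keeping the observation in the data.

```python
import numpy as np

# Hypothetical right-skewed feature with one natural outlier.
incomes = np.array([20_000, 25_000, 30_000, 32_000, 40_000, 2_000_000], dtype=float)

# log1p computes log(1 + x), which stays defined when a value is 0.
log_incomes = np.log1p(incomes)
sqrt_incomes = np.sqrt(incomes)

def max_to_median_ratio(x):
    """Rough indicator of how far the largest value sits from the center."""
    return x.max() / np.median(x)

print(max_to_median_ratio(incomes))       # raw scale
print(max_to_median_ratio(sqrt_incomes))  # milder compression
print(max_to_median_ratio(log_incomes))   # strongest compression

# The log transform is invertible, so predictions can be mapped back.
restored = np.expm1(log_incomes)
```

Both transforms require non-negative inputs. Squaring works in the opposite direction, stretching large values, so it is only a candidate when the extreme values lie in the lower tail.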

For more detail, you can refer to this [article][1].

[1]: http://www.analyticsvidhya.com/blog/2015/02/outliers-detection-treatment-dataset/