Outlier treatment in R

outliers

#1

Hello,

I am not able to understand the basic methodology of outlier treatment. I know that we should treat outliers separately in the statistical model if there are a significant number of them.
Do we need to remove all outliers from a sample, or is it enough to reduce their number significantly?
What is the basic idea?

Thank you.


#2

Hello @Aditya_Sharma,

Outlier treatment is a very important part of any analytics process, and treating outliers properly is necessary for the model to be appropriate.
There are many statistical methods, such as capping/flooring, the sigma approach, exponential smoothing, Mahalanobis distance and robust regression.
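To make one of these concrete, here is a rough sketch of capping/flooring in Python (the same idea carries over to R). It assumes the caps are set at the 5th and 95th percentiles, which is only one common choice and not a rule. The function name and the sample values are made up for illustration.

```python
from statistics import quantiles

def cap_and_floor(values, n=20):
    """Clamp values to the lowest and highest of the n-quantile cut points.

    With n=20 the cut points are the 5th, 10th, ..., 95th percentiles,
    so values are floored at the 5th and capped at the 95th percentile.
    """
    cuts = quantiles(values, n=n, method="inclusive")
    lo, hi = cuts[0], cuts[-1]
    capped = [min(max(v, lo), hi) for v in values]
    return capped, lo, hi

# Hypothetical sample with one very low and one very high value.
sample = [12, 15, 14, 13, 16, 15, 14, 13, 95, 1]
capped, lo, hi = cap_and_floor(sample)
print(lo, hi)
print(capped)
```

Note that nothing is dropped here: the extreme values are pulled in to the caps, so the sample size stays the same.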
But you really need to understand the data and the ‘story’ behind the outliers. For example, take a clustering problem where you cluster a retailer's stores spread across India according to some factor such as sales per sq. ft. There might be 4-5 stores with very high values compared to the rest of the population. But on investigation you find that these stores simply have a smaller area than the others. So although you can call them outliers, it may not be wise to exclude them.
Consider another case where you are looking at the relationship between the GDP of a country and its internet penetration (users per 100 people). The plot looks like this:

[Plot: GDP vs. internet penetration, by country]

You might be tempted to think that Norway is an outlier if you look at the GDP or penetration variable alone, but taken together the two values support the hypothesis in this case: more GDP, more internet penetration.
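The Mahalanobis distance mentioned earlier is one way to put a number on this "looked at together" idea. Below is a minimal sketch for two variables in plain Python. All the figures are invented for illustration and are not real GDP or penetration data, and `mahalanobis_2d` is just a name I picked.

```python
from math import sqrt
from statistics import mean

def mahalanobis_2d(point, data):
    """Mahalanobis distance of a 2-D point from the centre of `data`."""
    xs = [p[0] for p in data]
    ys = [p[1] for p in data]
    mx, my = mean(xs), mean(ys)
    n = len(data)
    # Sample variances and covariance (divide by n - 1).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form d' * inverse(S) * d, with the 2x2 inverse written out.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return sqrt(d2)

# Hypothetical (x, y) pairs with a clear upward trend.
countries = [(10, 18), (15, 20), (20, 33), (25, 35),
             (30, 48), (35, 50), (40, 63)]

on_trend = (60, 91)    # extreme in both variables, but follows the trend
off_trend = (60, 30)   # same x, but far away from the trend

d_on = mahalanobis_2d(on_trend, countries)
d_off = mahalanobis_2d(off_trend, countries)
print(d_on, d_off)
```

The point that follows the trend comes out much closer than the one that breaks it, even though both are equally extreme on the first variable. That is the sense in which a Norway-like point can be extreme on each axis and still fit the overall story.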
But in some cases, it will be just right to drop the outliers.
So as you can see, the right treatment really depends on the business case, the story behind the data, and so on.
It is a vast topic, and even I am not very conversant with all the methods for treating outliers, but one general rule of thumb is that a data point more than 3 standard deviations (s.d.) from the mean is considered an outlier.
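As a quick sketch of that rule of thumb in Python (in R the same thing can be done with mean() and sd()): the helper name and the readings below are made up, and the cutoff of 3 is a convention that works best when the data is roughly normal.

```python
from statistics import mean, stdev

def flag_outliers_3sd(values, k=3.0):
    """Return the values lying more than k sample s.d. from the mean."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        return []  # all values identical, nothing to flag
    return [v for v in values if abs(v - m) > k * s]

# Hypothetical readings: twenty values near 50 plus one extreme value.
readings = [48, 49, 50, 51, 52] * 4 + [120]
flagged = flag_outliers_3sd(readings)
print(flagged)  # prints [120]
```

One thing to keep in mind: the outlier itself inflates the mean and the s.d., so in very small samples (about 10 points or fewer) no value can ever be 3 s.d. away and this check will flag nothing.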
Hope this helps!!