Treatment of Outliers



What is the right approach for outliers when they are natural (i.e., not errors)?

In my case, I have a variable with 70K observations.
About 3,400 of them are outliers, and their magnitude is relatively high. For example, the mean of the variable is 500, while the outliers are in the range of 2k to 12k.
The data is from telecom, so it is quite likely that a few customers account for disproportionately high usage.
If the outliers are natural, how should I deal with them?


One way to handle it is to use a logarithmic transformation. This compresses the large values so the points sit closer together, and you can then proceed as usual. Another possibility is to apply clustering to segment the data by the given feature(s) and then do the analysis/modeling separately for each cluster.
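As a rough illustration of the log transform, here is a minimal sketch using only the Python standard library. The data is synthetic (a lognormal sample I made up to mimic skewed usage), and `skew_ratio` is a hypothetical helper, not something from your dataset or any library. It uses `log1p` rather than `log` on the assumption that some customers may have zero usage.

```python
import math
import random
import statistics

random.seed(0)

# Synthetic, right-skewed "usage" values (made up for illustration):
# most customers sit at a few hundred units, a few reach the thousands.
usage = [random.lognormvariate(6.0, 0.8) for _ in range(10_000)]

# log1p(x) = log(1 + x) is defined at zero usage, unlike log(x).
log_usage = [math.log1p(x) for x in usage]


def skew_ratio(values):
    """Mean divided by median; close to 1 for a symmetric distribution."""
    return statistics.mean(values) / statistics.median(values)


raw_ratio = skew_ratio(usage)
log_ratio = skew_ratio(log_usage)
print(f"raw: mean/median = {raw_ratio:.2f}")
print(f"log: mean/median = {log_ratio:.2f}")

# expm1 inverts log1p, so results can be mapped back to the original scale.
recovered = [math.expm1(v) for v in log_usage]
```

On the raw scale the mean sits well above the median because of the heavy tail; after the transform the two should be nearly equal, which is what "gathering the points together" amounts to.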

Which approach is better depends on the data. If there is no hidden pattern in the outliers, the log transform works fine. Otherwise, clustering will do a better job of capturing that information.
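To make the clustering option concrete, here is a small sketch of segmenting a single usage feature into two groups. It is a hand-rolled 1-D version of k-means (Lloyd's algorithm) with k=2, written in plain Python so it runs without extra packages; in practice you would more likely use a library implementation. The data, the group sizes, and the function name `two_means_1d` are all assumptions made for illustration.

```python
import random
import statistics

random.seed(1)

# Made-up data: a large "regular" group and a small "heavy user" group.
regular = [random.gauss(500, 100) for _ in range(950)]
heavy = [random.gauss(6000, 1500) for _ in range(50)]
usage = regular + heavy


def two_means_1d(values, n_iter=50):
    """Lloyd's algorithm for k=2 on a single feature.

    Returns the two centroids and a 0/1 label per value.
    """
    centroids = [min(values), max(values)]  # simple deterministic start
    labels = []
    for _ in range(n_iter):
        # Assignment step: each value goes to its nearest centroid.
        labels = [
            0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
            for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            new_centroids.append(
                statistics.mean(members) if members else centroids[k]
            )
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, labels


centroids, labels = two_means_1d(usage)
segments = {
    k: [v for v, lab in zip(usage, labels) if lab == k] for k in (0, 1)
}
for k, seg in segments.items():
    print(f"segment {k}: n={len(seg)}, mean={statistics.mean(seg):.0f}")
```

Each segment can then be analyzed or modeled on its own, so whatever structure the heavy users have is not squashed by the transform or drowned out by the majority.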

Also note that if you use tree-based methods, there is no need to log-transform the features: the splits depend only on the ordering of the feature values, and a monotonic transform such as the log preserves that ordering.
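The invariance can be checked with a toy example. The sketch below finds the best single split (a decision stump scored by squared error, which is my simplification of what a regression tree does at each node) on a raw feature and on its log, and compares which rows end up on the left side. The data and the helper names `sse` and `best_split` are hypothetical.

```python
import math
import random

random.seed(2)

# Made-up skewed feature and a target that jumps for heavy users.
x = [random.lognormvariate(6.0, 1.0) for _ in range(500)]
y = [(1.0 if xi > 1500 else 0.0) + random.gauss(0, 0.1) for xi in x]


def sse(values):
    """Sum of squared deviations from the mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(feature, target):
    """Return the set of row indices sent left by the best single split.

    Each candidate split falls between two consecutive sorted feature
    values; the score is the total within-child squared error.
    """
    order = sorted(range(len(feature)), key=lambda i: feature[i])
    best_score, best_left = float("inf"), None
    for cut in range(1, len(order)):
        if feature[order[cut - 1]] == feature[order[cut]]:
            continue  # no threshold fits between equal values
        left, right = order[:cut], order[cut:]
        score = sse([target[i] for i in left]) + sse([target[i] for i in right])
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left


left_raw = best_split(x, y)
left_log = best_split([math.log(v) for v in x], y)
print("same partition:", left_raw == left_log)
```

The numeric threshold differs between the two scales, but the training rows are partitioned identically, because the candidate splits depend only on the sort order of the feature.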