Q4 Why remove outlier values from a variable?

r

#1

I am currently trying to build a classification model, and as a first step I am doing data exploration. I have read that we should remove the outliers from each variable. I want to know why we should remove outliers and what methods there are for removing them.
For example:

quantile(cnt_df$Team.size.all.employees, probs = seq(0, 1, by = 0.05), na.rm = TRUE)
Output:
0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 
1.0 3.0 4.0 6.0 8.0 10.0 10.0 10.0 11.0 14.0 16.5 23.0 
60% 65% 70% 75% 80% 85% 90% 95% 100% 
30.0 40.0 50.0 50.0 50.0 50.0 80.0 200.0 5000.0

#2

@harry-
We should remove outliers from a variable because they can distort the results when the variable is used for model building. For example, if we replace the NA values of this variable with its mean and the variable contains outliers, the mean will not represent the variable well, and the model built on it will give poorer results.
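
To make the point about the mean concrete, here is a minimal sketch. The vector `team_size` is made-up data (not the real `cnt_df` column), and the median is included only for comparison as a statistic that is less sensitive to extreme values.

```r
# Hypothetical team sizes; 5000 plays the role of the extreme value
team_size <- c(3, 8, 10, 10, 16, 30, 50, 50, NA, 5000)

mean_all   <- mean(team_size, na.rm = TRUE)    # pulled up by the 5000
median_all <- median(team_size, na.rm = TRUE)  # barely affected by it

# Fill the NA with each statistic to see what value gets imputed
filled_mean   <- ifelse(is.na(team_size), mean_all, team_size)
filled_median <- ifelse(is.na(team_size), median_all, team_size)

print(c(mean = mean_all, median = median_all))
```

With this toy data the mean lands in the hundreds even though most teams have 50 people or fewer, so a mean-imputed NA would itself look like an outlier.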

There are several ways to deal with outliers.
1- You can use the mode of the variable to fill in missing values.
2- You can set a threshold value. For example, in the given data there is a sudden jump after the 90th percentile, so you can remove the values above it.
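
The two options above could be sketched roughly as follows. This assumes made-up data in `team_size` (in the thread it would be `cnt_df$Team.size.all.employees`), and `stat_mode` is a hypothetical helper written here because base R has no built-in function for the statistical mode.

```r
# Hypothetical data standing in for the real column
team_size <- c(1, 3, 4, 10, 10, 10, 16, 30, 50, 50, 80, 200, 5000, NA)

# 1. Fill missing values with the mode (the most frequent value).
#    stat_mode is a hypothetical helper, not a base R function.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
filled <- team_size
filled[is.na(filled)] <- stat_mode(team_size)

# 2. Use the 90th percentile as the threshold and drop values above it
threshold <- quantile(filled, probs = 0.90, names = FALSE)
trimmed <- filled[filled <= threshold]

print(threshold)
print(trimmed)
```

The 90% cut-off is only the one suggested by the jump in this thread's quantile output; for another variable the threshold would need to be chosen by looking at its own distribution.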

Hope this helps!

Regards,
Hinduja


#3

@harry, thanks for the explanation. One more question: if there is a large number of observations in the 80-200 range, should I still cap the value at 50, or should I cap it at 200?


#4

@Azim-
It depends entirely on the importance of the variable. If the values you are removing are outliers and removing them does not affect the modelling, we should remove them.
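
One way to inform that decision is to check how many observations each candidate cap would change before committing to it. The sketch below uses made-up data in `team_size`, and the caps 50 and 200 are simply the two values raised in the question.

```r
# Hypothetical data standing in for the real column
team_size <- c(1, 3, 4, 10, 10, 16, 30, 50, 50, 80, 120, 200, 5000)

# Share of observations that each candidate cap would alter
share_changed <- sapply(c(cap_50 = 50, cap_200 = 200),
                        function(cap) mean(team_size > cap))

# Capping keeps every row but pulls extreme values down to the cap
capped_200 <- pmin(team_size, 200)

print(share_changed)
print(capped_200)
```

If a cap would change a large share of the rows, those values are probably part of the normal range rather than outliers, which argues for the higher cap.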

Hope this helps!

Regards,
Hinduja