Outlier Analysis

data_science
python

#1

(a) The box plot method is used to spot and remove outliers in a continuous variable. Is it possible to have outliers in a categorical variable? If yes, what is the correct technique to remove them?
(b) Between box plot analysis and Replace with NA, which is more effective, and why?


#2

Hi @swarup17,

You can use the value_counts() function to identify outliers in categorical variables, that is, categories that occur very rarely.
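
A minimal sketch of that idea with pandas, assuming made-up data and an arbitrary 5% cutoff for what counts as "rare" (both are illustrative assumptions, not a rule):

```python
import pandas as pd

# Hypothetical data: "green" and "purple" appear far less often than the rest.
colors = pd.Series(["red"] * 50 + ["blue"] * 45 + ["green"] * 4 + ["purple"] * 1)

# Relative frequency of each category, most frequent first.
freq = colors.value_counts(normalize=True)

# The 5% threshold is an assumption; pick one that suits your data.
threshold = 0.05
rare_categories = freq[freq < threshold].index.tolist()

print(freq)
print("Rare categories:", rare_categories)
```

Whether to drop, merge (for example into an "other" level), or keep such categories depends on the problem.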

These are two completely different things. You would use box plot analysis to look at the distribution of the data. Replace NA is for filling in missing values in the data. So can you explain why you are trying to compare these two techniques?


#3

Hi Aishwarya,

Thanks for the reply.
As per my knowledge:
(b) In the case of Replace NA:
After identifying the outliers via box plot analysis, if the rows containing them are important for running our model, the outliers need to be replaced by NA and later filled in through missing value analysis. So Replace NA is another technique that can be used for identifying the outliers.
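
The workflow described above could be sketched roughly as follows. This assumes the usual 1.5 × IQR box-plot rule for flagging outliers and median imputation for the resulting NAs; the column name and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one extreme value.
df = pd.DataFrame({"income": [30.0, 32.0, 35.0, 31.0, 33.0, 34.0, 250.0]})

# Box-plot rule: anything beyond 1.5 * IQR from the quartiles is flagged.
q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)

# Step 1: replace the flagged values with NA instead of dropping the rows.
df.loc[is_outlier, "income"] = np.nan

# Step 2: treat them like any other missing value (median is an assumption here).
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```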

Correct me if I am wrong.


#4

Hi @swarup17,

Replace NA is not to replace the outliers, but to replace the missing values. It is not a technique to replace outliers.


#5

Ok, Thanks for the info.


#6

Hi Swarup,

Replace NA values in categorical variables with the mode. The mode is robust to outliers, whereas the mean is sensitive to them (the median is also fairly robust, but like the mean it is not defined for unordered categories).
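
A small sketch of mode imputation with pandas, using a made-up column (the values are only for illustration):

```python
import pandas as pd

# Hypothetical categorical column with two missing values.
city = pd.Series(["Pune", "Delhi", "Pune", None, "Mumbai", "Pune", None])

# mode() ignores missing values and can return several values on a tie;
# taking the first one is a simplifying assumption.
most_frequent = city.mode().iloc[0]

city_filled = city.fillna(most_frequent)
print(city_filled.value_counts())
```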

For categorical variables, we cannot use a box plot to detect outliers, because the categories are distinct from one another and have no numeric scale. You can draw a count plot to see which categories are frequent and which are rare.

Thanks
Ravi


#7

Hello! @swarup17

Yes, the box plot is used to treat outliers in continuous variables.
A categorical variable does not have outliers in the usual sense. A category may have far fewer observations than the others, and it might show up as an outlier if you box-plot the counts, but removing it may not be a wise choice because you would actually be removing a whole category from that variable.
Using the mode to replace missing values of a categorical variable is the preferred choice, but mainly when you have only a few NAs to replace, because filling many values with the mode inflates that category's share and distorts the distribution.
You can also build a model to predict the missing values of the categorical variable, which I think is more effective when you have a lot of missing values.


#8

To draw a box plot, your data must be numeric, and ideally continuous.


#9

A modified box plot can be considered. Unlike the standard box plot, a modified box plot does not extend the whiskers out to the outliers. Instead, the outliers are drawn as individual points beyond the whiskers, which represents the dispersion of the data more accurately.
The modified box plot is the default in R.
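
As a rough Python sketch of what a modified box plot computes, assuming the common 1.5 × IQR fences and NumPy's default percentile interpolation (R's boxplot() uses Tukey's hinges, so its quartiles can differ slightly); the data is made up:

```python
import numpy as np

# Hypothetical sample with one extreme value.
values = np.array([12, 13, 14, 15, 15, 16, 17, 18, 19, 45])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: points outside them are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

# The whiskers stop at the most extreme points that are still inside the fences.
whisker_low, whisker_high = inside.min(), inside.max()

print("Whiskers:", whisker_low, whisker_high)
print("Outliers:", outliers)
```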