Seeking methods to compare the techniques used in Outlier treatment

Tags: dataexploration, r

#1

While applying outlier treatment methods in different situations, I realised that I had no measure to compare the different methods. Can anyone suggest a means of comparing the percentage of outliers reduced by applying methods like log transformation, taking the cube root, etc.?

Also, what is the threshold beyond which we need to model outliers distinctly?

Thanks in advance!


#2

@shashwat.2014

Let us take some random data and analyse it before and after applying a transformation.

data <- data.frame(x = seq(1, 1000))

# Normally distributed random data
data$y <- rnorm(1000, 4, 1)
plot(data$x, data$y)

# Fit a smooth curve, assuming we know nothing about the underlying distribution
model <- smooth.spline(data$x, data$y)
plot(data)
lines(model, col = "blue")

# This method is based entirely on the distance from the smoothed curve
data$mean <- predict(model, data$x)$y
data$diff <- ((data$y - data$mean) / data$mean)^2   # squared relative deviation
data <- data[order(-data$diff), ]                   # largest deviations first

# Plot the top 10 deviations (either the top n points or those above a chosen threshold)
plot(data$x, data$y)
lines(model, col = "blue")
points(data[1:10, ]$x, data[1:10, ]$y, col = "red")
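To get at the original question about the percentage of outliers, one simple comparable number is the share of points whose squared relative deviation exceeds a threshold. As a rough sketch (the helper name outlier_pct and the 0.05 cutoff are my own illustrative choices, not from the post):

# Hypothetical helper: percentage of points flagged as outliers.
# The 0.05 cutoff is purely illustrative; pick it from visual analysis.
outlier_pct <- function(d, threshold = 0.05) {
  100 * mean(d$diff > threshold)
}
pct_before <- outlier_pct(data)   # record this before transforming y

Recomputing the same number after each candidate transformation makes the "percentage of outliers reduced" directly comparable.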

# Now I will apply some transformation

######### LOG ####################
data$y <- log(data$y)   # log requires positive values; rnorm(1000, 4, 1) is positive with near certainty

# Refit the smooth curve on the transformed data
model <- smooth.spline(data$x, data$y)
plot(data$x, data$y)   # data now has extra columns, so plot x and y explicitly
lines(model, col = "blue")

# Score points by distance from the smoothed curve, as before
data$mean <- predict(model, data$x)$y
data$diff <- ((data$y - data$mean) / data$mean)^2
data <- data[order(-data$diff), ]

# Plot the top 10 deviations again
plot(data$x, data$y)
lines(model, col = "blue")
points(data[1:10, ]$x, data[1:10, ]$y, col = "red")

You will see that the range of the deviations has changed. This will be true for every transformation. Based on the nature of the data, we have to choose the most appropriate transformation.
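One way to compare candidates side by side is to recompute the same deviation metric under each transformation and compare the summaries. A minimal sketch under my own naming (deviation_summary is a hypothetical helper, and the candidate set is illustrative):

# Hypothetical helper: refit the spline and summarise the squared
# relative deviations for a given version of y
deviation_summary <- function(x, y) {
  fit <- smooth.spline(x, y)
  mu  <- predict(fit, x)$y
  d   <- ((y - mu) / mu)^2
  quantile(d, c(0.5, 0.99, 1))   # median, 99th percentile, maximum
}

y_raw <- rnorm(1000, 4, 1)   # fresh untransformed sample, positive with near certainty
x     <- seq_along(y_raw)
sapply(list(identity = identity,
            log      = log,
            cuberoot = function(v) v^(1/3)),
       function(f) deviation_summary(x, f(y_raw)))

Each column of the resulting matrix summarises one transformation, so the tail behaviour (99th percentile and maximum) can be compared directly.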

The metrics that can be used are:

    1. Deviation from the best-fit curve (which I have used here)
    2. Cluster analysis (k-means)

If I apply k-means clustering to the original (untransformed) y, then (note that kmeans() starts from random centres, so the exact numbers will vary between runs):

data$z <- scale(data$y)
kmeans(data$z, 6)

K-means clustering with 6 clusters of sizes 234, 244, 139, 216, 113, 54

Cluster means:
        [,1]
1  0.4890484
2 -0.1350999
3  1.1713196
4 -0.7783585
5 -1.6766376
6  2.0981320

Within cluster sum of squares by cluster:
[1]  7.264497  7.244164  6.677956 10.098386 23.295321  7.060033
 (between_SS / total_SS =  93.8 %)

After the log transformation, the same two lines give:

data$z <- scale(data$y)
kmeans(data$z, 6)
K-means clustering with 6 clusters of sizes 114, 206, 254, 278, 132, 16

Cluster means:
         [,1]
1 -1.53359075
2 -0.60790059
3  0.02595344
4  0.62930994
5  1.35159480
6 -3.74337400

Within cluster sum of squares by cluster:
[1] 14.726094  9.513906  7.014095  9.545692 10.416358 27.929292
 (between_SS / total_SS =  92.1 %)

The fraction of the deviation explained by the clusters has dropped (between_SS / total_SS fell from 93.8% to 92.1%), so we can conclude that the transformation was not appropriate here.
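If you prefer to read this ratio off programmatically rather than from the printed summary, the object returned by kmeans() exposes the sums of squares directly (km below is just that returned object):

km <- kmeans(data$z, 6)
km$betweenss / km$totss   # fraction of total variance explained by the clustering

A drop in this ratio after a transformation is exactly the signal used above.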

Hope this gives you some ideas.

Regards,
Anant


#3

Is your hypothesis applicable to other datasets too?


#4

Hi @jalFaizy

What I wanted to convey is that it is difficult to come up with a single metric that tells you whether a transformation is good.
Outliers are always analysed visually first. Once we do that, we get a feel for the data and can automate the outlier removal, but the initial analysis needs to be done by hand.
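For what it's worth, once the visual analysis has fixed a sensible threshold, the removal step itself is easy to automate; a minimal sketch, where the 0.05 cutoff is again purely illustrative:

# Keep only points whose squared relative deviation is within the chosen threshold
clean <- data[data$diff <= 0.05, ]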

I tried this with financial data and the method was not applicable there.
I tried it with Unix inode analysis and it worked well there :slight_smile:

Regards,
Anant


#5

@anantguptadbl
Thanks a lot for such a detailed answer. I will definitely explore these techniques in my future analyses.