Outlier treatment for Predictive Modeling

outliers
predictive_model

#1

Hi,

I am predicting the business sourced by sales agents over the next 3 months. While looking at the various dependent variables, I realized that there is a small set of agents whose performance is 10 times higher than the average agent's. I have marked the performance of these agents as outliers. Please help me with treatment methods to handle these outliers.

Regards,
Imran


#2

Imran,

The first thing you need to check is how big or small the population of these agents is, and whether it is big enough to be a separate category of agents. Deciding this would involve some judgement. For example, if there are only 1 or 2 such agents in a population of 1000+ agents, you can think of them as outliers. On the other hand, if you have 10 such agents, you may want to treat them (and other agents like them) completely differently.
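As a rough sketch of that first check, the R snippet below counts how many agents sit at or above 10 times the average and what share of the population they make up. The vector `business` and its numbers are made up for illustration; they are not Imran's data.

```r
# Hypothetical data: business sourced per agent (made-up numbers).
business <- c(rep(100, 995), rep(1500, 5))

# Flag agents whose performance is at least 10 times the average.
threshold <- 10 * mean(business)
is_high <- business >= threshold

n_high <- sum(is_high)       # number of such agents
share_high <- mean(is_high)  # their share of the whole population
```

Whether that count is small enough to call them outliers, or large enough to model them as a separate group, is still the judgement call described above.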

Traditionally, outliers are the points lying more than 1.5*IQR (interquartile range) above the upper quartile or below the lower quartile. If your population is spread out between the average value and the outliers, you can also think about using a transformation (e.g. a log transformation).
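A minimal sketch of the 1.5*IQR rule and of a log transformation in R could look like this. The vector `x` is made-up data, and the log step assumes all values are strictly positive.

```r
# Hypothetical data: mostly moderate values plus two very large ones.
x <- c(10, 12, 13, 15, 16, 18, 20, 22, 25, 300, 450)

# Quartiles and interquartile range.
q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- IQR(x)

# Fences: 1.5 * IQR below the lower quartile and above the upper quartile.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers.
is_outlier <- x < lower_fence | x > upper_fence
outliers <- x[is_outlier]

# A log transformation compresses the right tail (needs values > 0).
log_x <- log(x)
```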

Assuming that the outlier population is small and the technique you are using for modeling actually needs outlier treatment (not every technique would need it e.g. Decision trees wouldn’t), there are several ways to go about this treatment:

  • Replace these outliers with the maximum value among the non-outliers
  • If you have a large enough population, you can also remove the outliers from the population completely
  • If you think that the outlier is due to data capture errors and is not reflective of actual performance, you can even substitute the numbers with the mean or median
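The three treatments above can be sketched in R roughly as follows. The data are made up, the outliers are flagged with the 1.5*IQR rule, and taking the median over the non-outliers only is my assumption rather than something stated above.

```r
# Hypothetical data: mostly moderate values plus two very large ones.
x <- c(10, 12, 13, 15, 16, 18, 20, 22, 25, 300, 450)

# Flag outliers with the 1.5 * IQR rule.
lower_fence <- quantile(x, 0.25) - 1.5 * IQR(x)
upper_fence <- quantile(x, 0.75) + 1.5 * IQR(x)
is_outlier <- x < lower_fence | x > upper_fence

# Option 1: replace high outliers with the maximum non-outlier value.
max_ok <- max(x[!is_outlier])
x_capped <- ifelse(x > upper_fence, max_ok, x)

# Option 2: drop the outliers completely.
x_removed <- x[!is_outlier]

# Option 3: substitute outliers with the median of the non-outliers
# (assumption: the median is computed without the outliers).
x_median <- replace(x, is_outlier, median(x[!is_outlier]))
```

Each option writes to a new vector, so the original values stay available for comparison.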

Hope this gives you some ideas to take this forward

Kunal


#3

I have a column named VOICE_LOC_INC_TOT which has outliers on both the lower and the higher side. How do I replace the low values with (quantile(VOICE_LOC_INC_TOT,0.25) - IQR(VOICE_LOC_INC_TOT) *1.5)? I want to replace the values of the column VOICE_LOC_INC_TOT in R.

The following is my R code.
# Lower and upper fences: 1.5 * IQR beyond the first and third quartiles.
VOICE_LOC_INC_TOToutlierslow = quantile(entchurndata$VOICE_LOC_INC_TOT,0.25) - (IQR(entchurndata$VOICE_LOC_INC_TOT) * 1.5 )

VOICE_LOC_INC_TOToutliershigh = quantile(entchurndata$VOICE_LOC_INC_TOT,0.75) + (IQR(entchurndata$VOICE_LOC_INC_TOT) * 1.5 )

# Inspect the rows below the lower fence and above the upper fence.
entchurndata[which(entchurndata$VOICE_LOC_INC_TOT < VOICE_LOC_INC_TOToutlierslow),]

entchurndata[which(entchurndata$VOICE_LOC_INC_TOT > VOICE_LOC_INC_TOToutliershigh),]

# Cap the low side first, writing the result to a new column.
entchurndata$VOICE_LOC_INC_TOT_1 <- ifelse(entchurndata$VOICE_LOC_INC_TOT <= VOICE_LOC_INC_TOToutlierslow,VOICE_LOC_INC_TOToutlierslow,entchurndata$VOICE_LOC_INC_TOT)

# Cap the high side starting from VOICE_LOC_INC_TOT_1 (not the original column), so the low-side capping above is not overwritten.
entchurndata$VOICE_LOC_INC_TOT_1 <- ifelse(entchurndata$VOICE_LOC_INC_TOT_1 >= VOICE_LOC_INC_TOToutliershigh,VOICE_LOC_INC_TOToutliershigh,entchurndata$VOICE_LOC_INC_TOT_1)
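
The capping on both sides can probably also be written more compactly with base R's pmin() and pmax(). The helper name cap_at_fences and the small data frame below are made up for illustration; only the column name comes from this post.

```r
# Hypothetical helper (not from the thread): cap a numeric vector
# at the 1.5 * IQR fences on both sides.
cap_at_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  # pmax() raises values below the lower fence, pmin() lowers values above the upper fence.
  pmin(pmax(x, lower), upper)
}

# Made-up stand-in for entchurndata with one low and one high outlier.
entchurndata <- data.frame(
  VOICE_LOC_INC_TOT = c(-500, 10, 12, 13, 15, 16, 18, 20, 22, 25, 450)
)

# Write the capped values to a new column, leaving the original untouched.
entchurndata$VOICE_LOC_INC_TOT_1 <- cap_at_fences(entchurndata$VOICE_LOC_INC_TOT)
```

Values inside the fences pass through unchanged, and any NA in the column stays NA.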