Outlier Treatment

#1

I am new to data science and working on a logistic regression project. I have a list of continuous variables (revenue, call usage, etc.) that contain some outliers. I ran descriptive statistics on these variables and have the mean, standard deviation, 95th and 99th percentiles, max, and mean ± 3 SD (upper and lower cutoffs). I would like to cap the outliers instead of deleting them from the database. So I would like to know the best, most scientific approach to choosing appropriate capping values from these statistics. That is, should I go with the 95th percentile, the 99th percentile, or 3 SD? Please help.
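A minimal sketch of what capping (winsorizing) at each of those cutoffs looks like, assuming the variable is a NumPy array; the function name `cap_outliers` and the sample `revenue` data are illustrative, not from the original post:

```python
import numpy as np

def cap_outliers(x, method="p99"):
    """Winsorize x at the chosen lower/upper cutoffs.

    method: "p95" caps at the 5th/95th percentiles,
            "p99" caps at the 1st/99th percentiles,
            "3sd" caps at mean +/- 3 standard deviations.
    """
    x = np.asarray(x, dtype=float)
    if method == "p95":
        lo, hi = np.percentile(x, [5, 95])
    elif method == "p99":
        lo, hi = np.percentile(x, [1, 99])
    elif method == "3sd":
        mu, sd = x.mean(), x.std()
        lo, hi = mu - 3 * sd, mu + 3 * sd
    else:
        raise ValueError(f"unknown method: {method}")
    return np.clip(x, lo, hi)

# Hypothetical revenue data with one extreme value
revenue = np.array([10, 12, 11, 13, 12, 500], dtype=float)
capped = cap_outliers(revenue, "p95")
```

Note that with a single extreme value, a 3 SD cutoff can be so inflated by that same value that it caps nothing, which is one argument for percentile-based cutoffs.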

#2

But before applying any of these rules or mathematical approaches, please look at the behavior of the target variable with respect to the outliers of the independent variables. Sometimes you may find interesting patterns.
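One rough way to do this check, assuming a binary target `y` and a continuous predictor `x` (the data here is simulated for illustration): flag the extreme rows and compare the target rate inside versus outside the flag. If the rates differ markedly, capping would erase a real signal.

```python
import numpy as np

# Simulated data: x is a skewed continuous predictor, y a binary target
# whose positive rate rises with x (purely illustrative).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
y = (rng.random(1000) < np.clip(0.1 + 0.02 * x, 0, 1)).astype(int)

outlier = x > np.percentile(x, 95)    # simple flag: top 5% of x
rate_out = y[outlier].mean()          # target rate among outliers
rate_in = y[~outlier].mean()          # target rate everywhere else
print(f"target rate  outliers: {rate_out:.2%}  rest: {rate_in:.2%}")
```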

#3

Hi, the book "Applied Predictive Modeling" (by Max Kuhn, author of the caret package) describes an interesting data transformation: the "spatial sign". He writes: "this procedure projects the predictor values onto a multidimensional sphere. This has the effect of making all the samples the same distance from the center of the sphere." You'll find this on pages 34-35 of the above-mentioned book. I have not tested it myself yet, but the transformation is included in the caret package and can be easily applied. Best, Alex
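caret is an R package, but for readers working in Python, here is a rough sketch of the same idea as I understand it from the book: standardize each predictor, then divide each row by its Euclidean norm so that every sample lands on the unit sphere. The function name `spatial_sign` and the sample matrix are my own; this is an approximation of caret's preprocessing, not its implementation.

```python
import numpy as np

def spatial_sign(X):
    """Project each row of X onto the unit sphere after centering/scaling.

    Approximates the "spatial sign" transform: standardize each column,
    then divide each row by its Euclidean norm so all samples are the
    same distance from the origin.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize columns
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                        # avoid division by zero
    return Z / norms

# Hypothetical data with one extreme row
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [100.0, -5.0]])
S = spatial_sign(X)
# every row of S now has unit length, so the extreme row cannot dominate
```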

#4

Thank you, Paul. This is really helpful and informative. I would like to know what we should do first in data exploration: missing value treatment or outlier treatment?

#5

Before you cap the extreme values, try to ascertain whether they are:

• Measurement or data entry errors
• Genuine data points which lie far from the rest of the distribution
• Special cases or novelties

Though it is context dependent, as a general rule the 95% cutoff (about 2 SD) can be a good limit. Alternatively, you can use 1.5 × the interquartile range (IQR) to detect and cap outliers.
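A minimal sketch of the 1.5 × IQR rule (Tukey's fences), assuming a NumPy array; the function name `iqr_cap` and the sample `usage` data are illustrative:

```python
import numpy as np

def iqr_cap(x, k=1.5):
    """Cap values outside Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return np.clip(x, q1 - k * iqr, q3 + k * iqr)

# Hypothetical call-usage data with one extreme value
usage = np.array([3, 4, 5, 4, 6, 5, 40], dtype=float)
capped = iqr_cap(usage)   # the 40 is pulled down to the upper fence
```

Unlike mean ± k·SD cutoffs, the IQR fences are based on quartiles, so they are not themselves inflated by the very outliers you are trying to cap.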