I am new to data science and working on a logistic regression project. I have a list of continuous variables (revenue, call usage, etc.) that contain some outliers. I ran descriptive statistics on these variables and found the mean, standard deviation, 95th percentile, 99th percentile, max, and the 3-SD upper and lower cutoffs (UC & LC). I would like to cap the outliers instead of deleting them from the database, so I want to know the best, most scientific approach to choosing appropriate capping values from these statistics. In other words, should I go with the 95th percentile, the 99th percentile, or 3 SD? Please help.
Please go through the link below:
But before applying any of these logics or mathematical approaches, please look at the behavior of the target variable with respect to the outliers of the independent variables. Sometimes you may find some interesting patterns.
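One simple way to do this check is to flag the outlier rows and compare the target rate inside and outside that group. A minimal sketch, assuming a hypothetical DataFrame with a continuous predictor `revenue` and a binary target `churn` (both made up here for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: "revenue" is a continuous predictor, "churn" a binary target.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "revenue": rng.exponential(scale=100, size=1000),
    "churn": rng.integers(0, 2, size=1000),
})

# Flag rows beyond the 99th percentile of the predictor.
cutoff = df["revenue"].quantile(0.99)
df["is_outlier"] = df["revenue"] > cutoff

# Compare the target rate inside vs outside the outlier group.
# A large gap suggests the extreme values carry signal and should
# not be capped away blindly.
print(df.groupby("is_outlier")["churn"].mean())
```

If the churn rate among the flagged rows differs sharply from the rest, the "outliers" may be a meaningful segment rather than noise.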
Hi, the book "Applied Predictive Modeling" (by Max Kuhn, author of the CARET package) describes an interesting data transformation: the "spatial sign". He writes: "this procedure projects the predictor values onto a multidimensional sphere. This has the effect of making all the samples the same distance from the center of the sphere." You'll find this on pages 34-35 of the above-mentioned book. I have not tested it myself yet, but the transformation is included in the CARET package and can be easily applied. Best, Alex
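For readers not using R, the idea behind the transform is easy to sketch in NumPy: each sample (row) is divided by its Euclidean norm, so every sample lands on the unit sphere. This is only an illustrative sketch of the concept, not the CARET implementation; Kuhn recommends centering and scaling the predictors first, which is assumed to have been done here.

```python
import numpy as np

def spatial_sign(X):
    """Project each row of X onto the unit sphere (spatial-sign idea).

    Assumes X has already been centered and scaled.
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # leave all-zero rows unchanged
    return X / norms

# Toy example: the extreme third row ends up at the same distance
# from the origin as every other row, blunting its influence.
X = np.array([[1.0, 2.0], [3.0, -1.0], [100.0, 200.0]])
Z = spatial_sign(X)
print(np.linalg.norm(Z, axis=1))
```

Because every transformed sample sits at distance 1 from the center, no single extreme observation can dominate distance-based calculations.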
Thank you Paul. This is really helpful and informative. I would like to know what we should do first in data exploration: missing value treatment or outlier treatment?
Before you cap the extreme values, try to ascertain whether they are:
- Measurement or data entry errors
- Data points which are far from the rest of the distribution
- Special cases or novelties
Though it is context dependent, as a general rule the 95% limit (roughly mean ± 2 SD for normally distributed data) can be a good cutoff. Alternatively, you can use the Tukey fences (1.5 × interquartile range beyond Q1 and Q3) to detect and cap outliers.
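Both rules above reduce to clipping the variable at a lower and upper bound. A minimal sketch of the two capping approaches, using a hypothetical skewed `revenue` series as an example:

```python
import numpy as np
import pandas as pd

def cap_sd(s, k=2.0):
    """Cap a series at mean ± k standard deviations."""
    lo, hi = s.mean() - k * s.std(), s.mean() + k * s.std()
    return s.clip(lower=lo, upper=hi)

def cap_iqr(s, k=1.5):
    """Cap a series at the Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical skewed variable such as revenue.
rng = np.random.default_rng(1)
revenue = pd.Series(rng.exponential(scale=100, size=1000))

capped_sd = cap_sd(revenue)
capped_iqr = cap_iqr(revenue)
print(revenue.max(), capped_sd.max(), capped_iqr.max())
```

Note that for a heavily skewed variable like this one, the mean ± 2 SD rule is itself distorted by the outliers it is trying to cap, which is why the IQR-based fences are often preferred for skewed data.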