Which Logistic Regression model should be selected among several models?



I am fitting a logistic regression on telecom churn data.

There are around 10,000 records, and the dataset has no missing values. During univariate analysis I found that a few variables, such as AVG_BILL_AMOUNT_3MONTHS and VOICE_LOC_INC_TOT, have outliers on both sides, so I calculated the outlier bounds with the following code:

AVG_BILL_AMOUNT_3MONTHSoutliershigh1 <- quantile(entchurndata_old$AVG_BILL_AMOUNT_3MONTHS, 0.75) + 1.5 * IQR(entchurndata_old$AVG_BILL_AMOUNT_3MONTHS)

and replaced the values beyond the bound with the bound itself, using a for loop. Should I build the logistic regression model without replacing any outliers, or should I fix the outliers first and then fit the model? I have also built a random forest using all the variables of the telecom churn data. Do I need to take only the positively contributing variables from the random forest's feature importance chart to build the logistic regression?
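For reference, the capping step described above can be done without a for loop. The following is only a sketch: `cap_outliers` is a hypothetical helper name, the data is synthetic rather than the real `entchurndata_old`, and it assumes both the lower and the upper fence use the same 1.5 * IQR rule as the line above.

```r
# Hypothetical helper: cap values outside the Tukey fences
# (Q1 - k * IQR, Q3 + k * IQR) at the fence values.
cap_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  # pmax/pmin are vectorised, so no explicit loop is needed
  pmin(pmax(x, lower), upper)
}

# Synthetic stand-in for a column such as AVG_BILL_AMOUNT_3MONTHS
set.seed(1)
bill <- c(rnorm(200, mean = 500, sd = 50), 5000, -3000)
bill_capped <- cap_outliers(bill)
range(bill_capped)
```

Keeping the original column alongside the capped one makes it easy to fit the model both ways and compare.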

I have other questions regarding model validation.

  1. The first time, I built the logistic regression using all the meaningful variables.
  2. The second time, I used only the variables with a p-value less than 0.05.
  3. The third time, I built the model using the variables with a p-value less than 0.05 plus the interactions of the lowest p-value variable with the other variables.
  4. In total I have built around nine models with different sets of variables.

I have used AIC, residual deviance, AUC, the confusion matrix and ROC curves (via ROCR) as metrics for model validation.
The predictive accuracy of the models on the test data is close to 0.94…

Model4 has lowest AIC value.
Model1 has lowest residual deviance.
Model3 has highest TPR and accuracy of 0.947 from confusion matrix
Model4 has highest AUC value

Given the above, which model is best?
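To make this kind of comparison concrete, here is a rough sketch that puts AIC and residual deviance side by side for a few candidate models. The data, variable names (`x1`, `x2`, `x3`, `churn`) and model labels are all made up for illustration. One thing it shows is that residual deviance on the training data cannot increase when terms are added to a nested model, so the model with all variables having the lowest deviance is expected and not by itself evidence that it is better.

```r
# Synthetic data standing in for the churn dataset (all names are hypothetical)
set.seed(4)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$churn <- rbinom(n, 1, plogis(-0.5 + toy$x1 - 0.8 * toy$x2))

# Three candidate models, loosely mirroring the steps listed above
models <- list(
  all_vars     = glm(churn ~ x1 + x2 + x3, data = toy, family = binomial),
  significant  = glm(churn ~ x1 + x2,      data = toy, family = binomial),
  interactions = glm(churn ~ x1 * x2,      data = toy, family = binomial)
)

# One row per model: number of coefficients, AIC and residual deviance
comparison <- data.frame(
  model             = names(models),
  n_coef            = sapply(models, function(m) length(coef(m))),
  AIC               = sapply(models, AIC),
  residual_deviance = sapply(models, deviance)
)
comparison
```

AIC penalises the extra coefficients, which is why it can prefer a different model than residual deviance does.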

Please provide some suggestions on model cross validation methods and best sampling technique.
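As one possible approach to cross-validation, the sketch below runs a stratified k-fold cross-validation in base R and reports the AUC per fold. Everything here is an assumption for illustration: `auc` and `cv_auc` are hypothetical helpers, the data is synthetic, and the AUC is computed with the rank (Mann-Whitney) formula rather than with ROCR.

```r
# AUC via the rank (Mann-Whitney) formula; y is 0/1, score is a predicted probability
auc <- function(y, score) {
  r <- rank(score)  # average ranks handle ties
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical helper: stratified k-fold CV for a logistic regression formula
cv_auc <- function(formula, data, k = 5, seed = 42) {
  set.seed(seed)
  y <- data[[all.vars(formula)[1]]]  # response is the first variable in the formula
  fold <- integer(nrow(data))
  for (cls in unique(y)) {
    # assign folds within each class so the churn rate is similar across folds
    idx <- which(y == cls)
    fold[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  sapply(seq_len(k), function(i) {
    fit <- glm(formula, data = data[fold != i, ], family = binomial)
    p <- predict(fit, newdata = data[fold == i, ], type = "response")
    auc(y[fold == i], p)
  })
}

# Synthetic data; column names are made up
set.seed(1)
n <- 1000
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$churn <- rbinom(n, 1, plogis(-1 + 1.5 * toy$x1))

auc_small <- cv_auc(churn ~ x1, toy)
auc_full  <- cv_auc(churn ~ x1 * x2, toy)
c(small = mean(auc_small), full = mean(auc_full))
```

Using the same seed for every candidate model keeps the folds identical, so differences in mean AUC reflect the models and not the split.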



First things first: you should decide whether the outliers actually need any treatment, or whether they contain meaningful information that you want to segment out. If too many values fall outside the bounds, you can consider treating those records as a separate population altogether.
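A quick way to judge "too many" is to compute, per variable, the share of values that fall outside the bounds. This is only a sketch on synthetic data: `outlier_share` is a hypothetical helper, and it assumes the 1.5 * IQR rule from the question.

```r
# Hypothetical helper: share of values outside the Tukey fences
outlier_share <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  mean(x < q[1] - k * iqr | x > q[2] + k * iqr, na.rm = TRUE)
}

# Synthetic stand-ins for two of the columns named in the question
set.seed(2)
toy <- data.frame(
  AVG_BILL_AMOUNT_3MONTHS = c(rnorm(95, 500, 50), rep(5000, 5)),
  VOICE_LOC_INC_TOT       = rnorm(100, 200, 20)
)

# One share per column, as a named numeric vector
shares <- sapply(toy, outlier_share)
round(shares, 3)
```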

If not, then you can treat the outliers.

Next, if the metrics of all four models are this close, the final selection should be guided by ease of implementation and simplicity.

So I would remove the interaction terms and keep only the significant variables (as in your step 2). This should give you the best result from a business perspective.



I really appreciate your help. As I understand it, we check whether a variable contains outliers using a boxplot. How do I decide whether the outliers actually need any treatment? This data was given by the business, and according to them it is real data. As I said earlier, a few variables contain outliers on both sides, and these outliers make up less than 5% of the data. Given that, can I ignore the outliers (5% of the data), or should I treat them?
Regarding segmentation, let us say I have 9 variables, of which 5 have outliers on both sides. Can I use k-means clustering with k=5 on a dataset that contains only those 5 variables?
If you have any good links on outlier treatment and segmentation, please share them.
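The k-means idea described above could be sketched roughly as follows. The data and column names (`var1` to `var9`, `outlier_vars`) are hypothetical. Note that the number of clusters k is a separate choice from the number of variables, so k=5 here is just the value from the question; the last lines compute the total within-cluster sum of squares for several k as one common way to pick it.

```r
# Synthetic data with 9 numeric variables; names are made up
set.seed(3)
toy <- as.data.frame(matrix(rnorm(300 * 9), ncol = 9))
names(toy) <- paste0("var", 1:9)

# Hypothetical: the 5 variables that showed outliers
outlier_vars <- paste0("var", 1:5)

# k-means is distance based, so put the variables on a common scale first
scaled <- scale(toy[, outlier_vars])

# k = 5 as in the question; nstart = 25 tries several random starts
km <- kmeans(scaled, centers = 5, nstart = 25)
toy$segment <- km$cluster
table(toy$segment)

# Total within-cluster sum of squares for k = 2..8 (an "elbow" check)
wss <- sapply(2:8, function(k) kmeans(scaled, centers = k, nstart = 25)$tot.withinss)
```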