I am performing logistic regression on telecom churn data.
There are around 10,000 records, and there are no missing values in the dataset. While performing univariate analysis, I found that a few variables, such as AVG_BILL_AMOUNT_3MONTHS, VOICE_LOC_INC_TOT and others, have outliers on both sides, so I calculated the outlier thresholds with code like the following:
AVG_BILL_AMOUNT_3MONTHSoutliershigh1 = quantile(entchurndata_old$AVG_BILL_AMOUNT_3MONTHS,0.75) + (IQR(entchurndata_old$AVG_BILL_AMOUNT_3MONTHS) * 1.5 )
I then replaced the outliers using a for loop. Should I build the logistic regression model without replacing any outliers, or should I fix the outliers first and then fit the model? I have also built a random forest using all the variables of the telecom churn data. Should I use only the positively contributing variables from the random forest feature importance chart to build the logistic regression?
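For illustration, the threshold calculation and replacement step can be sketched without a for loop. This is a minimal sketch under assumptions: `cap_outliers_iqr` is a hypothetical helper, the replacement value is assumed to be the fence itself (capping), and `bill` is simulated data standing in for a column such as AVG_BILL_AMOUNT_3MONTHS.

```r
# Hypothetical helper: cap values that fall outside the k * IQR fences.
# Assumption: outliers are replaced by the fence value itself.
cap_outliers_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]            # same value as IQR(x) with default settings
  lower <- q[1] - k * iqr       # lower fence
  upper <- q[2] + k * iqr       # upper fence
  pmin(pmax(x, lower), upper)   # vectorized capping on both sides
}

# Simulated stand-in for a billing column, with one extreme value on each side
set.seed(1)
bill <- c(rnorm(100, mean = 500, sd = 50), 5000, -2000)
capped <- cap_outliers_iqr(bill)
range(capped)
```

The same function could be applied to several columns with `lapply`, which keeps the treatment identical across variables.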
I have other questions regarding model validation.
- The first time, I built a logistic regression using all the meaningful variables.
- The second time, I used only the variables with a p-value less than 0.05.
- The third time, I built the LR using the variables with a p-value less than 0.05 and added interactions between the variable with the lowest p-value and the other variables.
- In total, I have built around nine models with different sets of variables.
I have used AIC, residual deviance, AUC, the confusion matrix and ROC curves (ROCR) as metrics for model validation.
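The comparison described above can be sketched in base R as follows. Everything in it is an assumption for illustration: the data are simulated, the column names `x1`, `x2`, `x3` and `churn` are hypothetical, the three formulas only mimic the "all variables", "reduced" and "interaction" models, and AUC is computed with a small rank-based helper rather than with ROCR so that the block has no package dependencies.

```r
set.seed(42)
n <- 1000
# Simulated stand-in for the churn data; column names are hypothetical
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$churn <- rbinom(n, 1, plogis(-1 + 1.2 * dat$x1 - 0.8 * dat$x2))

# 70/30 train/test split
train_idx <- sample(n, size = 0.7 * n)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Rank-based AUC (Mann-Whitney formulation); y is 0/1, p is a score
auc <- function(y, p) {
  r <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

formulas <- list(
  all_vars = churn ~ x1 + x2 + x3,
  reduced = churn ~ x1 + x2,
  interaction = churn ~ x1 * x2
)

# Fit each candidate on the training rows and score it on the test rows
results <- do.call(rbind, lapply(names(formulas), function(nm) {
  fit <- glm(formulas[[nm]], data = train, family = binomial)
  p <- predict(fit, newdata = test, type = "response")
  pred <- as.integer(p > 0.5)   # 0.5 cutoff is an arbitrary choice
  data.frame(
    model = nm,
    aic = AIC(fit),
    resid_deviance = deviance(fit),
    test_auc = auc(test$churn, p),
    test_accuracy = mean(pred == test$churn),
    test_tpr = sum(pred == 1 & test$churn == 1) / sum(test$churn == 1)
  )
}))
print(results)
```

In this sketch AIC and residual deviance are computed on the training rows, while AUC, accuracy and TPR are computed on the held-out rows, so the two groups of numbers answer different questions. For nested models, residual deviance cannot increase when variables are added, which is worth keeping in mind when a model with more variables shows the lowest value.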
Measured on the test data, the predictive accuracy of the models is close to 0.94…
- Model4 has the lowest AIC value.
- Model1 has the lowest residual deviance.
- Model3 has the highest TPR and accuracy of 0.947 from the confusion matrix.
- Model4 has the highest AUC value.
Based on the above, which model is best?
Please provide some suggestions on model cross-validation methods and the best sampling technique.
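As a point of reference for the question, stratified k-fold cross-validation of a logistic regression can be sketched in base R as below. This is only one common option, not a recommendation drawn from the data above: the data are simulated, the column names are hypothetical, `k = 5` is an arbitrary choice, and `auc` is a small rank-based helper used to avoid package dependencies.

```r
set.seed(123)
n <- 600
# Simulated stand-in for the churn data; column names are hypothetical
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$churn <- rbinom(n, 1, plogis(-1.5 + 1.0 * dat$x1 - 0.7 * dat$x2))

k <- 5
# Stratified folds: assign fold labels separately within each class,
# so every fold keeps roughly the overall churn rate
fold <- integer(n)
for (cls in unique(dat$churn)) {
  idx <- which(dat$churn == cls)
  fold[idx] <- sample(rep_len(seq_len(k), length(idx)))
}

# Rank-based AUC (Mann-Whitney formulation); y is 0/1, p is a score
auc <- function(y, p) {
  r <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Fit on k - 1 folds, evaluate on the held-out fold, repeat for each fold
cv_auc <- sapply(seq_len(k), function(i) {
  fit <- glm(churn ~ x1 + x2, data = dat[fold != i, ], family = binomial)
  p <- predict(fit, newdata = dat[fold == i, ], type = "response")
  auc(dat$churn[fold == i], p)
})

mean_auc <- mean(cv_auc)
print(cv_auc)
print(mean_auc)
```

Running the same loop once per candidate formula, with the same `fold` assignment, would give each model an out-of-sample AUC estimated on identical splits.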