Highly Imbalanced Dataset - AUC of 88%



I have a highly imbalanced churn dataset, i.e. a 99.5% to 0.5% class split. I got an AUC of around 88% with 3 independent variables. The problem is that the predicted probabilities are clustered around 0.004 to 0.005. Can I ignore this, sort the probabilities in descending order, take the top N as the people most likely to churn, and present that to the business, or do I need to do any other validation? It is a boosting model. Logistic regression has the same issue, but its AUC is around 78% with just one independent variable. I do not want to do under- or oversampling.

Please advise.
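
A minimal sketch of how the top-N idea could be checked on a holdout set before it goes to the business: rank by predicted probability and compare the churn rate in the top N with the overall base rate. The function name `top_n_report` and the toy scores and labels are made up for illustration; they are not from the dataset described above.

```python
def top_n_report(scores, labels, n):
    """Rank rows by score (descending) and summarise the top n.

    scores: predicted churn probabilities
    labels: 1 = churned, 0 = stayed (observed on a holdout set)
    """
    if not 0 < n <= len(scores):
        raise ValueError("n must be between 1 and the number of rows")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n])
    total_churners = sum(labels)
    base_rate = total_churners / len(labels)
    precision = hits / n

    return {
        "precision_at_n": precision,  # share of the top n that churned
        "recall_at_n": hits / total_churners if total_churners else 0.0,
        "lift_at_n": precision / base_rate if base_rate else 0.0,
    }


# Toy holdout (hypothetical): scores sit in a narrow band, as in the question.
scores = [0.0051, 0.0049, 0.0042, 0.0041, 0.0040,
          0.0048, 0.0043, 0.0044, 0.0045, 0.0046]
labels = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]

report = top_n_report(scores, labels, n=3)
print(report)
```

Ranking only depends on the order of the scores, so a narrow probability band does not by itself invalidate the list; a lift well above 1 on held-out data is the kind of evidence that would support presenting it.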


You can divide the number of churners by the number of non-churners in the observed data and use this ratio as the cut-off probability for churners. With churn this rare, the ratio is close to the overall churn rate of about 0.5%.
If you don’t want to simply over- or undersample the data, you can explore simple cascading and ensemble cascading for a more conservative balancing. Hope this helps!!
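
The cut-off suggestion can be sketched roughly as follows. The helper names `prior_cutoff` and `flag_churners` and the toy numbers are hypothetical, chosen only to mirror a 0.5% churn rate.

```python
def prior_cutoff(labels):
    """Ratio of churners to non-churners in the observed data."""
    churners = sum(labels)
    non_churners = len(labels) - churners
    if non_churners == 0:
        raise ValueError("need at least one non-churner")
    return churners / non_churners


def flag_churners(scores, cutoff):
    """Mark every score above the cut-off as a predicted churner."""
    return [score > cutoff for score in scores]


# Hypothetical training labels: 5 churners among 1000 customers (0.5%).
labels = [1] * 5 + [0] * 995
cutoff = prior_cutoff(labels)  # 5 / 995, slightly above 0.005

scores = [0.0040, 0.0049, 0.0051, 0.0060]
flags = flag_churners(scores, cutoff)
print(cutoff, flags)
```

The default 0.5 threshold would flag nobody when scores sit near 0.005, which is why a cut-off tied to the observed class ratio is a more natural starting point here.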


I had the same problem when I was developing a fraud model where my response rate was 0.4%. I had a similar problem with the probability distribution too. It’s a common issue with highly imbalanced classes. I managed to solve it with oversampling and bootstrapping: I took several (50) samples, each containing all responders plus a sample drawn with replacement from the non-responders group, and ran a logistic regression on each sample. I then selected the best of those 50 models, one that also performed well on an out-of-sample dataset. When you look closely, the parameter coefficients are not much affected by the resampling; only the intercept changes significantly.
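
A rough sketch of the resampling step, plus the intercept shift it causes. If every responder is kept and non-responders are sampled at a rate beta, the odds in the sample are the true odds divided by beta, so the fitted intercept is too high by -ln(beta) and adding ln(beta) maps it back to the original scale. The function names, the 1:1 sample ratio, and the toy data are assumptions for illustration, not the poster's actual setup; the model fitting itself is left out.

```python
import math
import random


def balanced_bootstrap_samples(responders, non_responders, n_samples=50,
                               neg_per_sample=None, seed=0):
    """Build n_samples datasets: all responders plus non-responders
    drawn with replacement."""
    rng = random.Random(seed)
    if neg_per_sample is None:
        neg_per_sample = len(responders)  # assumption: 1:1 class ratio
    samples = []
    for _ in range(n_samples):
        negatives = rng.choices(non_responders, k=neg_per_sample)
        samples.append(list(responders) + negatives)
    return samples


def corrected_intercept(sample_intercept, n_neg_total, neg_per_sample):
    """Shift an intercept fitted on a resampled dataset back to the
    population scale. beta is the sampling rate of non-responders."""
    beta = neg_per_sample / n_neg_total
    return sample_intercept + math.log(beta)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy data (hypothetical): 4 responders in 1000 rows, i.e. a 0.4% rate.
responders = [("responder", i) for i in range(4)]
non_responders = [("non_responder", i) for i in range(996)]

samples = balanced_bootstrap_samples(responders, non_responders)

# An intercept-only model on a 1:1 sample predicts 0.5, i.e. intercept 0.
intercept = corrected_intercept(0.0, n_neg_total=996, neg_per_sample=4)
print(len(samples), sigmoid(intercept))
```

In this toy case the corrected intercept-only prediction lands back on the 0.4% response rate, which matches the observation that resampling mostly moves the intercept rather than the slopes.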


Try treating the problem as an anomaly detection problem rather than a classification problem. The issue with unbalanced datasets is that there aren’t enough negative examples for the classifier to generalize the rule(s) for classifying negative cases. Azure ML has a module for anomaly detection. Check out this link: http://gallery.cortanaintelligence.com/Experiment/Anomaly-Detection-Credit-Risk-5?share=1
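
To illustrate the anomaly-detection framing in the simplest possible way (this is not the Azure ML module, just a stand-in), one could describe "normal" customers by per-feature mean and standard deviation and score everyone by how far they sit from that profile. The two features, the function names, and the numbers are all invented for the example.

```python
import statistics


def fit_normal_profile(rows):
    """Per-feature mean and standard deviation of the majority class."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant features to avoid dividing by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stdevs


def anomaly_score(row, profile):
    """Sum of squared z-scores: larger means further from 'normal'."""
    means, stdevs = profile
    return sum(((x - m) / s) ** 2 for x, m, s in zip(row, means, stdevs))


# Hypothetical features: (monthly usage, support calls per month).
normal_rows = [(10.0, 1.0), (12.0, 1.2), (11.0, 0.9), (9.0, 1.1), (10.5, 1.0)]
profile = fit_normal_profile(normal_rows)

typical = anomaly_score((10.5, 1.0), profile)
unusual = anomaly_score((30.0, 4.0), profile)
print(typical, unusual)
```

Customers can then be ranked by this score instead of by a classifier probability; a real project would more likely use a dedicated method such as an isolation forest or a one-class SVM.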