How do I predict churn with class imbalance using survival analysis?



Initially I used classification algorithms to predict churn. But a classifier can only predict whether a customer is going to churn or not; it cannot predict when the customer is going to churn, or the probability of surviving to a particular future date.

Then I used survival analysis to predict churn. I fitted a Cox proportional hazards model using the coxph function from the survival package in R (using all data up to 1st August). With that model, I used the predictSurvProb function from the pec package in R to calculate, for every customer who had not churned as of 1st August, the probability of churning by 10th August (predictSurvProb returns survival probabilities; the churn probability is one minus that value).

Using a threshold of 0.5 on these probabilities to decide whether a customer has churned, I got the following results:

Prediction accuracies

84.28% - for all customers (who had not churned as of 1st August)

25.72% - for customers who actually churned between 1st and 10th August

I then checked why the accuracy on churned customers is so low, and found the following counts:

27000 - customers who had not churned as of 1st August

312 - customers, out of the 27000 above, who churned between 1st and 10th August

So there is clearly a class imbalance problem. How do I increase my prediction accuracy on customers who actually churn?

And, if possible, how can I use undersampling, oversampling, SMOTE, etc. with survival analysis for my problem?


The fastest and easiest approach is to compute the proportion of churners in the data and use that as the threshold (cut-off) instead of 0.5. It is not a clean or scientific method, but it usually helps for a quick look at the results. The problem is that the variance will be very high: there will be more misclassification overall, but more of the actual churners will be covered by the predicted churners. If churned is the positive class, this method will end up with many false positives.
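As a rough illustration of the cut-off idea, here is a minimal Python sketch that compares the default 0.5 threshold with a threshold equal to the churn proportion from the question (312 out of 27000). The probabilities and labels are made-up values, and the `confusion` helper is a hypothetical name introduced only for this example; in practice the probabilities would come from the survival model.

```python
# Hypothetical data: predicted churn probabilities (1 - survival probability)
# and true outcomes (1 = churned). These values are invented for illustration.
churn_prob = [0.02, 0.40, 0.01, 0.03, 0.005, 0.015, 0.008, 0.60, 0.011, 0.002]
churned = [0, 1, 0, 1, 0, 0, 0, 1, 0, 0]


def confusion(probs, labels, threshold):
    """Return (tp, fp, fn, tn) when probs >= threshold are flagged as churn."""
    tp = fp = fn = tn = 0
    for p, y in zip(probs, labels):
        flagged = p >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged and y == 0:
            fp += 1
        elif not flagged and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Churn proportion taken from the counts in the question.
base_rate = 312 / 27000

default_cut = confusion(churn_prob, churned, 0.5)
base_cut = confusion(churn_prob, churned, base_rate)

print("threshold 0.5       (tp, fp, fn, tn):", default_cut)
print("threshold base rate (tp, fp, fn, tn):", base_cut)
```

On this toy data the lower cut-off catches more of the churners but also flags more non-churners, which is the trade-off described above.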

You will have to use the techniques you mentioned to convert this into a balanced data set and try modelling again. In other words, balance the data first and then perform both the survival analysis and the prediction.

The other approach I can think of is to keep survival analysis and classification as two separate analyses and combine them to interpret the result. That is, build a separate classification model for the prediction, build a survival model, and see whether combining the two helps.

Some trial and error is involved, to an extent that depends on the method used. This is my understanding.
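One way to balance the data before fitting is random undersampling of the customers who did not churn. The Python sketch below assumes one row per customer with an `event` flag (1 = churned, 0 = still active); the field names, the `undersample` helper and the 5:1 ratio are all hypothetical choices for illustration. Note that a model fitted on the reduced data sees a higher churn rate than the real population, so its absolute survival probabilities are likely to be shifted; the ranking of customers is probably more trustworthy than the levels, and evaluation should be done on untouched data.

```python
import random


def undersample(rows, ratio=5, seed=0):
    """Keep every churned row and at most `ratio` non-churned rows per churned row."""
    rng = random.Random(seed)
    events = [r for r in rows if r["event"] == 1]
    censored = [r for r in rows if r["event"] == 0]
    n_keep = min(len(censored), ratio * len(events))
    return events + rng.sample(censored, n_keep)


# Hypothetical data set: 1000 customers, the first 12 of whom churned.
rows = [
    {"id": i, "tenure_days": 30 + i % 400, "event": 1 if i < 12 else 0}
    for i in range(1000)
]

balanced = undersample(rows, ratio=5)
n_events = sum(r["event"] for r in balanced)
print(len(balanced), "rows kept,", n_events, "of them churned")
```

The reduced data set would then be passed to the survival model in place of the full training data.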


This is the confusion matrix of my survival analysis prediction -

I need to somehow increase the true positives (taking churned as the positive class) and decrease the false positives. I tried changing the probability threshold (default = 0.5), but I get a trade-off between the accuracy on actual churners and the overall accuracy.


You have to build features from the data that can help. Try other methods as well, since you will have to balance bias against variance: you have to introduce some bias if you want to reduce variance.

The problem with churn is that it arises from multiple patterns. See whether you can build an ensemble of more than one model, which will also affect accuracy. The trade-off cannot be eliminated unless the model is trained to recognise each and every scenario as a pattern.

Your best bet is to build a table of accuracy versus, say, false positives at different thresholds, and to repeat it while dropping variables, adding features, etc., to see whether you can strike a good balance.

Remember that the difference between churn that follows a pattern and churn that is random is why you are seeing this gap.
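The threshold table can be sketched in a few lines of Python. The probabilities and labels below are invented for illustration, and `sweep` is a hypothetical helper, not part of any library; the same loop could be rerun after each change to the feature set.

```python
def sweep(probs, labels, thresholds):
    """Return one (threshold, accuracy, recall, false_positive_rate) tuple per threshold."""
    table = []
    for t in thresholds:
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < t and y == 0)
        accuracy = (tp + tn) / len(labels)
        recall = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        table.append((t, accuracy, recall, fpr))
    return table


# Hypothetical churn probabilities and outcomes (1 = churned).
probs = [0.02, 0.40, 0.01, 0.03, 0.05, 0.015, 0.008, 0.60, 0.011, 0.002]
labels = [0, 1, 0, 1, 0, 0, 0, 1, 0, 0]

table = sweep(probs, labels, [0.5, 0.1, 0.025, 0.0115])
for t, acc, rec, fpr in table:
    print(f"threshold={t:<7} accuracy={acc:.2f} recall={rec:.2f} false_positive_rate={fpr:.2f}")
```

Reading the rows from the highest threshold to the lowest shows recall on churners rising while the false positive rate rises with it, so the choice of threshold depends on how costly a missed churner is compared with a false alarm.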