Splitting between train/test for customer churn survival models



I am a bit confused about how data should be split between train/test and “live” data when predicting churn with survival models (the package I am playing with is randomForestSRC).

The goal of the model is to predict how long a currently active customer will remain a customer before churning.

The current dataset includes customers who have already churned and customers who are active. Say we have 1000 customers who have already churned and 1500 active customers.

My conundrum is that these 1500 active customers (right censored) would be part of the training data set for a survival model. However, they are also the “live” dataset to which the model should be applied to predict when they might churn. That seems wrong, because I end up using the same data to train and then to predict.

Any thoughts on how data can be split for train/test/“live” predict?
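
To make the setup concrete, here is a minimal sketch of how such a dataset might be laid out and split into train and test by customer. It does not fit any model. The row layout, the random tenures, and the helper names `make_customers` and `stratified_split` are assumptions for illustration only; the 1000/1500 counts are the ones given above.

```python
import random

def make_customers(n_churned=1000, n_active=1500, seed=0):
    """Build hypothetical (customer_id, tenure_months, event) rows.

    event = 1: churn was observed, tenure is the full lifetime.
    event = 0: still active, tenure is right censored as of today.
    """
    rng = random.Random(seed)
    rows = []
    for i in range(n_churned):
        rows.append((f"C{i}", rng.randint(1, 60), 1))
    for i in range(n_active):
        rows.append((f"A{i}", rng.randint(1, 60), 0))
    return rows

def stratified_split(rows, test_fraction=0.2, seed=0):
    """Hold out the same fraction of churned and of active customers."""
    rng = random.Random(seed)
    train, test = [], []
    for flag in (0, 1):
        group = [r for r in rows if r[2] == flag]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

customers = make_customers()
train, test = stratified_split(customers)
print(len(train), len(test))
```

Splitting by customer keeps each customer on one side only, but it does not by itself resolve the concern above: the active customers in `train` are still customers that a live model would later be asked to score.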


Hi @girish

This sounds like a common problem in business. If you use live data to build your model and then validate on that same data, your results will be biased: the predictions will look better than they really are. Once the model is in production you will (hopefully) have about 100 times more live customers than the ones you used to build the model, so the results for the 1/100 used in building the model will be “diluted”. You also already know how the model performs on that group from the train/test results, so you can apply a correction to your live results.
In short, the influence will be small if the live population is much bigger (you should test the significance). Simple arithmetic will tell you this, and you can even add a confidence interval.
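
As a rough illustration of that arithmetic, the sketch below treats the live error as a weighted average over customers the model saw in training and customers it did not. The function name `pooled_error` and the error values 0.10 and 0.20 are hypothetical; it is only meant to show how the optimistic part shrinks as the live population grows.

```python
def pooled_error(n_seen, n_total, err_seen, err_unseen):
    """Average error over a live population in which n_seen customers
    were also used for training (and therefore score optimistically)."""
    n_unseen = n_total - n_seen
    return (n_seen * err_seen + n_unseen * err_unseen) / n_total

# Hypothetical numbers: error 0.10 on customers seen in training,
# 0.20 on customers the model has never seen.
n_seen = 1500
for ratio in (1, 2, 10, 100):
    n_total = n_seen * ratio
    print(ratio, round(pooled_error(n_seen, n_total, 0.10, 0.20), 4))
```

When the live population is the same size as the training set (ratio 1), the pooled figure is just the optimistic one; at ratio 100 it is almost entirely driven by the unseen customers.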
Hope this helps.


@Lesaffrea Thanks for the reply, Alain. The problem I am working on is business-to-business rather than business-to-consumer, so the number of customers is not very large: around 1000, with new customers added at a rate of 10 to 20 per month.

I am assuming your suggestion would work well when there are a large number of customers, but may not work in my case. Did I get that right?


Hi Girish

Yes. If I understand correctly, you want to use all the live customers in your training set. That will cause a problem, not only for accuracy or whatever metric you use, but because some of those customers will churn as well, perhaps very soon. Some may even already be in the churn stage, because there is latency (a delay between when the event happens and when it shows up in the customer's behaviour).

Usually in business we have timestamps for when parameters changed. Do you have access to the records of the customers who churned, from before they churned? In other words, do you have a history for each churned customer? If so, we can work with that history.

Hope this helps.



I will look for the timestamps in my data. That makes sense to me: perhaps I can use data from earlier timestamps, “before they churn”, for testing. Thanks for your replies.
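
A minimal sketch of that idea, assuming each customer has a start date and an optional churn date. The records, the cutoff date, and the helper names `snapshot` and `later_outcomes` are all hypothetical. The data is viewed as it would have looked on a past cutoff date for training, and what happened after the cutoff is kept aside for testing.

```python
from datetime import date

# Hypothetical records: (customer_id, start_date, churn_date or None if active)
records = [
    ("A", date(2015, 1, 10), date(2016, 3, 1)),
    ("B", date(2015, 6, 5), None),
    ("C", date(2016, 2, 20), date(2017, 1, 15)),
    ("D", date(2016, 9, 1), None),
    ("E", date(2017, 4, 1), None),
]

def snapshot(records, cutoff):
    """View the data as it looked on `cutoff`.

    Customers who started after the cutoff are excluded. Anyone who
    churned after the cutoff counts as still active (censored) at it.
    Returns (customer_id, tenure_days, event) rows.
    """
    rows = []
    for cid, start, churn in records:
        if start > cutoff:
            continue
        if churn is not None and churn <= cutoff:
            rows.append((cid, (churn - start).days, 1))
        else:
            rows.append((cid, (cutoff - start).days, 0))
    return rows

def later_outcomes(records, cutoff, today):
    """For customers active at the cutoff, report the days survived
    between cutoff and today and whether churn was seen in that window."""
    out = []
    for cid, start, churn in records:
        if start > cutoff or (churn is not None and churn <= cutoff):
            continue
        if churn is not None and churn <= today:
            out.append((cid, (churn - cutoff).days, 1))
        else:
            out.append((cid, (today - cutoff).days, 0))
    return out

cutoff = date(2016, 12, 31)
today = date(2017, 6, 30)
train_rows = snapshot(records, cutoff)              # fit the model on these
test_rows = later_outcomes(records, cutoff, today)  # compare predictions to these
print(train_rows)
print(test_rows)
```

With this layout, the customers who were active at the cutoff play the role of the “live” set during testing, and their later outcomes are known. Whether to refit on all data afterwards, before scoring the customers who are active today, is a separate choice.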