I am a bit confused about how data should be split between train/test and “live” data when predicting churn with survival models (the package I am playing with is randomForestSRC).
The goal of the model is to predict how long a currently active customer will remain a customer before churning.
The current dataset includes customers who have already churned and customers who are active. Say we have 1000 customers who have already churned and 1500 active customers.
My conundrum is that these 1500 active customers (right-censored) would be part of the training dataset for a survival model. However, the same customers also make up the “live” dataset to which the model should be applied to predict when they might churn. That seems wrong, because I end up using the same data to train and then to predict.
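To make the overlap concrete, here is a minimal sketch of the layout I have in mind. It is plain Python rather than randomForestSRC code, and the column names (`customer_id`, `tenure_months`, `event`) and the random tenures are made up purely for illustration; only the 1000/1500 counts come from my actual situation.

```python
import random

random.seed(0)

# Hypothetical layout: one row per customer, with tenure so far and an
# event flag (1 = already churned, 0 = still active, i.e. right-censored).
customers = []
for i in range(2500):
    churned = i < 1000  # first 1000 rows stand in for the churned customers
    customers.append({
        "customer_id": i,
        "tenure_months": random.randint(1, 60),  # made-up values
        "event": 1 if churned else 0,
    })

# What I would feed to the survival model: everyone, censored rows included.
train = customers

# What I want predictions for: the customers who are still active.
live = [c for c in customers if c["event"] == 0]

# Every "live" customer is also a training row.
train_ids = {c["customer_id"] for c in train}
overlap = [c for c in live if c["customer_id"] in train_ids]

print(len(train), len(live), len(overlap))  # 2500 1500 1500
```

So in this setup the “live” set is exactly the censored subset of the training set, which is the overlap I am unsure about.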
Any thoughts on how the data can be split for train/test/“live” prediction?