I am using survival analysis to predict the probability of customer churn, with the randomForestSRC package. My test data contains both censored and uncensored rows. When I apply the model to this test data, the predicted survival probability is very high even for the uncensored observations. I would have expected the survival probability to be fairly low at least for the uncensored observations, given that their event has already occurred, but that doesn’t seem to be the case. Am I missing something in how these probabilities are interpreted?
Here’s the code I use:
    library(randomForestSRC)
    library(dplyr)

    ## load data
    data(pbc, package = "randomForestSRC")
    pbc.trial <- pbc %>% dplyr::filter(!is.na(treatment))
    pbc.test <- pbc %>% dplyr::filter(is.na(treatment))

    ## build model
    rfsrc_pbc <- rfsrc(Surv(days, status) ~ ., data = pbc.trial, na.action = "na.impute")

    ## test model - test data contains uncensored rows
    test.pred.rfsrc <- predict(rfsrc_pbc, pbc.test, na.action = "na.impute")

    ## check times
    test.pred.rfsrc$time.interest

    ## compare survival probabilities
    ndf2 <- as.data.frame(test.pred.rfsrc$survival[, 11])  # survival probability at 191 days
    y2 <- cbind(ndf2, status = pbc.test$status)
    mean(dplyr::filter(y2, status == 1)[, 1])  # expected this to be low because the event has occurred, but it is 0.95
    mean(dplyr::filter(y2, status == 0)[, 1])  # expected this to be high in comparison; it is 0.98
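
To make explicit what I think I am comparing (this is only a sketch of my understanding, and the notation is my own assumption, not something taken from the package): write $T_i$ for the time to event of test row $i$, $x_i$ for its covariates, and $t_j$ for the $j$-th entry of `time.interest`. As far as I can tell, the value in row $i$, column $j$ of `test.pred.rfsrc$survival` is an estimate of

$$\hat S(t_j \mid x_i) \approx P(T_i > t_j \mid x_i).$$

If that is right, the two means in the code are the averages of $\hat S(t_{11} \mid x_i)$ over the rows with `status == 1` and over the rows with `status == 0`, with $t_{11}$ being the 191-day time point.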