High survival probabilities for uncensored test observations when using randomForestSRC

r


I am using survival analysis to predict the probability of customer churn, with the randomForestSRC package. My test data contains both censored and uncensored rows. When I apply the model to this test data, the predicted survival probability is very high even for the uncensored observations. I would have expected the survival probability to be fairly low at least for the uncensored observations, given that their event has already occurred, but that doesn't seem to be the case. Am I missing something in how these probabilities should be interpreted?

Here’s the code I use:

##load packages and data
library(randomForestSRC)
library(dplyr) #provides %>% and filter()
data(pbc, package = "randomForestSRC")
pbc.trial <- pbc %>% dplyr::filter(!is.na(treatment))
pbc.test <- pbc %>% dplyr::filter(is.na(treatment))

##build model
rfsrc_pbc <- rfsrc(Surv(days, status) ~ .,
               data = pbc.trial,
               na.action = "na.impute")

##test model - test data contains un-censored data
test.pred.rfsrc <- predict(rfsrc_pbc, 
                       pbc.test,
                       na.action="na.impute")

##check times
test.pred.rfsrc$time.interest

##compare survival probabilities at a single time point
ndf2 <- as.data.frame(test.pred.rfsrc$survival[, 11]) #column 11 = survival probability at 191 days
y2 <- cbind(ndf2, status = pbc.test$status)
mean(dplyr::filter(y2, status == 1)[, 1]) #uncensored: expected this to be low because the event has occurred, but it is 0.95
mean(dplyr::filter(y2, status == 0)[, 1]) #censored: expected this to be high in comparison, and it is 0.98
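
One thing that may matter here: column 11 is a single, fairly early time point (191 days), and an uncensored subject's event may have happened well after it. As a rough sketch of an alternative comparison, the base R snippet below reads each subject's predicted survival at that subject's own observed time instead of at one fixed column. It runs on made-up toy numbers, not on real rfsrc output, and the helper name `surv_at_own_time` is hypothetical.

```r
# Toy stand-in for a predicted survival matrix: rows = subjects, columns = times.
# All values are invented for illustration; they are not rfsrc output.
time_interest <- c(100, 200, 400, 800)
surv <- rbind(
  c(0.99, 0.95, 0.60, 0.20),  # subject with an event at day 400
  c(0.99, 0.98, 0.95, 0.90)   # subject censored at day 800
)
obs_time <- c(400, 800)

# Hypothetical helper: survival probability at each subject's own observed
# time, using the last grid time <= the observed time (step-function lookup).
surv_at_own_time <- function(surv, time_interest, obs_time) {
  idx <- findInterval(obs_time, time_interest)
  picked <- surv[cbind(seq_along(obs_time), pmax(idx, 1))]
  # Before the first grid time, the survival estimate is taken to be 1
  ifelse(idx == 0, 1, picked)
}

surv[, 2]                                        # fixed early column (day 200)
surv_at_own_time(surv, time_interest, obs_time)  # at each subject's own time
```

In this toy data the fixed early column looks similar for both subjects, while the value at each subject's own time separates the event from the censored case. With the real objects, the analogous inputs would presumably be `test.pred.rfsrc$survival`, `test.pred.rfsrc$time.interest`, and `pbc.test$days`.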