I am using the CART classification technique after dividing a dataset into train and test sets. I have been using misclassification error, KS (by rank ordering), AUC and Gini as MPMs (model performance measures). The problem I am facing is that the MPM values are quite far apart.
I have tried minsplit values anywhere from 20 to 1400 and minbucket values from 5 to 100, but couldn't get the expected results. I have also tried oversampling/undersampling with the ROSE package, but without any improvement; moreover, the misclassification error increased a lot. The following code gave me the best values, but they were still not good enough.
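For concreteness, here is a minimal base-R sketch of how these four measures can be computed from actual labels and predicted scores. The toy vectors, the function names, the 0.5 cutoff and the choice of Gini as 2 * AUC - 1 are all assumptions made for illustration. KS is taken here as the largest gap between the two classes' score distributions, not the decile-based rank-ordering table.

```r
# Toy data (hypothetical): actual 0/1 labels and predicted probability of class 1
actual <- c(0, 0, 1, 0, 1, 1, 0, 1, 0, 0)
score  <- c(0.10, 0.35, 0.40, 0.20, 0.80, 0.65, 0.55, 0.90, 0.15, 0.30)

# Misclassification error at a fixed cutoff (0.5 is an assumption)
misclass_error <- function(actual, score, cutoff = 0.5) {
  mean(as.integer(score >= cutoff) != actual)
}

# AUC via the rank-sum (Mann-Whitney) identity; tied scores get average ranks
auc <- function(actual, score) {
  r  <- rank(score)
  n1 <- sum(actual == 1)
  n0 <- sum(actual == 0)
  (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# KS: largest gap between the empirical score distributions of the two classes
ks_stat <- function(actual, score) {
  f1 <- ecdf(score[actual == 1])
  f0 <- ecdf(score[actual == 0])
  s  <- sort(unique(score))
  max(abs(f1(s) - f0(s)))
}

# Gini taken here as 2 * AUC - 1 (one common definition, assumed)
gini <- function(actual, score) 2 * auc(actual, score) - 1

c(error = misclass_error(actual, score),
  auc   = auc(actual, score),
  ks    = ks_stat(actual, score),
  gini  = gini(actual, score))
```

Computing all four from the same score vector, once for train and once for test, keeps the numbers directly comparable.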
    # Load the CART implementation
    library(rpart)

    # Reading data
    pdata <- read.csv("PL_XSELL.csv", header = TRUE)

    # Converting ACC_OP_DATE from type factor to date
    pdata$ACC_OP_DATE <- as.Date(pdata$ACC_OP_DATE, format = "%d-%m-%Y")

    # Partitioning the data into training and test datasets
    set.seed(2000)
    n <- nrow(pdata)
    split <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.70, 0.30))
    ptrain <- pdata[split, ]
    ptest <- pdata[!split, ]

    # CART model
    # Taking the minsplit, minbucket values as low as possible, so that pruning
    # can be done later. Higher values didn't allow any scope for pruning
    r.ctrl <- rpart.control(minsplit = 20, minbucket = 5, cp = 0, xval = 10)

    # Calling the rpart function to build the tree (ptrain[, -1] drops the first column)
    cartModel <- rpart(formula = TARGET ~ ., data = ptrain[, -1], method = "class", control = r.ctrl)

    # Pruning the tree
    cartModel <- prune(cartModel, cp = 0.00225)

    # Predicting class and scores (type = "prob" returns one column per class)
    ptrain$predict.class <- predict(cartModel, ptrain, type = "class")
    ptrain$predict.score <- predict(cartModel, ptrain, type = "prob")
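The pruning value cp = 0.00225 in the code above is hard-coded, and only the training rows are scored. As a hedged sketch of one alternative, the cp with the lowest cross-validated error can be read from the fitted tree's `cptable`, and the held-out rows scored with the same pruned tree. The `kyphosis` data that ships with rpart is used purely as a stand-in, since PL_XSELL.csv is not available here. The names `fit`, `best_cp`, `pruned`, `train` and `test` are made up for this example.

```r
library(rpart)

# Stand-in data: kyphosis ships with rpart; it is NOT the PL_XSELL data
set.seed(2000)
idx   <- sample(c(TRUE, FALSE), nrow(kyphosis), replace = TRUE, prob = c(0.70, 0.30))
train <- kyphosis[idx, ]
test  <- kyphosis[!idx, ]

# Grow a deliberately large tree with the same control settings as above
fit <- rpart(Kyphosis ~ ., data = train, method = "class",
             control = rpart.control(minsplit = 20, minbucket = 5, cp = 0, xval = 10))

# cptable columns: CP, nsplit, rel error, xerror, xstd.
# which.min() returns the first minimum, i.e. the smallest tree among ties.
cp_tab  <- fit$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# Score the held-out rows with the same pruned tree
test_class <- predict(pruned, newdata = test, type = "class")
test_score <- predict(pruned, newdata = test, type = "prob")[, "present"]
test_error <- mean(test_class != test$Kyphosis)
```

On the real data the equivalent calls would be `predict(cartModel, ptest, type = "class")` and `predict(cartModel, ptest, type = "prob")`, with the score column selected by the name of the positive TARGET level.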
Results that I got:
Train data: misclassification error 0.103, AUC 0.679, KS 0.259, Gini 0.313
Test data: misclassification error 0.113, AUC 0.664, KS 0.226, Gini 0.307
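A quick consistency check on these figures, assuming Gini is meant as 2 * AUC - 1 (an assumption; how Gini was computed is not stated here). Under that definition the reported Gini values would not follow from the reported AUCs, which may mean a different Gini definition was used.

```r
# Reported values copied from the results above
auc_reported  <- c(train = 0.679, test = 0.664)
gini_reported <- c(train = 0.313, test = 0.307)

# Gini implied by the AUCs under the assumed definition 2 * AUC - 1
gini_from_auc <- round(2 * auc_reported - 1, 3)

gini_from_auc                   # train 0.358, test 0.328
gini_from_auc - gini_reported   # gap between implied and reported Gini
```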
Is this due to the dataset, or am I doing something wrong? I am new to data analytics. This is part of my academic project, so I need to use the CART technique only. I will post separate questions for Random Forest and Neural Networks. Kindly help.