I tried CART (rpart) on the training set and took two approaches:
1) Divided the training set into two subsets (70:30 ratio; a sketch of this split follows the tree plot below) and ran the following:
rpartav2 = rpart(formula = Loan_Status ~ ., data = trainloanavp1[, -c(1)],
                 method = "class", minbucket = 25)
Got just one split, on Credit_History < 0.5:
left son = 2 (75 obs), right son = 3 (354 obs)
prp(rpartav2)
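For reference, the 70:30 split that produced trainloanavp1 (and the 30% holdout used later) would look roughly like this. This is only a sketch: it assumes the caTools package, and trainloan / validloanavp1 are my own placeholder names for the full training file and the holdout.

# Sketch of the 70:30 split (trainloan = full training file, validloanavp1 = 30% holdout)
library(caTools)
set.seed(123)                                        # for a reproducible split
split <- sample.split(trainloan$Loan_Status, SplitRatio = 0.70)
trainloanavp1 <- subset(trainloan, split == TRUE)    # 70% used to fit the tree
validloanavp1 <- subset(trainloan, split == FALSE)   # 30% held out for accuracy checks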
Ran k-fold CV for all the variables except Loan_ID and got a cp of 0.36:
train(Loan_Status ~ Credit_History + Property_Area + AmtPerMonth, method = "rpart",
      trControl = numfolds, tuneGrid = cpgrid, data = trainloanavp1)
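(numfolds and cpgrid aren't shown above; they would be set up along these lines. This is a sketch assuming the caret package, and the 10 folds and cp range are illustrative choices, not values from the original run.)

# Sketch of the CV control and cp grid used by train() (assumes caret; Loan_Status is a factor)
library(caret)
library(rpart)
numfolds <- trainControl(method = "cv", number = 10)        # 10-fold cross-validation
cpgrid   <- expand.grid(cp = seq(0.01, 0.50, by = 0.01))    # candidate complexity parameters
cvmodel  <- train(Loan_Status ~ Credit_History + Property_Area + AmtPerMonth,
                  method = "rpart", trControl = numfolds, tuneGrid = cpgrid,
                  data = trainloanavp1)
cvmodel$bestTune                                            # the cp picked by CV (0.36 above)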
Ran this on the 30% set I had split from the training set and got 79% accuracy.
Ran this on the test set I downloaded for Loan Prediction Problem - 3; got an accuracy of 0.7708.
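(The holdout and test-set scoring would look roughly like this. Again a sketch: validloanavp1, cvmodel and testloan are my own placeholder names, and the submission column names follow the competition's sample file.)

# Sketch of scoring the 30% holdout and preparing a test-set submission
predvalid <- predict(cvmodel, newdata = validloanavp1)      # class predictions on the holdout
confmat   <- table(validloanavp1$Loan_Status, predvalid)    # confusion matrix
sum(diag(confmat)) / sum(confmat)                           # overall accuracy (~0.79 above)

predtest <- predict(cvmodel, newdata = testloan)            # downloaded test set has no Loan_Status
write.csv(data.frame(Loan_ID = testloan$Loan_ID, Loan_Status = predtest),
          "submission.csv", row.names = FALSE)              # submit to get the leaderboard accuracy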
2) Did some feature engineering and came up with two derived variables:
tot_income = applicant's income + coapplicant's income
AmtPerMonth = Loan_Amount / Loan_Amount_Term
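(In R the two derived variables would be created along these lines; a sketch assuming the standard column names from the competition file, i.e. ApplicantIncome, CoapplicantIncome, LoanAmount and Loan_Amount_Term.)

# Sketch of the two derived variables (column names assumed from the competition data)
trainloanavp1$tot_income  <- trainloanavp1$ApplicantIncome + trainloanavp1$CoapplicantIncome
trainloanavp1$AmtPerMonth <- trainloanavp1$LoanAmount / trainloanavp1$Loan_Amount_Term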
Found AmtPerMonth to be significant by running a logistic regression.
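(That check would look roughly like this; a sketch, since the exact logistic-regression formula isn't shown in the post.)

# Sketch of the logistic-regression check on AmtPerMonth
# (this predictor set is an assumption, not necessarily the exact model used)
logmodel <- glm(Loan_Status ~ Credit_History + Property_Area + AmtPerMonth + tot_income,
                data = trainloanavp1, family = binomial)
summary(logmodel)   # check the z-value / p-value on AmtPerMonth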
rpart(formula = Loan_Status ~ Credit_History + Property_Area + AmtPerMonth,
      data = trainloanavp1, method = "class", minbucket = 25)
Ran k-fold CV for all the variables except Loan_ID and got a cp of 0.35.
Ran this on the 30% set I had split from the training set. Again I got just a single split on the tree and around the same accuracy.
Why am I getting a single split on Credit_History only? Is that the only significant variable here?
Any insights are welcome.