Loan Prediction 3 - Questions on applying the model




I am new to the data science world. I am trying to do the Loan Prediction 3 challenge here.

I have successfully cleaned the data set, done some feature engineering, and come up with a model using sklearn in Python (a random forest).

However, I am confused now:

1) How do I run the model on the training data? I added some new features to the training data and cleaned it by removing NaN values. Do I have to clean the test data as well?
2) Should I use cross-validation in this case?

Thanks a lot in advance


You have to train the model on the training data set using the fit method in sklearn after initializing the parameters. You can create a simple random forest using the following code:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf = clf.fit(X, Y)  # X: training features, Y: training labels (Loan_Status)

This will create an RF model using the default parameters in sklearn. That raises the question of how to choose the best parameters for the RF model. Go through this article for parameter tuning in RF.

As for cleaning the test data, it depends on what kind of cleaning you performed on the training data. If you simply deleted the observations with NaN from the training data and then fitted your model on it, you cannot do the same with the test data set, because the problem statement requires a prediction for every test observation.
On the other hand, if you created a new feature or transformed an existing one, you have to apply the same operations to the test data set, because the model you fitted on the training data relies on the information in the new or transformed feature.

Cross-validation is a technique for estimating how well a model generalizes, which helps you choose the best parameters and avoid overfitting. sklearn provides k-fold cross-validation, as well as grid search and random search, which use cross-validation to tune parameters. Check out the article given below for exploring cross-validation.
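To make the cross-validation advice concrete, here is a minimal sketch using sklearn's `cross_val_score` and `GridSearchCV`. The data is a synthetic stand-in generated with `make_classification`, not the loan data, and the parameter grid values are arbitrary assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the cleaned loan training data (an assumption;
# replace X and y with your own feature matrix and Loan_Status labels).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 5-fold cross-validation: one accuracy score per held-out fold.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print("fold accuracies:", scores.round(3))
print("mean accuracy:", round(scores.mean(), 3))

# Grid search repeats the same cross-validation for every combination
# in the grid and keeps the combination with the best mean score.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best mean CV accuracy:", round(search.best_score_, 3))
```

The mean of the fold scores is usually a steadier estimate than a single train/validation split, and `search.best_estimator_` is refitted on all of the training data, so it can be used directly to predict on the test set.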

That's all from my side. Hope this helps.


Thanks a lot for the help! In cleaning, I replaced the missing values with the median and also did some one-hot encoding. Will I have to do the same to my test data as well?


Yes, you have to do the same with the test data.
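As a rough illustration of applying the same cleaning to both sets, the sketch below imputes with the median learned from the training data and aligns the one-hot columns of the test set with those of the training set. The tiny data frames are made up, and `LoanAmount` is an assumed column name; only `Property_Area` appears in this thread.

```python
import pandas as pd

# Tiny made-up frames; the values are invented for illustration only.
train = pd.DataFrame({
    "LoanAmount": [100.0, None, 150.0, 200.0],
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
})
test = pd.DataFrame({
    "LoanAmount": [None, 120.0],
    "Property_Area": ["Rural", "Urban"],
})

# 1) Learn the median from the training data only, then reuse it for the
#    test data, so no test row has to be dropped.
loan_median = train["LoanAmount"].median()
train["LoanAmount"] = train["LoanAmount"].fillna(loan_median)
test["LoanAmount"] = test["LoanAmount"].fillna(loan_median)

# 2) One-hot encode both sets, then align the test columns to the training
#    columns; categories missing from the test set become all-zero columns.
train_X = pd.get_dummies(train, columns=["Property_Area"], dtype=int)
test_X = pd.get_dummies(test, columns=["Property_Area"], dtype=int)
test_X = test_X.reindex(columns=train_X.columns, fill_value=0)

print(train_X)
print(test_X)
```

Aligning the columns matters because the fitted model expects the test features in the same order and number as the training features.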


I tried CART/rpart on the training set, using 2 approaches:
1) Divided the training set into 2 sets (70:30 ratio) and ran the following:

rpartav2 = rpart(formula = Loan_Status ~ ., data = trainloanavp1[, -c(1)],
                 method = "class", minbucket = 25)
Got just one split: credit history < 0.5
left son=2 (75 obs) right son=3 (354 obs)


Ran K-fold CV for all the variables except loan_id; got a cp of 0.36
train(Loan_Status ~ Credit_History + Property_Area + AmtPerMonth, method = "rpart", trControl = numfolds, tuneGrid = cpgrid, data = trainloanavp1)

Ran this on the 30% set I had split from the training set and got 79% accuracy.

Ran this on the test set I downloaded for Loan Prediction Problem 3; got an accuracy of 0.7708.

2) Did some feature engineering and came up with 2 derived variables
tot_income = Applicant's income + Coapplicant's income

Found AmtPerMonth to be significant by running logistic regression
rpart(formula = Loan_Status ~ Credit_History + Property_Area +
      AmtPerMonth, data = trainloanavp1, method = "class", minbucket = 25)

Ran K-fold CV for all the variables except loan_id; got a cp of 0.35

Ran this on the 30% set I had split from the training set. Again I got just a single split on the tree and around the same accuracy.

Why am I getting a single split on credit history only? Is that the only significant variable here?
Any insights are welcome.


Hi all,
I've just joined and was wondering how to get the loans dataset, because we are really struggling to find it!
What I'm suggesting is that anyone who already has the dataset share it with the rest of us (new joiners) so that we are able to work on it. It would be much appreciated, guys!

Thanks a lot for your help


You can download the dataset from the contest page only.


It seems it doesn't work at all again!

Would you please send the dataset to my private email? It would be much appreciated!

Thanks a lot


It's working. I tried just now.