Loan Prediction, Reveal Your Approach



Thanks Aayush for the guidance. However, my EMI/Total Income ratio is very small. I have P=128, r=9.5/(12*100), n=360, emi=1.076 and emi/totalincome=0.000177. Am I getting it right?


Hi @buvana.sriram,

Have you taken care of scale? By scale I mean:

  • P = 128 may be in thousands or millions.

  • Total income is per year.

The ratio should ideally use the same scale for both parts, like monthly EMI to monthly income. If you have taken care of that, you will be in good shape.
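To make the scale point concrete, here is a minimal sketch of the EMI calculation using the numbers quoted above (P = 128, 9.5% annual rate, n = 360). The factor of 1000 for the loan amount and the income figures in the usage lines are assumptions for illustration only; check the data dictionary for the real units.

```python
def monthly_emi(principal, annual_rate_pct, n_months):
    """Standard amortised EMI: P * r * (1 + r)^n / ((1 + r)^n - 1)."""
    r = annual_rate_pct / (12 * 100)  # monthly rate, as in the post
    growth = (1 + r) ** n_months
    return principal * r * growth / (growth - 1)


def emi_to_income_ratio(emi, income, income_is_yearly=False):
    """Divide a monthly EMI by a monthly income (yearly income is converted first)."""
    monthly_income = income / 12 if income_is_yearly else income
    return emi / monthly_income


# Values from the post: the EMI comes out in the same units as P.
emi_raw = monthly_emi(128, 9.5, 360)  # about 1.076

# ASSUMPTION: the loan amount is recorded in thousands, so rescale it first.
LOAN_UNIT = 1000
emi_scaled = monthly_emi(128 * LOAN_UNIT, 9.5, 360)  # about 1076

# HYPOTHETICAL incomes, only to show that both routes give the same ratio.
ratio_monthly = emi_to_income_ratio(1200, 6000)
ratio_yearly = emi_to_income_ratio(1200, 72000, income_is_yearly=True)
```

With the loan amount rescaled, the ratio grows by the same factor of 1000, which would explain a value as small as 0.000177 if the income column is in plain currency units.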

Aayush Agrawal


Hello everyone. I am new to this prediction game. Could one of you tell me where I can get the dataset for loan prediction, and some sample code if the competition has ended?


Hi @onkarkhaladkar,

The loan prediction problem has been released as a practice problem set. Do check it out! (Click here to go to the loan prediction practice problem). Also, resources are given on the practice page to help beginners understand the “prediction game” :)

Hope it helps!


The loan_amount_term is listed as significant when we calculate the chi-square statistic against the dependent variable. But the degrees of freedom are 9, and SAS gives a warning that 70% of the cells have expected counts less than 5, so chi-square may not be a valid test.

Should we combine the groups with similar bad rates into one and then look at the chi-square statistic again?

Any approaches?
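One way to explore this is sketched below: compute the expected counts, see what share of cells falls under 5, pool term levels with similar bad rates, and recompute. The counts in `table` are made up for illustration (10 term levels by 2 outcomes, so 9 degrees of freedom); they are not taken from the competition data, and the grouping is only one possible choice.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_square(table):
    """Pearson chi-square statistic and its degrees of freedom."""
    exp = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, exp)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def share_sparse_cells(table, threshold=5):
    """Fraction of cells whose expected count is below the threshold."""
    cells = [e for row in expected_counts(table) for e in row]
    return sum(e < threshold for e in cells) / len(cells)


def merge_rows(table, groups):
    """Pool the rows listed in each group (lists of row indices) into one row."""
    n_cols = len(table[0])
    return [[sum(table[i][j] for i in g) for j in range(n_cols)] for g in groups]


# HYPOTHETICAL counts: one row per loan term level, columns = [bad, good].
table = [[1, 1], [1, 2], [2, 2], [1, 3], [2, 4],
         [1, 2], [5, 15], [150, 350], [2, 3], [6, 14]]

# Pool levels with similar bad rates: high (about 0.4-0.5), middle (about 0.3), low (0.25).
groups = [[0, 2, 8], [1, 4, 5, 7, 9], [3, 6]]
merged = merge_rows(table, groups)

sparse_before = share_sparse_cells(table)   # 0.7 for these made-up counts
sparse_after = share_sparse_cells(merged)
stat_before, dof_before = chi_square(table)
stat_after, dof_after = chi_square(merged)
```

After pooling, the degrees of freedom drop from 9 to 2 and most expected counts clear the threshold of 5, so the warning would be far less of a concern.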


Hey everybody,

I reached an accuracy of 0.791667 on the LB.
Feature Engineering:
1) Used mode values for Gender, Self_employed
2) Imputed values based on conditions for the others (like loan amount, loan tenure, etc.)
3) Made two new features:
a) Sum of the applicant's and co-applicant's income
b) EMI = Loan_Amount / Loan_Amount_Tenure (which doesn’t include interest and is an approximation).
(The idea is that people who have high EMIs might find it difficult to pay back the loan.)
I used a neural network with 8 hidden units and sigmoid activation.

Please suggest OR ask anything you like…
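For anyone who wants to try the same features, the mode imputation and the two derived columns described above could be sketched roughly like this. The column names follow the post where it gives them; `ApplicantIncome` and `CoapplicantIncome` are assumed names, and the sample rows are invented purely to exercise the functions.

```python
from collections import Counter


def mode_impute(rows, column):
    """Return copies of the rows with missing values (None) in `column` set to the mode."""
    observed = [r[column] for r in rows if r[column] is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    return [
        {**r, column: most_common if r[column] is None else r[column]}
        for r in rows
    ]


def add_features(row):
    """Add total income and the interest-free EMI approximation from the post."""
    out = dict(row)
    # ASSUMED column names for the two income fields.
    out["Total_Income"] = row["ApplicantIncome"] + row["CoapplicantIncome"]
    # Approximation only: loan amount spread evenly over the term, no interest.
    out["EMI_Approx"] = row["Loan_Amount"] / row["Loan_Amount_Tenure"]
    return out


# HYPOTHETICAL rows, not real competition data.
sample = [
    {"Gender": "Male", "Self_employed": "No", "ApplicantIncome": 5000,
     "CoapplicantIncome": 1500, "Loan_Amount": 128, "Loan_Amount_Tenure": 360},
    {"Gender": None, "Self_employed": "No", "ApplicantIncome": 3000,
     "CoapplicantIncome": 0, "Loan_Amount": 66, "Loan_Amount_Tenure": 360},
    {"Gender": "Male", "Self_employed": None, "ApplicantIncome": 2500,
     "CoapplicantIncome": 2000, "Loan_Amount": 120, "Loan_Amount_Tenure": 360},
]

imputed = mode_impute(mode_impute(sample, "Gender"), "Self_employed")
features = [add_features(r) for r in imputed]
```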


Loan Prediction -3
I tried CART/rpart on the training set, with 2 approaches:
1) Divided the training set into 2 sets (70:30 ratio) and ran the following:

library(rpart)
# classification tree on every column except the first one
rpartav2 = rpart(formula = Loan_Status ~ ., data = trainloanavp1[, -c(1)],
method = "class", minbucket = 25)
Got just one split, on Credit_History < 0.5:
left son=2 (75 obs) right son=3 (354 obs)


Ran K-fold CV for all the variables except Loan_ID; got a cp of 0.36
train(Loan_Status ~ Credit_History + Property_Area + AmtPerMonth, method = "rpart", trControl = numfolds, tuneGrid = cpgrid, data = trainloanavp1)

Ran this on the 30% set I had split from the training set and got 79% accuracy.

Ran this on the test set I downloaded for Loan Prediction Problem -3; got an accuracy of 0.7708.

2) Did some feature engineering and came up with 2 derived variables:
tot_income = Applicant's income + Coapplicant's income

Found AmtPerMonth to be significant by running logistic regression
rpart(formula = Loan_Status ~ Credit_History + Property_Area +
AmtPerMonth, data = trainloanavp1, method = "class", minbucket = 25)

Ran K-fold CV for all the variables except Loan_ID; got a cp of 0.35

Ran this on the 30% set I had split from the training set. Again I got just a single split on the tree and around the same accuracy.

Why am I getting a single split on credit history only? Is that the only significant variable here?
Any insights are welcome in analyzing the data.
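One possible reading of the single split: rpart only attempts a split if it lowers the error, measured relative to the root node, by at least cp (0.01 by default). With a cp around 0.35 only a very strong split can survive. The sketch below shows that rule in simplified form; the misclassification counts are hypothetical, since the post does not report them.

```python
def split_survives(error_before, error_after, root_error, cp):
    """Simplified version of the cp rule: keep a split only if it reduces the
    error by at least cp, with the reduction scaled by the root node's error."""
    relative_gain = (error_before - error_after) / root_error
    return relative_gain >= cp


# HYPOTHETICAL misclassification counts, not taken from the post.
ROOT_ERROR = 130         # errors when predicting the majority class everywhere
AFTER_FIRST_SPLIT = 80   # errors after a strong first split
AFTER_SECOND_SPLIT = 79  # a further split that barely helps

first_kept = split_survives(ROOT_ERROR, AFTER_FIRST_SPLIT, ROOT_ERROR, cp=0.36)
second_kept_tuned = split_survives(AFTER_FIRST_SPLIT, AFTER_SECOND_SPLIT, ROOT_ERROR, cp=0.36)
second_kept_default = split_survives(AFTER_FIRST_SPLIT, AFTER_SECOND_SPLIT, ROOT_ERROR, cp=0.01)
```

In this made-up case the first split is kept and the second is dropped under both cp values. That does not mean the other variables carry no information, only that no further split lowers the misclassification error enough to pass the threshold.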


Hi Rajeev, I would like to know how you filled the NAs in Credit_History.


Hi divye, could you please share your work now? I want to compare and see why my score is low.


My approach so far:
After digging into the data, I concluded that the variables with the most predictive power were Credit History, Property Area and a new feature that I added, Loan_Amount/Loan_Amount_term. I used logistic regression with these variables and came up with a score of 0.784722. I think that the big bet is on how to impute the Credit_History variable. Right now I have all NAs imputed with “1”. If anyone has a good idea, please let us all know! Keep up the good work!
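For reference, the constant imputation mentioned above amounts to something like the sketch below. The helper name and the sample values are hypothetical; the function also reports how many entries were filled, which helps judge how much the imputation choice could matter.

```python
def impute_constant(values, fill=1):
    """Replace missing entries (None) with a constant and report how many were filled."""
    filled = [fill if v is None else v for v in values]
    n_filled = sum(v is None for v in values)
    return filled, n_filled


# HYPOTHETICAL Credit_History column with two missing entries.
credit_history = [1, None, 0, None, 1]
credit_filled, n_missing = impute_constant(credit_history, fill=1)
```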