Thanks Ayush for the guidance. However, my EMI/Total Income ratio is very small. I have P=128, r=9.5/(12*100), n=360, emi=1.076 and emi/totalincome=0.000177. Am I getting it right?
Hi @buvana.sriram,
Have you taken care of scale? By scale I mean:

P = 128 can be in thousands or millions.

Total income is per year.
The ratio should ideally be on the same scale, like monthly EMI to monthly income. If you have taken care of it, you will be in good shape.
Regards,
Aayush Agrawal
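To make the scale point concrete, here is a minimal Python sketch of the standard EMI formula using the numbers quoted above (P = 128, annual rate 9.5%, n = 360). The `amount_unit` and `income_is_annual` parameters are hypothetical knobs added for illustration, and the income value of 6000 is made up; which units the dataset actually uses is an assumption to check against the data description.

```python
def monthly_emi(principal, annual_rate_pct, n_months):
    """Standard EMI: P * r * (1 + r)^n / ((1 + r)^n - 1), with r the monthly rate."""
    r = annual_rate_pct / (12 * 100)
    if r == 0:
        return principal / n_months  # no interest: plain instalment
    growth = (1 + r) ** n_months
    return principal * r * growth / (growth - 1)


def emi_to_income(loan_amount, income, annual_rate_pct, n_months,
                  amount_unit=1.0, income_is_annual=False):
    """Ratio of monthly EMI to monthly income, after putting both on one scale."""
    emi = monthly_emi(loan_amount * amount_unit, annual_rate_pct, n_months)
    monthly_income = income / 12 if income_is_annual else income
    return emi / monthly_income


# Numbers from the post: this comes out at roughly 1.076.
emi_raw = monthly_emi(128, 9.5, 360)

# Hypothetical income of 6000, only to show how the unit changes the ratio.
ratio_unscaled = emi_to_income(128, 6000, 9.5, 360)
ratio_scaled = emi_to_income(128, 6000, 9.5, 360, amount_unit=1000)
```

If the loan amount really is recorded in thousands, the scaled ratio is 1000 times the unscaled one, which would explain a ratio that looks far too small.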
Hello everyone. I am new to this prediction game. Could one of you tell me where I can get the dataset for loan prediction, and some sample code if the competition has ended?
Hi @onkarkhaladkar,
The loan prediction problem has been released as a practice problem set. Do check it out! (Click here to go to the loan prediction practice problem). Also, resources are given on the practice page to help beginners understand the "prediction game".
Hope it helps!
The loan_amount_term is listed as significant when we calculate the chi-square statistic against the dependent variable. But the degrees of freedom are 9, and SAS gives a warning that 70% of the cells have expected counts less than 5, so chi-square may not be a valid test.
Should we try to club the groups with similar bad rates into one and then look at the chi-square statistic again?
Any approaches?
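One way to try the grouping idea is to merge sparse levels and recheck the expected counts before trusting the statistic. The sketch below is plain Python under stated assumptions: the counts in `table` are invented (10 term levels by 2 outcomes, so 9 degrees of freedom), and the grouping in `groups` is a hypothetical choice, not a recommendation. It computes the statistic and the share of sparse cells only, not a p-value.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a 2-D table."""
    expected = expected_counts(table)
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def sparse_fraction(table, threshold=5):
    """Share of cells whose expected count is below the threshold."""
    cells = [e for row in expected_counts(table) for e in row]
    return sum(e < threshold for e in cells) / len(cells)


def merge_rows(table, groups):
    """Collapse rows: each group is a list of row indices to add together."""
    n_cols = len(table[0])
    return [[sum(table[i][j] for i in group) for j in range(n_cols)]
            for group in groups]


# Invented counts, one row per term level, columns = [rejected, approved].
table = [[0, 1], [2, 0], [0, 2], [1, 3], [0, 3],
         [15, 29], [1, 3], [5, 8], [153, 359], [9, 6]]
# Hypothetical grouping: short terms, medium terms, then two levels kept alone.
groups = [[0, 1, 2, 3, 4], [5, 6, 7], [8], [9]]
merged = merge_rows(table, groups)

stat_before, dof_before = chi_square(table)
stat_after, dof_after = chi_square(merged)
sparse_before = sparse_fraction(table)
sparse_after = sparse_fraction(merged)
```

Comparing `sparse_before` with `sparse_after` shows whether the merge actually removes the small-expected-count problem; if sparse cells remain, an exact test may be a safer option than further merging.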
Hey everybody,
I reached an accuracy of 0.791667 on the leaderboard (LB).
Feature Engineering:
1) Used mode values for Gender and Self_employed
2) Imputed values based on conditions for the others (like loan amount, loan tenure, etc.)
3) Made two new features:
a) Sum of applicant's and coapplicant's income
b) EMI = Loan_Amount / Loan_Amount_Tenure (which doesn't include interest and is an approximation).
(The idea is that people who have high EMIs might find it difficult to pay back the loan.)
I used a neural network with 8 hidden units and sigmoid activation.
Please suggest or ask anything you like...
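The feature steps listed above can be sketched in a few lines of plain Python. The rows are made up, and the column names (`ApplicantIncome`, `CoapplicantIncome`, `LoanAmount`, `Loan_Amount_Term`) are assumed to match the practice dataset; the helper `fill_with_mode` is hypothetical. The conditional imputation from step 2 is not shown because the post does not describe its conditions.

```python
from statistics import mode

# Made-up rows; None marks a missing value.
rows = [
    {"Gender": "Male", "Self_Employed": "No", "ApplicantIncome": 5000,
     "CoapplicantIncome": 1500, "LoanAmount": 128, "Loan_Amount_Term": 360},
    {"Gender": None, "Self_Employed": "No", "ApplicantIncome": 3000,
     "CoapplicantIncome": 0, "LoanAmount": 66, "Loan_Amount_Term": 360},
    {"Gender": "Female", "Self_Employed": None, "ApplicantIncome": 4200,
     "CoapplicantIncome": 2000, "LoanAmount": 120, "Loan_Amount_Term": 180},
    {"Gender": "Male", "Self_Employed": "Yes", "ApplicantIncome": 6000,
     "CoapplicantIncome": 0, "LoanAmount": 141, "Loan_Amount_Term": 360},
]


def fill_with_mode(rows, column):
    """Replace missing values in one column with its most frequent value."""
    observed = [r[column] for r in rows if r[column] is not None]
    most_common = mode(observed)
    for r in rows:
        if r[column] is None:
            r[column] = most_common


# Step 1: mode imputation for the categorical columns.
for column in ("Gender", "Self_Employed"):
    fill_with_mode(rows, column)

# Step 3: the two derived features.
for r in rows:
    r["Total_Income"] = r["ApplicantIncome"] + r["CoapplicantIncome"]
    # Interest-free approximation of the monthly instalment, as in the post.
    r["EMI"] = r["LoanAmount"] / r["Loan_Amount_Term"]
```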
I tried CART/rpart on the training set, with 2 approaches:
1) Divided the training set into 2 sets (70:30 ratio) and ran the following:
rpartav2 = rpart(formula = Loan_Status ~ ., data = trainloanavp1[, -c(1)],
method = "class", minbucket = 25)
Got just one split, on Credit_History < 0.5:
left son=2 (75 obs) right son=3 (354 obs)
prp(rpartav2)
Ran k-fold CV for all the variables except loan_id; got a cp of 0.36:
train(Loan_Status ~ Credit_History + Property_Area + AmtPerMonth, method = "rpart", trControl = numfolds, tuneGrid = cpgrid, data = trainloanavp1)
Ran this on the 30% set I had split from the training set and got 79% accuracy.
Ran this on the test set I downloaded for Loan Prediction Problem 3; got an accuracy of 0.7708.
2) Did some feature engineering and came up with 2 derived variables:
tot_income = Applicant's income + Coapplicant's income
AmtPerMonth = Loan_Amount / Loan_Amount_Term
Found AmtPerMonth to be significant by running logistic regression.
rpart(formula = Loan_Status ~ Credit_History + Property_Area +
AmtPerMonth, data = trainloanavp1, method = "class", minbucket = 25)
Ran k-fold CV for all the variables except loan_id; got a cp of 0.35.
Ran this on the 30% set I had split from the training set. Again I got just a single split on the tree and around the same accuracy.
Why am I getting a single split on credit history only? Is that the only significant variable here?
Any insights are welcome in analyzing the data.
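One way to look at the single-split question is to compare how much impurity each candidate split removes. The Python sketch below computes the Gini decrease for two splits. The node sizes 75 and 354 come from the rpart output quoted earlier in this post, but the class counts inside each node are invented for illustration, and the second split is entirely hypothetical. rpart also applies its `cp` and `minbucket` settings when deciding whether to keep a split, which this sketch does not model.

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_decrease(parent, children):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - weighted


# Class counts are [rejected, approved]; all of them are invented.
root = [133, 296]                          # 429 observations in total
by_credit_history = [[68, 7], [65, 289]]   # node sizes 75 and 354

# Hypothetical further split of the right-hand node on some other variable,
# where both children end up with almost the same class mix.
right_node = [65, 289]
by_other_variable = [[30, 130], [35, 159]]

gain_credit = gini_decrease(root, by_credit_history)
# Measured within the right-hand node, not relative to the root.
gain_other = gini_decrease(right_node, by_other_variable)
```

With numbers like these the first split removes a large share of the impurity while the second removes almost none, which is the kind of pattern that would leave a tree with a single split. Whether the real data behaves this way has to be checked on the actual counts.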
Hi Divye, could you please share your work now? I want to compare and see why my score is low.
My approach so far:
After digging into the data, I concluded that the variables with the most predictive power were Credit_History, Property_Area and a new feature that I added, Loan_Amount/Loan_Amount_Term. I used logistic regression with these variables and came up with a score of 0.784722. I think that the big bet is on how to impute the Credit_History variable. Right now I have all NAs imputed with "1". If anyone has a good idea, pls let us all know! Keep up the good work!