Loan Prediction, Reveal Your Approach



Kindly reveal your approach to the Loan Prediction competition.
@divye_gupta I would love to read your approach.

Leaderboard scoring - clarification needed

Buddy, it got extended till Feb 9. Keep trying!!!


As the deadline has been extended, I recommend you try it yourself first. Once the deadline has passed, I will try to explain my approach to the problem.




I am new to these competitions and I am stuck at the very basic score of 0.77. I tried ensembling, bagging, etc. I used mice to fill in the missing data.
I think my problem is that I lack creativity in feature engineering. Can someone help?

In the case of this loan data, I could not identify any specific patterns. The most significant columns, Credit_History and Property_Area, result in a lot of false positives.
For instance, why would the Loan_Status be N for the following loan IDs: LP001003, LP001038, LP001259?

What is the hypothesis definition for this case study?

Hi @monica_joseph ,

In this competition it's not about tools, it's more about technique. I scored 0.78472 just by assigning Y/N based on my understanding and findings in Excel. Work towards adding new features and think about the business problem. Also, you need to prevent your models from over-fitting, as the data set has only ~700 rows.

Good luck and try hard.



Thank you @aayushmnit,

Your ideas were helpful. I used pivots in Excel and arrived at a vector based on my analysis.
From 0.7777 I improved to 0.78472. Now I am stuck here…

Let me reveal some of my analysis. I hope it is treated as ethical… and anyways I seem to be miles away from the leaderboard:

  1. Added a column that sums the applicant and co-applicant income
  2. Created a ratio column for loan_amount/sum_income
  3. Added a rank vector based on the following findings:
    ++ Being a graduate, not self-employed, married and with fewer dependents
    – Ratio more than 6, being unmarried, having blank gender
    – Being male, income less than 6000 and not a graduate
    – Female and self-employed
    Next… I'll try to simplify my ranking vector. I guess it might be causing some overfitting, because my model does not seem to improve any further.
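
For anyone who wants to try steps 1 and 2 above in code, here is a rough Python sketch. The column names (ApplicantIncome, CoapplicantIncome, LoanAmount) and the helper name are assumptions for illustration, not something stated in this thread, so adjust them to whatever your copy of the data uses.

```python
def add_income_features(row):
    """Return a copy of one applicant record with two extra features.

    Assumed (hypothetical) keys: ApplicantIncome, CoapplicantIncome, LoanAmount.
    """
    total_income = row["ApplicantIncome"] + row["CoapplicantIncome"]
    out = dict(row)  # copy so the original record is left untouched
    out["TotalIncome"] = total_income
    # Guard against a zero total income to avoid dividing by zero.
    out["LoanToIncome"] = (
        row["LoanAmount"] / total_income if total_income > 0 else None
    )
    return out


if __name__ == "__main__":
    sample = {"ApplicantIncome": 4000, "CoapplicantIncome": 2000, "LoanAmount": 120}
    print(add_income_features(sample))
```

The ratio depends on the units of the loan amount and income columns, so a threshold like "more than 6" only makes sense once you have checked those units in your data.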

Please let me know your thoughts…



Let me tell you first: you have done an awesome job here! You might want to include loan tenure in your analysis too, since in the end the applicant pays an EMI every month, not the whole loan amount at once :slight_smile: Try to enjoy the problem-solving phase by thinking about the business context.

Once you are done with that, use these features in more advanced modelling techniques such as decision trees, random forest, XGBoost and GBM, and regularize so that you do not overfit. Also remember that the public LB is a bit deceptive :slightly_smiling: It uses only 50% of the data, so you might overfit to it, but the final ranking is on the remaining 50%.

Good luck!


Can anyone share the data set for this problem with me? I have not registered but would love to try.



I think you can register for the contest and download the dataset from the contest's page, if registration is still open.


I am new to data science. Can you please help me out with the loan prediction problem? I mean, how do you create the rank predictor?

It would be a great help.



Let's focus on finding Golden Features. Understanding the patterns in the data improves accuracy. LB scores fall into place once we understand a pattern and generalize it…


Hi @shan4224 ,
Could you please let me know how (just the method) to identify the Golden Features?

Thank you…



It's a little difficult to name any single method. We can start by exploring descriptive statistics and plots. The idea is to find specific patterns that better explain the target variable. We can try out polynomials, transformations, etc., and then plot the results. A visual inspection can confirm whether a new feature helps.
Hope it helps…
Thanks and Regards,


@aayushminit … Loan tenure might be very significant for prediction purposes, but to calculate with it you require:

  • Loan amount --[Available in data set]
  • EMI --[Monthly data not available]
  • Interest rate --[Not available in data set]
  • Start and end date of the loan --[Not available in data set]

We have only the loan amount available; the rest of the parameters we don't have.

What features did you engineer in the loan prediction data set?

First of all, you have not tagged me properly. Anyway, at the time I answered the question the competition was still running, so I just dropped some hints there and not the whole methodology :slight_smile: Coming to how you can use the loan tenure in the model building process, here's my approach:

  1. Checked that the problem is about home loans; in India the home loan interest rate is generally in the 9-10% range, so I assumed it is 9.5% for all customers
  2. Calculated EMI using the formula EMI = [P x R x (1+R)^N]/[(1+R)^N-1], where P is the principal amount (available), R is the rate of interest (assumed) and N is the loan tenure (available)
  3. Created a ratio of EMI to total income

The idea behind this approach is that the greater the EMI/income ratio, the lower the chances of the person getting a loan approved by the bank. You can then use this new feature in any kind of model you are building.
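
A minimal Python sketch of the three steps above. It makes one assumption the post does not spell out: the 9.5% is treated as an annual rate and divided by 12, with N counted in months, so that R and N use the same period. The function names are hypothetical.

```python
def monthly_emi(principal, annual_rate=0.095, term_months=360):
    """EMI = P * R * (1+R)^N / ((1+R)^N - 1), with R as the monthly rate.

    Assumption: annual_rate is a yearly rate, so R = annual_rate / 12.
    """
    r = annual_rate / 12
    if r == 0:
        # With no interest the formula degenerates to equal instalments.
        return principal / term_months
    growth = (1 + r) ** term_months
    return principal * r * growth / (growth - 1)


def emi_to_income(principal, total_monthly_income,
                  annual_rate=0.095, term_months=360):
    """Ratio of the monthly instalment to the total monthly income."""
    emi = monthly_emi(principal, annual_rate, term_months)
    return emi / total_monthly_income


if __name__ == "__main__":
    # Units of principal and income must match in your data for the
    # ratio to be meaningful; check them before using the feature.
    print(round(monthly_emi(100000, 0.12, 12), 2))
```

Because the same rate is assumed for everyone, the feature mostly captures how loan amount, tenure and income combine; the exact rate matters less than applying it consistently.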

The way I like to solve a problem is to first understand the business we are dealing with and then think of factors without looking at the available dataset. Even if the fields I want are not available, I think of ways of capturing that information with some trade-off in accuracy. Hope you get what I am talking about.

Aayush Agrawal


My approach as of now

  1. Did data exploration using pivots and found that credit history seems to have the most predictive power. I do believe income and loan amount have a role to play as well, but I kept them aside for a while.

  2. Used rpart in R to train my model and applied it to the test data set. My score was around 0.77.

  3. Looked into missing values and found a big chunk of missing values for credit history in the test data set, so I did more analysis to fill them in. This approach gave me a score of 0.7986 (quite a big jump).

  4. Next I will look into the applicant income and loan term variables.

Will keep you posted on my findings… happy predicting!!


Hi, everyone
I want to download the data set for the Loan Prediction problem, but I failed to get it.
Could you send the dataset to my email? Thanks in advance!
My email address: (#changed to @)


Hey @rajeevpareek1
I am new to data analysis.
I am not able to understand what a good way to fill in the missing values is. Credit History had a lot of missing values.
Can you please help me out by telling me your approach for missing values?
Thanks in advance.


@Avior, I haven't had much experience in statistical analysis myself. Having said that, here's one approach, not specific to this example. If I had a continuous numeric variable, I would use the mean of that variable to populate its missing values. Alternatively, if there was a nominal categorical variable like location, I would use the modal value (the most frequently occurring value) to populate its missing values.

More than happy for other participants to add to my knowledge on this.
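
To make the mean/mode idea concrete, here is a small Python sketch using only the standard library. Missing values are assumed to be represented as None, and the function name and strategy labels are hypothetical choices for illustration.

```python
from statistics import mean, mode


def impute(values, strategy):
    """Fill None entries with the mean (numeric) or mode (categorical).

    Assumption: missing values are encoded as None, and at least one
    value in the column is present.
    """
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "mode":
        fill = mode(present)  # most frequently occurring value
    else:
        raise ValueError("strategy must be 'mean' or 'mode'")
    return [fill if v is None else v for v in values]


if __name__ == "__main__":
    print(impute([1.0, None, 3.0], "mean"))
    print(impute(["Urban", None, "Urban", "Rural"], "mode"))
```

This is the simplest possible treatment; it ignores relationships between columns, which is why some posters in this thread did further analysis before filling in credit history.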




@Avior - I am also new to R and data science, and I found the AV blog on missing value imputation and outlier detection really helpful. I am not sure if you are working in a specific language (R or Python), but you can also google the R-bloggers article on missing value treatment. It's really good.