Using Mean to Guess Availability on Boolean Data before munging

missing_values
#1

In the loan data tutorial, the following claim is made: “***We can also look that about 84% applicants have a credit_history. How? The mean of Credit_History field is 0.84 (Remember, Credit_History has value 1 for those who have a credit history and 0 otherwise)***”. To what extent is this correct?

See this: [image]

I think it should be that 77% of applicants have a credit history, if we are counting the rows with missing values too.

Please correct me if I’m wrong.

#2

Hi @sridar,

It is mentioned that the variable takes the values 1 and 0. So, to find the distribution, we simply add up the values, which comes out to 475. That means we have 475 ones, and the rest of the non-missing values are zeros.

Now, to calculate the ratio, you divide the number of ones by the total number of instances. The rows with missing values are not counted, because each of them could be either 1 or 0. If you include them in the total, you are implying that 77% are 1 and 23% are 0, in other words that every missing value is a 0.

I’d say impute the missing values first, and then use the full 614 rows as the total.
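To make the two percentages concrete, here is a minimal sketch in pandas. It does not load the real dataset; it builds a synthetic series with 475 ones and 614 rows in total, as mentioned above. The split of the remaining rows into 89 zeros and 50 missing values is an assumption, chosen to be consistent with the 0.84 mean quoted in the tutorial.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Credit_History column.
# 475 ones and 614 rows come from this thread; the 89 zeros / 50 missing
# split is an assumption consistent with a mean of about 0.84.
s = pd.Series([1.0] * 475 + [0.0] * 89 + [np.nan] * 50, name="Credit_History")

# Series.mean() skips NaN by default, so the denominator is the
# number of non-missing rows only.
mean_non_missing = s.mean()

# Dividing by len(s) counts the missing rows in the denominator,
# which silently treats every missing value as 0.
share_of_all_rows = s.sum() / len(s)

# Fraction of rows where the value is missing.
share_missing = s.isna().mean()

print(round(mean_non_missing, 2))   # about 0.84
print(round(share_of_all_rows, 2))  # about 0.77
print(round(share_missing, 2))
```

So both numbers can be read off the same column: 0.84 is the share among applicants whose credit history is known, and 0.77 is the share of all rows under the assumption that missing means 0.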

#3

Suppose I want to impute the missing values in the LoanAmount column. How can I choose which other variables to use, based on their association with it, and should it be mean, median, or mode imputation?

#4

Hi @bharat79,

These are the columns we have in the dataset:

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Property_Area'], dtype='object')

We can use ‘Education’ and ‘Property_Area’ to get an idea of the income of people with a certain level of education living in a certain kind of area. So you can calculate:

  • the mean for ‘educated’ people in ‘urban’ areas
  • the mean for ‘uneducated’ people in ‘urban’ areas
  • the mean for ‘educated’ people in ‘rural’ areas … and so on.

Similarly, think of more ways of generating such combinations.
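The group-wise idea above could be sketched roughly like this in pandas, applied here to LoanAmount since that is the column asked about. The tiny data frame and its values are made up for illustration, and LoanAmount_imputed is a hypothetical column name; only the original column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; column names match the dataset.
df = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Graduate",
                  "Not Graduate", "Not Graduate", "Not Graduate"],
    "Property_Area": ["Urban", "Urban", "Urban",
                      "Rural", "Rural", "Rural"],
    "LoanAmount": [100.0, 140.0, np.nan, 80.0, np.nan, 60.0],
})

# Mean LoanAmount within each (Education, Property_Area) group,
# broadcast back to the original rows. NaN values are ignored
# when the group mean is computed.
group_mean = (
    df.groupby(["Education", "Property_Area"])["LoanAmount"]
      .transform("mean")
)

# Fill only the missing entries with their group's mean. If a whole
# group were missing, fall back to the overall mean of the column.
df["LoanAmount_imputed"] = (
    df["LoanAmount"].fillna(group_mean).fillna(df["LoanAmount"].mean())
)

print(df)
```

Swapping "mean" for "median" in transform gives median imputation instead, which may be a safer choice if the column is skewed by a few very large loans.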

#5

Hi @AishwaryaSingh,

So, generating such combinations would give me an idea of the income class of the applicants, which in turn would indicate whether they would be able to repay the loan.

#6

@bharat79, I thought you were asking how to impute the missing values in the income column using a combination of other columns. That said, I believe your idea of using these combinations as a new feature should improve the model predictions as well.

#7

I applied such combinations to impute the missing values, @AishwaryaSingh. Thanks for the help.