Logistic Regression From Scratch for Loan Prediction

python

#1

Hi, friends,

I need urgent help implementing a logistic regression algorithm for the loan prediction dataset. The approach in the provided link just calls the pre-built model and passes the data through it for the result, but I want to code the logistic regression algorithm myself instead of using the pre-built model. Please help me out; I'll be very thankful to you all.


#2

Hello @aaron11
You could implement Logistic Regression like this:

# Imports
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

logmodel = LogisticRegression()              # Logistic model declaration
logmodel.fit(X_train, y_train)               # Training the model on X_train
predictions = logmodel.predict(X_test)       # Predicting on X_test
print(confusion_matrix(y_test, predictions)) # Scores and reports
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions) * 100)
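
Note that this snippet assumes X_train, X_test, y_train and y_test already exist from an earlier train/test split of the loan data. If they don't, something along these lines would produce them (X as your feature frame and y as Loan_Status are assumptions, and the 80/20 split is just an example):

from sklearn.model_selection import train_test_split

# X: feature DataFrame, y: Loan_Status labels (assumed to be prepared already)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)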

#3

Hi,

Thank you for the reply, but I need logistic regression implemented without a library. I have found some code; will you please look into it and help me adapt it to the loan prediction problem?


#4

Hi @aaron11,

I recommend you understand the machine learning process of training and testing before jumping straight into a from-scratch implementation. Try to understand the following (a small sketch tying these ideas together is included at the end of this post):

  1. What training and the log-likelihood mean in machine learning
  2. Gradients, the learning rate, and how the weights move from their initial to their final state
  3. How prediction works

The link you sent has complete from-scratch implementations of the concepts above. If you understand them well, you will have no trouble implementing the same thing for your problem.

The exact code changes for your implementation should be made at cells 43 and 157 in the above notebook. I also highly recommend that you write this in a separate file and work out what each function does, so that you can compare your algorithm's results with scikit-learn's implementation. Let us know once you have done it. Cheers!
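
For reference, here is a minimal from-scratch sketch (my own illustration, not the notebook's code) that ties these ideas together. It assumes a numeric feature matrix X and 0/1 labels y; the learning rate and iteration count are just placeholder values.

import numpy as np

def sigmoid(z):
    # Map raw scores to probabilities in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.01, n_iters=1000):
    # Gradient descent on the negative log-likelihood (cross-entropy loss)
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0.0
    for _ in range(n_iters):
        probs = sigmoid(X @ weights + bias)   # current predicted probabilities
        error = probs - y                     # derivative of the loss w.r.t. the scores
        weights -= lr * (X.T @ error) / n_samples
        bias -= lr * error.mean()
    return weights, bias

def predict(X, weights, bias, threshold=0.5):
    # Predict class 1 whenever the probability crosses the threshold
    return (sigmoid(X @ weights + bias) >= threshold).astype(int)

The weight update is exactly point 2 above: the gradient, scaled by the learning rate, moves the weights from their initial zeros towards values that maximise the log-likelihood.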


#5

Please check it and guide me.


#6

@AishwaryaSingh Will you please look into it and help me out? I'll be very thankful to you.


#7

Hey @aaron11,

Can you share the notebook? I’ll DM you the mail ID.


#8

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold  # For k-fold cross-validation
from sklearn import metrics

def classification_model(model, data, testdf, predictors, outcome):
    # Fit the model on the full training data:
    model.fit(data[predictors], data[outcome])

    # Make predictions on the training set:
    predictions = model.predict(data[predictors])

    # Training accuracy
    accuracy = metrics.accuracy_score(data[outcome], predictions)
    # print("Accuracy : {0:.3%}".format(accuracy))

    # Perform k-fold cross-validation with 5 folds
    kf = KFold(n_splits=5)
    error = []
    for train, test in kf.split(data):
        # Filter training data
        train_predictors = data[predictors].iloc[train, :]

        # The target we're using to train the algorithm.
        train_target = data[outcome].iloc[train]

        # Training the algorithm using the predictors and target.
        model.fit(train_predictors, train_target)

        # Record the score from each cross-validation run
        error.append(model.score(data[predictors].iloc[test, :], data[outcome].iloc[test]))

    print("Cross-Validation Score : {0:.3%}".format(np.mean(error)))

    # Fit the model again so that it can be referred to outside the function:
    model.fit(data[predictors], data[outcome])
    return model.predict(testdf[predictors])

# Read the train file
df = pd.read_csv('train_u6lujuX_CVtuZ9i.csv')

# Read the test file
testdf = pd.read_csv('test_Y3wMUE5_7gLdaTN.csv')

# Impute missing values in the train file (mean for numeric columns, typical values otherwise)
df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
df['Self_Employed'].fillna('No', inplace=True)
df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mean(), inplace=True)
df['Credit_History'].fillna(1, inplace=True)
df['Married'].fillna('Yes', inplace=True)
df['Gender'].fillna('Male', inplace=True)
df['Dependents'].fillna('0', inplace=True)

# Engineer total-income features
df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df['TotalIncome_log'] = np.log(df['TotalIncome'])

# Impute missing values in the test file (using the train-file statistics for the numeric columns)
testdf['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
testdf['Self_Employed'].fillna('No', inplace=True)
testdf['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mean(), inplace=True)
testdf['Credit_History'].fillna(1, inplace=True)
testdf['Married'].fillna('Yes', inplace=True)
testdf['Gender'].fillna('Male', inplace=True)
testdf['Dependents'].fillna('0', inplace=True)

# Total-income features for the test file (built from the test columns, not the train ones)
testdf['TotalIncome'] = testdf['ApplicantIncome'] + testdf['CoapplicantIncome']
testdf['TotalIncome_log'] = np.log(testdf['TotalIncome'])

#print(df.apply(lambda x: sum(x.isnull()), axis=0))

# Use LabelEncoder to convert all categorical values into numeric codes
var_mod = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i])
    if i != 'Loan_Status':
        testdf[i] = le.fit_transform(testdf[i])

df['CoapplicantIncome'] = df['CoapplicantIncome'].astype(np.int64)
df['LoanAmount'] = df['LoanAmount'].astype(np.int64)
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].astype(np.int64)
df['Credit_History'] = df['Credit_History'].astype(np.int64)
df['TotalIncome'] = df['TotalIncome'].astype(np.int64)
df['TotalIncome_log'] = df['TotalIncome_log'].astype(np.int64)
df['LoanAmount_log'] = np.log(df['LoanAmount'])

testdf['CoapplicantIncome'] = testdf['CoapplicantIncome'].astype(np.int64)
testdf['LoanAmount'] = testdf['LoanAmount'].astype(np.int64)
testdf['Loan_Amount_Term'] = testdf['Loan_Amount_Term'].astype(np.int64)
testdf['Credit_History'] = testdf['Credit_History'].astype(np.int64)
testdf['TotalIncome'] = testdf['TotalIncome'].astype(np.int64)
testdf['TotalIncome_log'] = testdf['TotalIncome_log'].astype(np.int64)
testdf['LoanAmount_log'] = np.log(testdf['LoanAmount'])

# print(df.dtypes)
outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History', 'Education', 'Married', 'Self_Employed', 'Property_Area']
p = classification_model(model, df, testdf, predictor_var, outcome_var)
testdf['Loan_Status'] = p

# Keep only the columns needed for the submission file
df_final = testdf.drop(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome',
                        'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History',
                        'Property_Area', 'TotalIncome', 'TotalIncome_log', 'LoanAmount_log'], axis=1)

df_final['Loan_Status'] = df_final['Loan_Status'].map({0: 'N', 1: 'Y'})
df_final.to_csv('sample_submission.csv', index=False)
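
Finally, to address the original from-scratch requirement: one option (a minimal sketch, assuming the chosen predictors are the small integer codes produced by LabelEncoder, so no feature scaling is added) is to swap the scikit-learn model above for a hand-rolled estimator exposing the same fit, predict and score methods that classification_model relies on. The class name and hyperparameters below are placeholders:

class ScratchLogisticRegression:
    # Plain-NumPy logistic regression trained with batch gradient descent
    def __init__(self, lr=0.01, n_iters=5000):
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = 0.0

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        for _ in range(self.n_iters):
            probs = 1.0 / (1.0 + np.exp(-(X @ self.weights + self.bias)))
            error = probs - y                                 # gradient of the cross-entropy loss
            self.weights -= self.lr * (X.T @ error) / n_samples
            self.bias -= self.lr * error.mean()
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        probs = 1.0 / (1.0 + np.exp(-(X @ self.weights + self.bias)))
        return (probs >= 0.5).astype(int)

    def score(self, X, y):
        # Accuracy, which is what classification_model expects from model.score
        return float(np.mean(self.predict(X) == np.asarray(y)))

# model = ScratchLogisticRegression()   # drop-in replacement for LogisticRegression() above

With that change, the rest of the script (cross-validation, prediction on testdf and the submission file) runs unchanged, and you can compare its accuracy against scikit-learn's LogisticRegression as suggested earlier in the thread.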