Creating a confusion matrix on the Loan Prediction dataset

confusion_matrix
python

#1

I want to make a confusion matrix from the Loan Prediction dataset. Will somebody out there help me? I am referring to this topic.


#2

Import confusion_matrix from sklearn, pass the actual values and the predictions as arguments to the
confusion_matrix function, and then print the result.

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(actual_values, predicted_values)
cm


#3
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(data[outcome], predictions)
print(cm)

#4


#5

Hi @aaron11,

The code provided by @A.Malathi is exactly what you need! In your code, you are using Credit_History instead of the actual values. The confusion matrix is a table that describes the performance of a classification model: you have to pass the true values and the predicted values to print it.
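For reference, here is a minimal, self-contained sketch (with made-up actual and predicted labels, not the Loan Prediction data) showing how the matrix is laid out:

    from sklearn.metrics import confusion_matrix

    # Made-up labels: 0 = loan not approved, 1 = loan approved
    actual_values    = [1, 0, 1, 1, 0, 1, 0, 0]
    predicted_values = [1, 0, 0, 1, 0, 1, 1, 0]

    # Rows are actual classes, columns are predicted classes
    print(confusion_matrix(actual_values, predicted_values))
    # [[3 1]   -> 3 true negatives, 1 false positive
    #  [1 3]]  -> 1 false negative, 3 true positives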


#6

Aishwarya, will you please help me out with printing the confusion matrix? I'm not getting the result; I have tried every way I could think of.


#7

Will you please help me identify which attributes to use for the actual and predicted values?


#8

Hi @aaron11,

Follow these steps

  1. Load the dataset

    import pandas as pd
    df = pd.read_csv('train.csv')
    
  2. Impute the missing values

  3. Create dummies (a sketch of steps 2 and 3 is shown after this list)

  4. Split the dataset into train and test using the below code

    from sklearn.model_selection import train_test_split
    train, test = train_test_split(df, test_size=0.3, random_state=0)
    
    x_train=train.drop('Loan_Status',axis=1)
    y_train=train['Loan_Status']
    
    x_test=test.drop('Loan_Status',axis=1)
    y_test=test['Loan_Status']
    
  5. Fit a model

    from sklearn import tree
    model = tree.DecisionTreeClassifier(random_state=1) 
    model.fit(x_train, y_train)
    
  6. Predict the values on test set

    pred = model.predict(x_test)
    
  7. Print the confusion matrix

    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(y_test, pred)
    print(cm)
    

    My output looks like this:

    [[ 26  25]
     [ 26 108]]

  8. For better visualisation, go for this code

    cm = pd.crosstab(y_test, pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
    cm
    

Result

Predicted    0    1  All
Actual
0           27   24   51
1           21  113  134
All         48  137  185
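Steps 2 and 3 above are only named, so here is a minimal sketch of one way to do them. It assumes the usual columns of the Loan Prediction train file (the same names used later in this thread) and an identifier column called Loan_ID; the fill values are just one reasonable choice, not the only one.

    # Step 2 (sketch): impute missing values
    # categorical columns -> most frequent value, numeric columns -> median
    for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History']:
        df[col] = df[col].fillna(df[col].mode()[0])
    df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())
    df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0])

    # Step 3 (sketch): one-hot encode the categorical predictors
    df = df.drop('Loan_ID', axis=1)   # identifier column (assumed name), not a predictor
    df = pd.get_dummies(df, columns=['Gender', 'Married', 'Dependents',
                                     'Education', 'Self_Employed', 'Property_Area'])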

Hope this helps!


#9


#10

I’m sorry, but I’m new to all this and learning for the very first time, which is why I need step-by-step assistance. You’ve helped me a lot already, @AishwaryaSingh. Please look into this.


#11

It looks like you have not performed one-hot encoding on your dataset. The decision tree model in sklearn cannot deal with categorical (string) variables.

Please complete this step before you fit the model. Use pd.get_dummies.
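As a small illustration of what pd.get_dummies does (a toy frame, not the Loan Prediction data):

    import pandas as pd

    # Toy frame just to show the effect of one-hot encoding
    toy = pd.DataFrame({'Property_Area': ['Urban', 'Rural', 'Semiurban'],
                        'LoanAmount': [120, 90, 150]})

    # The string column is replaced by one indicator column per category:
    # Property_Area_Rural, Property_Area_Semiurban, Property_Area_Urban
    encoded = pd.get_dummies(toy, columns=['Property_Area'])
    print(encoded.columns.tolist())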


#12

Hi @aaron11,
  1. Read the data.

  2. Check for any missing values. This blog may help you:
     https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/

data.isnull().sum()

  3. Fill in the missing values

data['Gender'].fillna('Male',inplace=True)
data['Married'].fillna('Yes',inplace=True)
data['Dependents'].fillna('0',inplace=True)
data['Education'].fillna('Graduate',inplace=True)
data['Self_Employed'].fillna('No',inplace=True)
data['Property_Area'].fillna('Semiurban',inplace=True)
data['Loan_Amount_Term'].fillna(360,inplace=True)

data['LoanAmount'].fillna(data['LoanAmount'].mean(), inplace=True)

conditions = [data['Loan_Status'] == 'Y', data['Loan_Status'] == 'N']
values = [1.0, 0.0]
data['Credit_History'] = np.where(data['Credit_History'].isnull(),
                                  np.select(conditions, values),
                                  data['Credit_History'])

There should not be any missing values now.

data.isnull().sum()

  4. Create new variables to nullify the effect of outliers.

data['LoanAmount_log'] = np.log(data['LoanAmount'])
data['TotalIncome'] = data['ApplicantIncome'] + data['CoapplicantIncome']
data['TotalIncome_log'] = np.log(data['TotalIncome'])

  5. sklearn requires all inputs to be numeric, so we should convert all our categorical variables into numeric by encoding the categories.

var_mod = ['Gender','Married','Education','Self_Employed','Property_Area','Loan_Status']
for i in var_mod:
    data[i] = data[i].astype('category')
for i in var_mod:
    data[i] = data[i].cat.codes

Note that the line after each for statement must be indented.

  6. Import models from the scikit-learn module (code as given in the tutorial, plus the confusion matrix):

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold  # For K-fold cross validation (sklearn.model_selection in newer versions)
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# Generic function for making a classification model and assessing performance:
def classification_model(model, data, predictors, outcome):
    # Fit the model:
    model.fit(data[predictors], data[outcome])

    # Make predictions on the training set:
    predictions = model.predict(data[predictors])

    # Print accuracy
    accuracy = metrics.accuracy_score(predictions, data[outcome])
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))

    # Perform k-fold cross-validation with 5 folds
    kf = KFold(data.shape[0], n_folds=5)
    error = []
    for train, test in kf:
        # Filter training data
        train_predictors = data[predictors].iloc[train, :]

        # The target we're using to train the algorithm.
        train_target = data[outcome].iloc[train]

        # Training the algorithm using the predictors and target.
        model.fit(train_predictors, train_target)

        # Record the score from each cross-validation run
        error.append(model.score(data[predictors].iloc[test, :], data[outcome].iloc[test]))

    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))

    # Fit the model again so that it can be referred to outside the function:
    model.fit(data[predictors], data[outcome])
    # Note: the arguments are (predictions, actual), so rows of the printed matrix
    # are predicted classes and columns are actual classes.
    cm = confusion_matrix(predictions, data[outcome])
    print(cm)

  7. Call the models one by one and get the metrics

7-i

outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History','LoanAmount_log','TotalIncome_log','Gender','Married','Education','Self_Employed','Property_Area']
classification_model(model, data, predictor_var, outcome_var)

The output is

Accuracy : 83.062%
Cross-Validation Score : 83.065%
[[ 95   7]
 [ 97 415]]
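As a quick sanity check, the accuracy above can be recovered from the confusion matrix itself, since the diagonal entries are the correctly classified cases:

cm = np.array([[95, 7],
               [97, 415]])
print(cm.trace() / cm.sum())   # (95 + 415) / 614 = 0.83062 -> 83.062%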

7-ii

model = DecisionTreeClassifier()
classification_model(model,data,predictor_var,outcome_var)

7-iii

model = RandomForestClassifier(n_estimators=100)
classification_model(model,data,predictor_var,outcome_var)

Experiment with different features and with different models!!!

https://github.com/ml-ds-data/DataScience/blob/master/Loan-Prediction.ipynb


#13


#14

What must I do next? I have tried a lot to remove this error, but I don't know exactly how it can be solved.


#15


#16

Finally, it worked. Thank you!