Creating confusion matrix on Loan Prediction dataset



I want to make a confusion matrix out of the Loan Prediction dataset. Will somebody out there help me? I am referring to this topic.


Import confusion_matrix from sklearn. Then pass the actual values and the predictions as arguments to the
confusion_matrix function, and print the result.

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(actual_values, predicted_values)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(data[outcome], predictions)
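For example, on a small pair of hand-made lists (hypothetical values, 1 = loan approved, 0 = rejected), the call looks like this:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels
actual_values = [1, 0, 1, 1, 0, 1, 0, 0]
predicted_values = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(actual_values, predicted_values)
print(cm)
```

The first row counts the actual 0s (correctly and incorrectly predicted), the second row the actual 1s.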



Hi @aaron11,

The code provided by @A.Malathi is exactly what you need! In your code, you are using Credit History instead of the actual values. The confusion matrix is a table that describes the performance of a classification model. You would have to use the predicted values and the true values to print the confusion matrix.


Aishwarya, will you please help me out in printing the confusion matrix? I'm not getting the result and I have tried every possible way.


Will you please help me to identify the attributes for the actual and predicted values?


Hi @aaron11,

Follow these steps

  1. Load the dataset

    import pandas as pd
    df = pd.read_csv('train.csv')
  2. Impute the missing values

  3. Create dummies

  4. Split the dataset into train and test using the below code

    from sklearn.model_selection import train_test_split
    train, test = train_test_split(df, test_size=0.3, random_state=0)
  5. Fit a model

    from sklearn import tree
    # x_train, y_train are the features and target taken from the train split
    model = tree.DecisionTreeClassifier(random_state=1), y_train)
  6. Predict the values on test set

    pred = model.predict(x_test)
  7. Plot the confusion matrix

    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(y_test, pred)
    print(cm)

    My output looks like

    [[ 26 25]
    [ 26 108]]

  8. For better visualisation, go for this code

    cm = pd.crosstab(y_test, pred, rownames=['Actual'], colnames=['Predicted'], margins=True)


    Predicted   0    1  All
    Actual
    0          27   24   51
    1          21  113  134
    All        48  137  185

Hope this helps!



I'm sorry, but I'm new to all this and learning for the very first time, which is why I need some spoon-fed assistance. You've helped me a lot already; please look into this, @AishwaryaSingh.


Looks like you have not performed one-hot encoding on your dataset. The decision tree model cannot deal with string categorical variables directly.

Please complete this step before you fit the model. Use pd.get_dummies.
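As a quick illustration, here is what pd.get_dummies does to a hypothetical categorical column like the ones in this dataset:

```python
import pandas as pd

# Hypothetical categorical column, as in the Loan Prediction data
df = pd.DataFrame({'Property_Area': ['Urban', 'Rural', 'Semiurban', 'Urban']})

# One-hot encode: one 0/1 column per category
encoded = pd.get_dummies(df, columns=['Property_Area'])
print(encoded.columns.tolist())
```

Each category becomes its own indicator column, which the tree model can consume directly.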


Hi @aaron11,

  1. Read the data.
  2. Check for any missing values. This blog may help you.


  3. Fill in the missing values


import numpy as np

data['LoanAmount'].fillna(data['LoanAmount'].mean(), inplace=True)

# Fill missing Credit_History from Loan_Status (Y -> 1.0, N -> 0.0)
conditions = [data['Loan_Status'] == 'Y', data['Loan_Status'] == 'N']
values = [1.0, 0.0]
data['Credit_History'] = np.where(data['Credit_History'].isnull(),, values),
                                  data['Credit_History'])

There should not be any missing values now.


  4. Create new variables to nullify the effect of outliers.

data['LoanAmount_log'] = np.log(data['LoanAmount'])
data['TotalIncome'] = data['ApplicantIncome'] + data['CoapplicantIncome']
data['TotalIncome_log'] = np.log(data['TotalIncome'])
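The log transform compresses extreme values toward the rest of the distribution, which is why it dampens outliers. A tiny demonstration on hypothetical loan amounts:

```python
import numpy as np
import pandas as pd

# One extreme value among otherwise similar amounts (hypothetical)
data = pd.DataFrame({'LoanAmount': [100.0, 120.0, 150.0, 700.0]})
data['LoanAmount_log'] = np.log(data['LoanAmount'])

# On the log scale the outlier sits much closer to the other values
print(data['LoanAmount_log'].round(2).tolist())
```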

  5. sklearn requires all inputs to be numeric, so we should convert all our categorical variables into numeric by encoding the categories.

# var_mod should be your list of categorical column names
for i in var_mod:
    data[i] = data[i].astype('category')
for i in var_mod:
    data[i] = data[i]

Remember to indent the line after each for statement.

  6. Import models from the scikit-learn module (code as given in the tutorial, plus the confusion matrix)

import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold #For K-fold cross-validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics
from sklearn.metrics import confusion_matrix

#Generic function for making a classification model and assessing performance:
def classification_model(model, data, predictors, outcome):
    #Fit the model:[predictors], data[outcome])

    #Make predictions on training set:
    predictions = model.predict(data[predictors])

    #Print accuracy
    accuracy = metrics.accuracy_score(data[outcome], predictions)
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))

    #Perform k-fold cross-validation with 5 folds
    kf = KFold(n_splits=5)
    error = []
    for train, test in kf.split(data):
        # Filter training data
        train_predictors = data[predictors].iloc[train, :]

        # The target we're using to train the algorithm.
        train_target = data[outcome].iloc[train]

        # Training the algorithm using the predictors and target., train_target)

        #Record error from each cross-validation run
        error.append(model.score(data[predictors].iloc[test, :], data[outcome].iloc[test]))

    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))

    #Fit the model again so that it can be referred to outside the function:[predictors], data[outcome])

    #Print the confusion matrix
    print(confusion_matrix(data[outcome], predictions))

  7. Call the models one by one and get the metrics

outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History', 'LoanAmount_log', 'TotalIncome_log', 'Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area']
classification_model(model, data, predictor_var, outcome_var)

The output is:

Accuracy : 83.062%
Cross-Validation Score : 83.065%
[[ 95   7]
 [ 97 415]]


model = DecisionTreeClassifier()
classification_model(model, data, predictor_var, outcome_var)


model = RandomForestClassifier(n_estimators=100)
classification_model(model, data, predictor_var, outcome_var)

Experiment with different features and with different models!!!



What must I do next? I have tried a lot to remove this error, but I don't know exactly how it will get solved.



Finally it worked, thank you!