Improper output of class probabilities in Random Forest Classifier

random_forest
python

#1

Hi! My dataset has 140k rows with 5 attributes and one target variable, Attrition (0 = customer does not churn, 1 = customer churns). I split the dataset into 80% training and 20% testing. The dataset is heavily imbalanced: 84% of the rows have 0 as the target and only 16% have 1.

The feature importance of my training dataset is as follows:

ColumnA = 28%, ColumnB = 27%, AnnualFee = 17%, ColumnD = 17%, and ColumnE = 11%

I initially wanted to do a very simple sanity check of my model. After fitting a Random Forest Classifier, I tested the model on a dataset with just 5 rows, keeping all variables constant except AnnualFee. Below is a snapshot of my test data:

ColumnA    ColumnB    AnnualFee    ColumnD    ColumnE
4500       3.9        5%           2.1        7
4500       3.9        10%          2.1        7
4500       3.9        15%          2.1        7
4500       3.9        20%          2.1        7
4500       3.9        25%          2.1        7

I expected that as the annual fee increases, the probability of customer churn also increases, but the output of rf.predict_proba(X_test) seems to be all over the place, and I am not sure why this is happening.
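For reference, a minimal sketch of how this check can be reproduced, assuming pandas, an already-fitted classifier rf, and the column names from the table above (values copied from the snapshot):

import pandas as pd

# Five test rows: everything held constant except AnnualFee
X_check = pd.DataFrame({
    'ColumnA':   [4500] * 5,
    'ColumnB':   [3.9] * 5,
    'AnnualFee': [0.05, 0.10, 0.15, 0.20, 0.25],
    'ColumnD':   [2.1] * 5,
    'ColumnE':   [7] * 5,
})
print(rf.predict_proba(X_check))  # second column = P(churn)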

I tried two different pieces of code, but the anomaly appears in both:

Code 1:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=400, random_state=0,
                            min_samples_split=2, min_samples_leaf=5,
                            class_weight={0: 0.0001, 1: 0.9999})
rf.fit(X_train, Y_train)

Code 2 (not my code; found online):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV

clf_4 = RandomForestClassifier(class_weight={0: 1, 1: 5})
estimators_range = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25])
depth_range = np.array([11, 21, 35, 51, 75, 101, 151, 201, 251, 301, 401, 451, 501])
kfold = 5
# shuffle=True is required when random_state is set on StratifiedKFold
skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=42)

model_grid = [{'max_depth': depth_range, 'n_estimators': estimators_range}]
grid = GridSearchCV(clf_4, model_grid, cv=skf, n_jobs=8, scoring='roc_auc')
grid.fit(X_train, Y_train)

I would really appreciate any help on this!


#2

Hi @psnh,

Since you have mentioned that this is an imbalanced-class problem, you can use resampling techniques such as under-sampling or over-sampling to train the model. I will explain each of them briefly below:

Under-sampling balances the class distribution by randomly eliminating majority-class examples. This is done until the majority and minority classes have equal numbers of instances.

Over-sampling increases the number of instances in the minority class by randomly replicating them, giving the minority class a higher representation in the sample.
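A minimal sketch of both techniques, assuming the imbalanced-learn package (imblearn) is installed and X_train/Y_train are as in your code:

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

# Under-sampling: randomly drop majority-class rows until classes are balanced
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, Y_train)

# Over-sampling: randomly replicate minority-class rows until classes are balanced
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, Y_train)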

To read more on how to deal with imbalanced classification problems, you can go through this article.

Also, can you please specify why you chose class_weight = {0: .0001, 1: .9999} for the model, rather than any other values?
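As a point of comparison, here is a minimal sketch of how "balanced" class weights can be derived from the data rather than picked by hand, using scikit-learn's compute_class_weight (Y_train is assumed to be your training target):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Weights inversely proportional to class frequency; for an 84%/16% split
# this gives roughly {0: 0.60, 1: 3.13}
classes = np.unique(Y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=Y_train)
print(dict(zip(classes, weights)))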


#3

@PulkitS - thanks for your reply
Below is a snapshot of my predicted probabilities (first column: does not churn; second column: churn), one row per fee level from 5% to 25%:

array([[0.52599144, 0.47400856],
       [0.51686896, 0.48313104],
       [0.50511793, 0.49488207],
       [0.49370709, 0.50629291],
       [0.52567296, 0.47432704]])

The probability of churn drops from 50.6% at the 20% fee to 47.4% at the 25% fee, even though the annual fee increases. I expected the probability of churn to rise monotonically with the annual fee.

I tried random over-sampling and SMOTE, but my recall and precision for the minority class are always very low. I also tried class_weight = 'balanced', but the accuracy of my model was very low. I then arbitrarily tried class_weight = {0: .0001, 1: .9999} and got 59% accuracy.

I am trying to model customer sensitivity to an increase in the annual fee.


#4

Hi @psnh,

To improve recall and precision, you can plot the ROC curve and calculate the roc_auc_score. This score gives the area under the ROC curve and describes how well the predicted probabilities separate the classes (0 and 1 in your case). The higher the area under the curve, the better the discriminative power of the model.

Once you have the predicted probabilities, you can also tune the classification threshold, which is 0.5 by default: a sample with predicted probability below 0.5 is assigned class 0, and above 0.5 it is assigned class 1. One point to note: the roc_auc_score itself is computed from the probabilities and does not change with the threshold, so sweep the threshold against a threshold-dependent metric such as F1 or balanced accuracy and keep the value that maximizes it.

If the roc_auc_score improves, the probabilities separate the classes better, and accuracy at a well-chosen threshold usually improves as well.
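A minimal sketch of this threshold sweep, assuming rf is the fitted classifier and X_test/Y_test are the held-out data (maximizing F1 here is one reasonable objective, not the only one):

import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

proba = rf.predict_proba(X_test)[:, 1]           # probability of class 1 (churn)
print('ROC AUC:', roc_auc_score(Y_test, proba))  # threshold-independent

# Sweep thresholds and keep the one that maximizes F1 for the minority class
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(Y_test, (proba >= t).astype(int)) for t in thresholds]
print('Best threshold:', thresholds[int(np.argmax(scores))])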


#5

Hi @PulkitS!

Thanks again for your reply! Would lowering the threshold still help if I am only concerned with the probability as an output? Irrespective of whether the predicted label is 0 or 1, I just want to see the probability increase from 50.6% at a 20% fee increase to something higher at a 25% increase. Logically, if the model is 50% confident that the customer will churn at a 20% increase, then it should give a higher churn probability (greater than 50%) when the increase is larger. Please let me know if I am not understanding this correctly.

Really appreciate your help on this!


#6

Hi @psnh,

Probability is not an evaluation metric, so you cannot decide whether your model is accurate based on the probability values alone.

So, rather than focusing on the raw probability values, try to improve the overall accuracy of your model by finding the optimal threshold value.

Hope this helps!


#7

Hi @PulkitS,

This is really helpful! But for some of my test cases I am not sure whether the output should be 0 or 1. For example, all I know is that when I give a customer a 15% increase, the customer does not churn (output value 0, probability 43%). Keeping this in mind, I want to find the customer's sensitivity at 20%, 25%, and 30% increases.

Since I don't know the customer's behavior at these percentages, I don't have an output value against which to test whether my model is correct. All I know is that the customer did not churn at 15%; he may or may not churn at 20% or 30%. The only way I can evaluate his sensitivity is by looking at the probability.

If at a 20% increase the probability rises to 45%, then I know the customer becomes more sensitive to churn as the percentage increases; but if the probability drops to 40%, that would mean he is less sensitive to a 20% increase than to a 15% one (which I am not sure makes sense). A dataset-wide version of this check is sketched below.
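For what it's worth, a minimal sketch of this sensitivity check as a hand-rolled partial-dependence sweep: hold every row of the training data at the same AnnualFee value and average the predicted churn probability (rf, X_train, and the AnnualFee column name are assumptions carried over from earlier in the thread):

# Average churn probability as AnnualFee is forced to fixed values
for fee in [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]:
    X_sweep = X_train.copy()
    X_sweep['AnnualFee'] = fee   # every row gets the same fee
    avg_churn = rf.predict_proba(X_sweep)[:, 1].mean()
    print(f'fee={fee:.0%}  avg churn probability={avg_churn:.3f}')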

Sorry if my question sounds repetitive. Thanks for your help!


#8

If you do not mind, can you send the dataset so we can assist?