How to improve the score of a Binary Classification model (Attrition) with Imbalanced Data?




I am trying to create a binary classification model (Attrition) for imbalanced data using Random Forest (0: 84K, 1: 16K). I have tried using class_weight='balanced', class_weight={0: 1, 1: 5}, downsampling, and oversampling, but none of these seem to work. My metrics are usually in the range below:

Accuracy = 66%
Precision = 23%
Recall = 44%

I am not sure what else I can try to improve my metrics.
I would really appreciate any help on this! Thanks
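For context, the downsampling step I mentioned looks roughly like this (a minimal sketch on toy data; the `Income` column and the 84/16 split are just placeholders standing in for my real frame):

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame standing in for the real data: 'Attrition' is the target
# (0 = majority class, 1 = minority class).
df = pd.DataFrame({
    "Income": range(100),
    "Attrition": [0] * 84 + [1] * 16,
})

majority = df[df["Attrition"] == 0]
minority = df[df["Attrition"] == 1]

# Downsample the majority class to the size of the minority class.
majority_down = resample(
    majority,
    replace=False,
    n_samples=len(minority),
    random_state=42,
)

balanced = pd.concat([majority_down, minority])
print(balanced["Attrition"].value_counts())
```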



To answer your question in more detail, I would suggest you give more description of the data. What problem are you trying to solve? What are the dependent and independent variables? What is the proportion of class labels, etc.?

Generally, here is what you can do:

  • You can try removing the correlated variables first. After removing the most correlated variables, build a model on the remaining variables and calculate the scores.

  • If this does not improve the score, plot the ROC curve and calculate the roc_auc_score. Then improve this score by changing the classification threshold, which is 0.5 by default: find the optimum threshold value and recalculate the scores. It will improve the performance of your model.
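For the threshold step, here is a minimal sketch (on synthetic imbalanced data, with assumed hyperparameters) that picks the cutoff maximising Youden's J statistic (tpr − fpr) on the ROC curve instead of using 0.5:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly the same 84/16 imbalance as the question.
X, y = make_classification(n_samples=5000, weights=[0.84, 0.16],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4,
                                          random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Score the positive class rather than taking hard 0/1 predictions.
probs = rf.predict_proba(X_te)[:, 1]
print("ROC AUC:", roc_auc_score(y_te, probs))

# Choose the threshold where tpr - fpr is largest (Youden's J).
fpr, tpr, thresholds = roc_curve(y_te, probs)
best = thresholds[np.argmax(tpr - fpr)]

preds = (probs >= best).astype(int)
print("threshold:", best, "recall:", recall_score(y_te, preds))
```

Lowering the threshold below 0.5 trades precision for recall, which is usually the right direction when the positive (attrition) class is rare.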

To learn further about how to deal with imbalanced classes, you can refer to this article.



Thanks for the reply! I initially started with 8 variables and removed 2 as they were highly correlated. My dependent variable is Attrition, and my independent variables are mostly demographic, e.g. Income and Householdsize. The target variable (Attrition) can be either 0 or 1. I have around 84K data points with Attrition as 0 and 16K with Attrition as 1. I divided my data into training and testing sets (60%:40%). My recall on the test set is always low. Below is my code:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.4)
X_train = train.drop(['Attrition'], axis=1)
Y_train = train['Attrition']
X_test = test.drop(['Attrition'], axis=1)
Y_test = test['Attrition']
rf = RandomForestClassifier(n_estimators=100, random_state=42,
          min_samples_split=5, min_samples_leaf=5, class_weight="balanced")
rf.fit(X_train, Y_train)

preds = rf.predict(X_test)
pd.crosstab(Y_test, preds, rownames=['Actual Target'], colnames=['Predicted Target'])

Predicted Target      0      1
Actual Target
0                 39298   8929
1                  6860   2924

I will try to change the threshold value and analyze the performance.
Thanks for your help!