Thanks for the reply! I initially started with 8 variables and removed 2 because they were highly correlated. My dependent variable is Attrition, and the independent variables are mostly demographic, e.g. Income and Householdsize. The target variable (Attrition) is either 0 or 1. I have around 84k data points with Attrition = 0 and 16k with Attrition = 1. I split the data into training and testing sets (60%:40%). My recall on the test set is always low. Below is my code:
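import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hold out 40% of the rows for testing
train, test = train_test_split(df, test_size=0.4)
X_train = train.drop(['Attrition'], axis=1)
Y_train = train['Attrition']
X_test = test.drop(['Attrition'], axis=1)
Y_test = test['Attrition']
# class_weight="balanced" weights each class inversely to its frequency
rf = RandomForestClassifier(n_estimators=100, random_state=42,
                            min_samples_split=5, min_samples_leaf=5,
                            class_weight="balanced")
rf.fit(X_train, Y_train)
preds = rf.predict(X_test)
# Confusion matrix: rows are actual labels, columns are predicted labels
pd.crosstab(Y_test, preds, rownames=['Actual Target'], colnames=['Predicted Target'])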
| Actual Target \ Predicted Target | 0 | 1 |
|---|---|---|
| 0 | 39298 | 8929 |
| 1 | 6860 | 2924 |
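To put a number on "low recall", here is a small sketch that works out recall and precision for class 1 from the counts in the table above. The variable names (`tn`, `fp`, `fn`, `tp`, `recall_1`, `precision_1`) are just my own labels for the four cells.

```python
# Counts copied from the crosstab above (rows = actual, columns = predicted).
tn, fp = 39298, 8929   # actual 0: predicted 0, predicted 1
fn, tp = 6860, 2924    # actual 1: predicted 0, predicted 1

# Recall for class 1: share of actual attrition cases the model caught.
recall_1 = tp / (tp + fn)
# Precision for class 1: share of predicted attrition cases that were right.
precision_1 = tp / (tp + fp)

print(f"recall (class 1)    = {recall_1:.3f}")     # 0.299
print(f"precision (class 1) = {precision_1:.3f}")  # 0.247
```

So roughly 30% of the actual attrition cases in the test set are being caught.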
I will try to change the threshold value and analyze the performance.
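A rough sketch of what I have in mind for the threshold experiment, assuming the positive-class probabilities come from `rf.predict_proba(X_test)[:, 1]`. The helper `metrics_at_threshold` and the toy arrays `y_demo` / `proba_demo` are hypothetical and only there so the snippet runs on its own; they are not results from my data.

```python
import numpy as np

def metrics_at_threshold(y_true, proba_pos, threshold):
    """Return (recall, precision) for class 1 when predicting 1 if proba_pos >= threshold."""
    y_true = np.asarray(y_true)
    preds = (np.asarray(proba_pos) >= threshold).astype(int)
    tp = int(np.sum((preds == 1) & (y_true == 1)))
    fp = int(np.sum((preds == 1) & (y_true == 0)))
    fn = int(np.sum((preds == 0) & (y_true == 1)))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# With the fitted forest this would be:
#   proba_pos = rf.predict_proba(X_test)[:, 1]
# Toy values below are made up for illustration only.
y_demo = [0, 0, 0, 1, 1, 1]
proba_demo = [0.10, 0.35, 0.60, 0.30, 0.45, 0.80]

# Lowering the threshold below the default 0.5 trades precision for recall.
for t in (0.5, 0.4, 0.3):
    r, p = metrics_at_threshold(y_demo, proba_demo, t)
    print(f"threshold={t:.1f}  recall={r:.2f}  precision={p:.2f}")
```

On the real data I would pass `Y_test` and the forest's probabilities instead of the toy arrays and pick the threshold that gives an acceptable recall/precision trade-off.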
Thanks for your help!