Too Many False Positives with Unbalanced Data

r
random_forest

#1

Hi!

I am trying to predict customer churn at a telco company using R. The dataset is very unbalanced: the target class is only around 0.6% of the base.

  • 8,746 customers churn
  • 1,396,664 customers do not churn

I have trained a random forest in R. Prior to training, I apply SMOTE to the training data:

train.smote <- SMOTE(Churn ~ ., train, perc.over = 100, perc.under = 200)

This gives me a 1:1 balance. I then train the forest using:

fit <- randomForest(as.factor(Churn) ~ ., data = train.smote,
                    importance = TRUE, ntree = 500)

When I run

pred <- predict(fit, newdata = test, type = "response")

on my validation data, I get the following confusion matrix:

                    Predicted: No Churn    Predicted: Churn
Actual: No Churn             1,136,610             234,625
Actual: Churn                    3,762               5,911

From this matrix, the overall accuracy is about 0.83, but the sensitivity for the churn class is only 0.61.
However, the number of false positives is far too high (234,625).
Please suggest a method to curb these false positives without compromising the true positives.

Thanks!


#2

Hi @mahawaseem,

XGBoost works well with unbalanced datasets. Try implementing it and check whether the performance improves.
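For example, a minimal sketch of what that could look like, assuming train and test are data frames whose only non-numeric column is a 0/1 Churn label (xgboost needs a numeric feature matrix, so factors would first have to be encoded):

library(xgboost)

# Sketch only: all predictors assumed numeric, Churn assumed coded 0/1
X_train <- as.matrix(train[, setdiff(names(train), "Churn")])
y_train <- as.numeric(train$Churn)
dtrain  <- xgb.DMatrix(data = X_train, label = y_train)

fit_xgb <- xgb.train(
  params = list(objective = "binary:logistic", eval_metric = "auc"),
  data = dtrain,
  nrounds = 200  # arbitrary here; tune with cross-validation
)

# Returns churn probabilities rather than hard class labels
X_test    <- as.matrix(test[, setdiff(names(test), "Churn")])
pred_prob <- predict(fit_xgb, xgb.DMatrix(X_test))

Since the model outputs probabilities, you can also raise the classification cutoff on your validation set to directly trade false positives against true positives.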


#3

Thank you, Aishwarya! I will try that and check. Is there a particular parameter in xgboost that supports unbalanced data?