Too many False Positives with Unbalanced Data


I am trying to predict customer churn for a telco company using R. The dataset is very unbalanced: the target class is around 0.6% of the base.

  • 8,746 customers churn
  • 1,396,664 customers do not churn

I have trained a random forest in R. Prior to training, I SMOTE the training data:

train.smote <- SMOTE(Churn ~ ., train, perc.over = 100, perc.under = 200)

This gives me a 1:1 balance. I then train the forest using:

fit <- randomForest(as.factor(Churn) ~ ., data = train.smote, importance = TRUE)

When I run the model on my validation data, I get the following confusion matrix:

                  Predicted No-churn    Predicted Churn
Actual No-churn   1,136,610 (TN)        234,625 (FP)
Actual Churn      3,762 (FN)            5,911 (TP)
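Reading the matrix with 234,625 as the false-positive count (as stated below), the cells are TN = 1,136,610, FP = 234,625, FN = 3,762, TP = 5,911. A quick recomputation of the headline metrics (plain arithmetic, shown in Python for illustration):

```python
# Cell values taken from the confusion matrix above.
tn, fp = 1_136_610, 234_625   # actual no-churn
fn, tp = 3_762, 5_911         # actual churn

sensitivity = tp / (tp + fn)                  # churn recall
specificity = tn / (tn + fp)
precision   = tp / (tp + fp)
accuracy    = (tp + tn) / (tp + tn + fp + fn)

print(round(specificity, 2), round(sensitivity, 2), round(precision, 3))
# 0.83 0.61 0.025
```

The precision of about 2.5% makes the complaint concrete: nearly every "churn" prediction is a false alarm.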

The specificity works out to 0.83 and the sensitivity (churn recall) to 0.61.
However, the number of false positives is far too high (234,625).
Please suggest a method to curb these false positives without compromising the true positives.
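One standard lever for exactly this trade-off (not mentioned in the thread) is to move the classification threshold instead of relying on the default 0.5 cutoff — in R's randomForest this is the `cutoff` argument, or you can threshold the output of `predict(fit, type = "prob")` yourself. A minimal Python sketch of the idea, using made-up probabilities and labels:

```python
# Illustrative only: made-up churn probabilities and true labels.
probs  = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]  # model's P(churn)
actual = [1,    1,    0,    1,    0,    0]     # 1 = churned

def confusion(threshold):
    """Return (tp, fp, fn, tn) for a given decision threshold."""
    tp = sum(p >= threshold and a == 1 for p, a in zip(probs, actual))
    fp = sum(p >= threshold and a == 0 for p, a in zip(probs, actual))
    fn = sum(p <  threshold and a == 1 for p, a in zip(probs, actual))
    tn = sum(p <  threshold and a == 0 for p, a in zip(probs, actual))
    return tp, fp, fn, tn

print(confusion(0.5))  # (2, 1, 1, 2)
print(confusion(0.7))  # (2, 0, 1, 3) -- FP drops, TP unchanged here
```

On real data you would sweep the threshold on the validation set and pick the point on the precision/recall curve you can live with; raising it cuts false positives at the cost of some recall.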


Hi @mahawaseem,

XGB works well with unbalanced datasets. Try implementing it and check if the performance improves.

Thank you Aishwarya! I will try that and check. Is there a particular parameter in xgboost which supports unbalanced data?

Hi @mahawaseem

You can resample your data to a 60:40 or 70:30 ratio so that the model has enough minority examples to learn from, even though the raw data is scarce. Also try oversampling with ADASYN, which adds noise to the synthetic minority samples it generates.

You may also try taking only a sample of the customers that do not churn, together with all the customers that do churn, and build a model on that.
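The undersampling idea above can be sketched as follows (a pure-Python illustration with made-up customer IDs; in R you would sample the majority-class rows the same way):

```python
import random

random.seed(0)  # reproducible sample

# Made-up data: keep ALL churners, sample only the non-churners.
churners     = [("cust_%d" % i, 1) for i in range(100)]
non_churners = [("cust_%d" % i, 0) for i in range(100, 10_100)]

# Aim for roughly the 70:30 majority:minority ratio suggested above.
n_keep = int(len(churners) * 70 / 30)
train = churners + random.sample(non_churners, n_keep)

print(len(train))  # 100 churners + 233 sampled non-churners = 333
```

Undersampling discards information from the majority class, so it is usually worth comparing against class weighting or threshold tuning on the same validation set.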


© Copyright 2013-2019 Analytics Vidhya