Does oversampling really help in improving model performance?

oversampling

#1

Hello,

I am working on a dataset that is skewed (10% Yes / 90% No). I tried to use the ROSE package to oversample it and then ran logistic regression on the result, but the accuracy hardly improved.
My code:

library(ROSE)  # for ROSE()
library(Epi)   # for ROC()
# Generate a synthetic, balanced training set:
train.rose <- ROSE(Churn ~ ., data = train.data, seed = 123)$data
# Fit a logistic regression on the balanced data:
rose.logit <- glm(Churn ~ ., data = train.rose, family = "binomial")
# ROC curve on the (balanced) training data:
ROC(form = Churn ~ ., plot = c("sp", "ROC"), PV = TRUE, MX = TRUE, MI = TRUE, AUC = TRUE, data = train.rose)

After this comes the usual prediction on the test data and calculation of the AUC.
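Roughly, a minimal sketch of that step (assuming a held-out test.data with the same columns; roc.curve() comes with the ROSE package):

# Predicted probabilities on the untouched test set:
pred.prob <- predict(rose.logit, newdata = test.data, type = "response")
# AUC and ROC curve on the test set:
roc.curve(test.data$Churn, pred.prob)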
While the unbalanced dataset was giving me an ROC AUC of 0.60, this gives 0.61.
So I am unable to understand how oversampling actually helps, since it did not in this case.
Or am I going totally down the wrong track here, and something else has to be done to oversample the data?
Can someone please guide me on this?


#2

Hi,

the only point I can think of is the undersampling that ROSE also does. Did you check the size of the new (formerly majority) class?
That would be something to verify.
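For example, a quick check of the class counts before and after (using the object names from the post above):

table(train.data$Churn)   # original, skewed counts
table(train.rose$Churn)   # counts after ROSE; check how much the majority class shrank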

Alain


#3

Hi,
this is work under development; since my performance in the Hack was so bad, I dug into what went wrong. I have included a small report (not so clean yet) using the hack data: Unbalance Class.pdf (249.6 KB).
Two methods:
- SMOTE: a different method from ROSE for generating synthetic cases, giving a balanced sample with as many majority as minority cases; the minority class is oversampled via k-nearest neighbours (see the sketch after this list).
- Biased: that is, left unbalanced, with as much as 90% majority.
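To illustrate the SMOTE side, a minimal sketch using the smotefamily package (one of several R implementations of SMOTE; it assumes all predictors are numeric and the outcome column is named Churn):

library(smotefamily)
# Split predictors and outcome; smotefamily::SMOTE() expects numeric features:
X <- train.data[, setdiff(names(train.data), "Churn")]
y <- train.data$Churn
# K = 5 nearest neighbours; dup_size = 0 lets SMOTE choose the duplication
# factor that roughly balances the classes:
sm <- SMOTE(X, y, K = 5, dup_size = 0)
train.smote <- sm$data      # original rows plus synthetic minority rows
table(train.smote$class)    # the outcome ends up in a column named "class"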

The model is built with random forest (from the randomForest package); try other tree methods as well. The results on the test set are close for the two methods, though I should really check with a validation set too. Worth noticing that with the unbalanced data the accuracy curve is noisier.
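The model fit itself would be along these lines (a minimal sketch; train.bal stands in for whichever rebalanced training set is used, and test.data for a held-out set):

library(randomForest)
# Churn must be a factor for a classification forest:
rf <- randomForest(Churn ~ ., data = train.bal, ntree = 500)
pred <- predict(rf, newdata = test.data)
table(pred, test.data$Churn)   # confusion matrix on the test set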

Hope this could help.

Alain


#4

Thanks a lot @Lesaffrea. This is indeed helpful. :smile: