How to handle an unbalanced dataset

skewed

#1

Hello,

I have a dataset in which the response variable (Loan Default) is highly skewed (2600 Yes, 260 No).
I have tried several methods such as random forest, decision tree and logistic regression, but the results are not encouraging. I read somewhere about oversampling, in which I take the 260 No records, select around 260 records from the part of the data having Default = Yes, combine the two sets, and then apply logistic regression or a decision tree. The results are marginally better (AUC = 0.56) than when I use the real dataset (AUC = 0.5), but is this the right way to go about this problem?
Can someone please help me in dealing with this problem?
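
For reference, the balanced-subset approach described above (usually called random undersampling of the majority class) can be sketched roughly as follows. This is only a minimal illustration in plain Python: the `undersample` helper, the `default` key and the single feature `x` are made-up names, and the toy rows only reproduce the class counts from the post, not the real data.

```python
import random

def undersample(rows, label_key, majority_label, n_keep, seed=0):
    """Keep every non-majority row plus a random sample of n_keep majority rows."""
    rng = random.Random(seed)
    majority = [r for r in rows if r[label_key] == majority_label]
    rest = [r for r in rows if r[label_key] != majority_label]
    sampled = rng.sample(majority, n_keep)  # sampling without replacement
    balanced = rest + sampled
    rng.shuffle(balanced)  # avoid having all rows of one class first
    return balanced

# Toy rows with the class counts from the post (hypothetical feature "x").
rows = [{"x": i, "default": "Yes"} for i in range(2600)]
rows += [{"x": i, "default": "No"} for i in range(260)]

balanced = undersample(rows, "default", "Yes", n_keep=260)
counts = {label: sum(r["default"] == label for r in balanced) for label in ("Yes", "No")}
```

If something like this is used, it is probably safer to compute the AUC on held-out data that keeps the original class proportions, so the score reflects the real distribution.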


#2

Hi,

You can oversample using nearest neighbours and bootstrapping; look up the Synthetic Minority Over-sampling Technique (SMOTE). A SMOTE function is available in an R data mining package, if you use that language.
Be careful not to oversample too much. For example, aiming for 2600 Yes and the same number of No would be too aggressive: more than doubling the minority class could lead to issues. I do not know your data, but I would start with 520 No and 520 Yes and then adjust after the first trial.
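
To make the idea concrete, here is a rough sketch of SMOTE-style oversampling that doubles the 260 No records to 520 and randomly picks 520 Yes records. It is written in plain Python rather than R, it is not the R SMOTE function, and it assumes purely numeric features. The `smote_like` helper, the two-feature toy points and the choice of k = 5 are all assumptions for illustration.

```python
import math
import random

def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base (Euclidean), excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic

# Toy numeric data with the class counts from the thread (made-up features).
rng = random.Random(42)
no_points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(260)]
yes_points = [(rng.gauss(1, 1), rng.gauss(1, 1)) for _ in range(2600)]

no_520 = no_points + smote_like(no_points, n_new=260)  # 260 real + 260 synthetic
yes_520 = rng.sample(yes_points, 520)                  # undersample the majority
```

The synthetic points should only go into the training set; the test set should contain real records only.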

Hope this helps.

Alain