How to model a rare event population in a classification problem

rare_event
classification

#1

Hi all,

I am working on a classification problem: predicting whether a particular event happens or not (Yes/No). My target variable is dominated by “No”, and “Yes” comprises only 2% of the entire dataset, so the model's performance is poor. However, if I sample the dataset so that “Yes” makes up at least 10% of it, the model performs well. I want to understand whether this approach is right or not.
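For concreteness, the sampling described above (keeping all “Yes” rows and undersampling the “No” rows until “Yes” is about 10% of the data) can be sketched as follows; the target vector here is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic target: roughly 2% "Yes" (1), 98% "No" (0)
y = (rng.random(10_000) < 0.02).astype(int)

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# Keep all "Yes" rows; sample just enough "No" rows so that
# "Yes" is about 10% of the resulting dataset
target_pos_rate = 0.10
n_neg_keep = int(len(pos_idx) * (1 - target_pos_rate) / target_pos_rate)
neg_keep = rng.choice(neg_idx, size=n_neg_keep, replace=False)

sample_idx = np.concatenate([pos_idx, neg_keep])
y_sampled = y[sample_idx]
print(y_sampled.mean())  # roughly 0.10
```

The same `sample_idx` would then be used to subset the feature matrix before training.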

Regards,
Karthikeyan P


#2

@karthe1

This is the better approach; in fact, it is the one I would have gone for as well. However, you will need to make slope and intercept corrections at the end.
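For concreteness, one standard intercept correction for a logistic regression trained on oversampled data is the “prior correction” of King and Zeng: the slopes are kept, and only the intercept is shifted back to the true base rate. A minimal sketch, where the fitted intercept and the two rates are made-up numbers:

```python
import math

# tau: true proportion of "Yes" in the population (2%)
# rho: proportion of "Yes" in the oversampled training data (10%)
tau, rho = 0.02, 0.10

# Hypothetical intercept fitted on the oversampled data
b0_sampled = -1.5

# Prior correction: shift the intercept back toward the true base
# rate; the slope coefficients are left unchanged
b0_corrected = b0_sampled - math.log(((1 - tau) / tau) * (rho / (1 - rho)))

print(round(b0_corrected, 4))  # about -3.1946
```

Predicted probabilities from the corrected model then reflect the true 2% base rate rather than the inflated 10% sample rate.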

Here is a framework to solve this problem:

And here is the article, which explains this in detail:

Looks like you haven’t read all the articles yet :wink:

Kunal


#3

:slight_smile:

Thank you very much. You caught me red-handed; I have not read this article. :smile:


#4

Hi @kunal sir,

I am using decision trees, since I have many categorical variables in my dataset. Can I use the above method tailored to decision trees? Do you have any thoughts?

I also came across a technique called SMOTE, which over-samples the rare events and under-samples the frequent events. I am not sure whether it applies the correction automatically or not. Do you have any idea about this? I applied this technique to my dataset, and the results were good on a few test datasets but inconsistent on a couple of others. I am really not sure whether this inconsistency is due to the nature of the data.
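For reference, the core idea of SMOTE is linear interpolation between a minority-class point and one of its nearest minority-class neighbours. A minimal sketch on made-up 2-D minority data (the real imbalanced-learn implementation handles many more details, such as combining this with undersampling of the majority class):

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up minority-class points in 2-D
X_min = rng.normal(loc=5.0, scale=0.5, size=(20, 2))

def smote_sketch(X, n_new, k=5, rng=rng):
    """Generate n_new synthetic minority samples by interpolating
    between a randomly chosen point and one of its k nearest
    minority-class neighbours."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # Distances from point i to every minority point
        d = np.linalg.norm(X - X[i], axis=1)
        # Indices of the k nearest neighbours, excluding the point itself
        nn = np.argsort(d)[1:k + 1]
        j = rng.choice(nn)
        gap = rng.random()  # interpolation factor in [0, 1)
        new_points.append(X[i] + gap * (X[j] - X[i]))
    return np.array(new_points)

X_synth = smote_sketch(X_min, n_new=40)
print(X_synth.shape)  # (40, 2)
```

Note that SMOTE only rebalances the training data; it does not apply any probability correction afterwards, so the resulting predicted probabilities are still calibrated to the resampled class ratio, not the true one.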

Any sort of advice will be helpful.

Regards,
Karthikeyan P


#5

@kunal sir @tavish_srivastava ,

Is rare event modelling applicable only to logistic regression? I am working on a decision tree model and just want to know if there is a similar procedure for decision trees as well.

Regards,
Karthikeyan P


#6

That’s a very good question! The answer lies in another question: why do we do over-sampling in logistic regression in the first place? The reason is to remove the bias from the coefficients estimated by the model. When you build a decision tree, you are not estimating any coefficients, and hence over-sampling is not done.
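As a practical note, a common alternative to resampling for trees is class weighting: errors on the rare class are weighted more heavily in the split criterion. For example, the “balanced” heuristic used by scikit-learn’s `class_weight="balanced"` option assigns each class the weight `n_samples / (n_classes * n_in_class)`; a sketch in plain numpy with a made-up 2%/98% target:

```python
import numpy as np

# Made-up imbalanced target: 2% "Yes" (1), 98% "No" (0)
y = np.array([1] * 20 + [0] * 980)

classes, counts = np.unique(y, return_counts=True)

# "balanced" heuristic: weight = n_samples / (n_classes * n_in_class)
weights = len(y) / (len(classes) * counts)
print(dict(zip(classes.tolist(), weights.tolist())))
# "No" gets a weight of about 0.51, "Yes" a weight of 25: the tree's
# impurity criterion then treats one "Yes" row like roughly 49 "No"
# rows, with no resampling and no post-hoc intercept correction.
```

This keeps the full dataset intact, which avoids the inconsistency that can come from which majority rows happen to be dropped in a given sample.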

Hope this helps,
Tavish


#7

Hi @tavish_srivastava,

Thank you very much for the reply.

So do you mean to say that the class imbalance (5% attrition and 95% non-attrition) affects only the coefficients? How does a decision tree handle these kinds of rare events?

Is it right to assume that when I build a decision tree model, I should still do the biased sampling, but skip the final correction step?

Kindly advise. Thanks again.

Regards,
Karthikeyan P