I have 2 lakh (200,000) records in my sample, with 70 variables. 40 of them are categorical, and many of these have missing or blank values. A few of the categorical variables have 15 to 40 categories. I am trying to predict a rare event. Can someone help me understand the best approach to cleaning the data?
If a column has a very high share of missing values and only a few values present, you can consider removing that column.
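As a rough sketch of the column-dropping idea, assuming the data sits in a pandas DataFrame and that blanks are stored as empty or whitespace-only strings. The function name, the 0.6 threshold and the column names are made up for illustration; the right cutoff depends on your data.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing=0.6):
    """Drop columns whose share of missing/blank values exceeds max_missing."""
    # Assumption: blanks are empty or whitespace-only strings; turn them into NaN.
    cleaned = df.replace(r"^\s*$", np.nan, regex=True)
    missing_share = cleaned.isna().mean()  # fraction of missing values per column
    keep = missing_share[missing_share <= max_missing].index
    return cleaned[keep]

# Toy data with hypothetical column names.
df = pd.DataFrame({
    "city": ["Pune", "", "Delhi", "Pune", "Delhi"],   # 20% blank
    "referrer": ["", " ", "", "ad", ""],              # 80% blank
    "age": [34, 51, 29, 42, 38],                      # no blanks
})
cleaned = drop_sparse_columns(df, max_missing=0.6)
print(list(cleaned.columns))
```

Dropping whole columns this way keeps all the rows, which matters when the positive class is already rare.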
Also, you can try reducing the number of categories in some cases. For example, you can look at the distribution of values for each category and then combine some of them.
If you have categories like infant, kid, child, teen, adult, senior, old and so on, and you want to reduce their number, you can combine infant, kid, child and teen under the label 'young' and put the rest under 'elder'. This gives you just 2 categories.
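A minimal sketch of this kind of regrouping, assuming the column is a pandas Series. The mapping is hypothetical, and placing 'infant' with 'young' is my assumption.

```python
import pandas as pd

# Hypothetical mapping from fine-grained labels to two coarse groups.
AGE_GROUPS = {
    "infant": "young", "kid": "young", "child": "young", "teen": "young",
    "adult": "elder", "senior": "elder", "old": "elder",
}

ages = pd.Series(["kid", "adult", "teen", "old", "infant", "senior"])
# Labels missing from the mapping would become NaN, so keep the mapping complete.
grouped = ages.map(AGE_GROUPS)
print(grouped.tolist())
```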
If, out of the 40 categories you have, 30 have only 5-10 values each and the remaining 10 have values in the thousands, you can combine those 30 under a single label 'other'. This will reduce the number of categories.
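One way to sketch the 'other' bucket, again assuming pandas. The helper name `lump_rare` and the `min_count` cutoff of 100 are illustrative choices, not fixed rules.

```python
import pandas as pd

def lump_rare(series, min_count=100, other_label="other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; overwrite the rare ones with other_label.
    return series.where(~series.isin(rare), other_label)

# Toy column: two frequent categories and three rare ones.
s = pd.Series(["A"] * 500 + ["B"] * 300 + ["C"] * 7 + ["D"] * 5 + ["E"] * 9)
lumped = lump_rare(s, min_count=100)
print(lumped.value_counts().to_dict())
```

With a rare target it is worth checking, before lumping, whether any of the small categories has an unusually high event rate, since merging would hide that signal.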
Thanks for the reply.
I have performed the cleaning as you suggested.
First: removed the records that have blanks.
Second: combined the categories wherever possible.
In addition, the model is a logistic regression (rare event prediction) and I am getting 98% R^2. Say 1 is a true event and 0 is a false event. The model predicts 0 for all the true events and 1 for a few of the false events, which is why the R^2 is high.
I need help with predicting a rare event through logistic regression.
From what you have explained, this looks like an imbalanced classification problem. Can you tell me the distribution of your target variable? What is the total number of rows, and how many of them are 1 and how many are 0?
I have 160,000 records, of which only 1.8% are true events (1). I have started looking at the model through the lens of an imbalanced sample and have started sampling. I will run it and see whether I hit any challenges. The only resource I am using for imbalanced datasets is Google. Can you recommend any book that would be helpful? I have also gone through the Analytics Vidhya article on imbalanced datasets.
You can try using boosting techniques. Also, accuracy is not the right metric to use in this case. If I do not build a model at all and simply label every event as 0, I will still get a high accuracy. You can use the F1 score or precision/recall to assess your model's performance.
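To make the accuracy point concrete, here is a small sketch in plain Python using the figures from this thread (160,000 rows, 1.8% positives). The helper `precision_recall_f1` is written for illustration, and returning 0.0 when a denominator is zero is a convention I am assuming.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

n_rows = 160_000
n_pos = round(n_rows * 0.018)   # number of true events (1)
n_neg = n_rows - n_pos          # number of non-events (0)

# "Model" that predicts 0 for every row: no true positives, no false positives.
tp, fp, fn, tn = 0, 0, n_pos, n_neg
accuracy = (tp + tn) / n_rows
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"positives={n_pos}, accuracy={accuracy:.3f}")
print(f"precision={precision}, recall={recall}, f1={f1}")
```

The all-zero baseline scores about 98% accuracy while its recall and F1 are zero, which is why those metrics are more informative here.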
Below are a few links: