I was working on feature generation. The goal is to find promising features that may separate the 2 classes better, BEFORE running the model.
The data is severely imbalanced: the positive:negative ratio is almost 1:80000.
The features are all continuous numerical data.
Last week I was checking the initial distribution of each feature to see whether it could separate the 2 classes.
Here are 2 sample distribution plots:
Red is positive class, and blue is negative class.
To judge whether a feature separates the 2 classes, I look for large regions where the red percentage is high and the blue percentage is low, or where the red percentage is low and the blue percentage is high.
Compared with many other features I have generated, which have large blue and red overlap, the first image shown above is already much better, so I considered that feature promising.
But my manager thinks it's a bad distribution; he thinks the second image above shows a better feature instead.
The problem is that, if you look at the y-axis, the overlapping percentage in the first image is not that high, while the overlapping area in the second image is not low.
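To put a number on the visual check described above, here is a minimal sketch of one way to do it. It assumes the plots are per-class normalized histograms and uses the summed bin-wise minimum as the overlap (0 means disjoint, 1 means identical). The function name `histogram_overlap`, the bin count, and the synthetic data are all made up for illustration; they are not my real features.

```python
import numpy as np

def histogram_overlap(pos, neg, bins=50):
    """Overlap of two class-normalized histograms: 0 = disjoint, 1 = identical."""
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    # Shared bin edges so both classes are compared on the same grid.
    edges = np.histogram_bin_edges(np.concatenate([pos, neg]), bins=bins)
    # Normalize each class by its own size, so the class imbalance
    # does not affect the comparison.
    p = np.histogram(pos, bins=edges)[0] / len(pos)
    q = np.histogram(neg, bins=edges)[0] / len(neg)
    return float(np.minimum(p, q).sum())

# Synthetic data (an assumption, for illustration only).
rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, size=80_000)
pos_shifted = rng.normal(3.0, 1.0, size=200)  # positives shifted away from negatives
pos_same = rng.normal(0.0, 1.0, size=200)     # positives indistinguishable from negatives

overlap_shifted = histogram_overlap(pos_shifted, neg)
overlap_same = histogram_overlap(pos_same, neg)
print(f"overlap (shifted feature): {overlap_shifted:.3f}")
print(f"overlap (useless feature): {overlap_same:.3f}")
```

A lower overlap suggests better separation on that single feature. With very few positives the estimate is noisy and depends on the bin count, so I would only use it to rank features roughly.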
For now I have decided to measure each feature's importance by running the model, for example by removing one feature at a time and checking the out-of-bag score from a random forest, or by checking the feature importances after training. However, the severe class imbalance makes it difficult to assess each feature's importance.
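For reference, this is roughly what I mean by the remove-one-feature check, as a minimal sketch assuming scikit-learn. The default out-of-bag score is accuracy, which is close to uninformative at this imbalance because predicting all negatives already scores near 100%, so the sketch scores the out-of-bag probabilities with average precision instead. That metric choice, the helper names, the hyperparameters, and the synthetic data (far less imbalanced than my real data) are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score

def oob_average_precision(X, y, seed=0):
    """Fit a random forest and score its out-of-bag probabilities."""
    rf = RandomForestClassifier(
        n_estimators=100,
        oob_score=True,                      # needed to get oob_decision_function_
        class_weight="balanced_subsample",   # reweight the rare positive class
        min_samples_leaf=5,
        random_state=seed,
    )
    rf.fit(X, y)
    scores = rf.oob_decision_function_[:, 1]
    # A sample that was never out of bag gets NaN; skip those.
    seen = ~np.isnan(scores)
    return average_precision_score(y[seen], scores[seen])

def drop_column_importance(X, y, seed=0):
    """Baseline OOB score minus the score after removing each column."""
    baseline = oob_average_precision(X, y, seed)
    drops = []
    for j in range(X.shape[1]):
        X_without = np.delete(X, j, axis=1)
        drops.append(baseline - oob_average_precision(X_without, y, seed))
    return baseline, np.array(drops)

# Synthetic data (an assumption): about 2% positives; with shuffle=False
# the first 2 columns are the informative ones.
X, y = make_classification(
    n_samples=5000, n_features=5, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.98], flip_y=0.0,
    shuffle=False, random_state=0,
)
baseline, drops = drop_column_importance(X, y)
print(f"baseline OOB average precision: {baseline:.3f}")
for j, d in enumerate(drops):
    print(f"feature {j}: drop = {d:+.3f}")
```

A larger drop means the forest relied more on that feature. Correlated features can hide each other here, since removing one lets the other take over.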
So the question is: is there a good way to explore whether a feature is promising BEFORE running the model, given such imbalanced data?
If we have to run the model to understand whether each feature is important, which methods are recommended for such imbalanced data?