Imbalanced Dataset and Scaling Variables



Hi Team,

I have the two queries below. Can you please help me with them?

Imbalanced Dataset
When working on churn prediction or credit risk, there are quite few defaulters, which causes imbalance in the dataset. I know a couple of methods to fix this, like a loss matrix or giving additional weight to the less frequent outcome. But how can we decide how much weight to give? For example, if I give 0.7 to non-defaulters and 0.3 to defaulters, how do I decide on these weights?

Scaling variables
In some scenarios, one variable has a large scale (10,000 - 1,000,000) and others have a small one (1 - 10). How can we normalize them? Can you please let us know the different methods (e.g. the log function), and also confirm whether we can use the z-score for the same purpose?

Thanks in advance!



For the imbalanced dataset, please refer to the nice blog post for different techniques.

For feature scaling, you can use two approaches to bring different features onto the same scale.

Normalization: rescaling the features to the range [0, 1]. One can apply min-max scaling:
x_norm^(i) = (x^(i) - x_min) / (x_max - x_min)

where x^(i) is the i-th sample, x_min is the smallest value in the feature column, and x_max is the largest.
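As a rough illustration of the min-max formula above, here is a minimal sketch in plain Python. The function name `min_max_scale` and the sample values for `income` and `rating` are made up for this example; they only mimic the two scales mentioned in the question.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1] using min-max scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical features on very different scales.
income = [10_000, 250_000, 1_000_000]
rating = [1, 5, 10]

income_scaled = min_max_scale(income)
rating_scaled = min_max_scale(rating)

print(income_scaled)
print(rating_scaled)
```

After scaling, both features lie in [0, 1]. Note that min-max scaling is sensitive to outliers, because a single extreme value sets x_max.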

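Standardization: this technique centers the feature column at mean 0 with standard deviation 1 (the same as the z-score). It does not change the shape of the distribution, so a skewed feature stays skewed; for a heavily skewed feature, such as one ranging from 10,000 to 1,000,000, a log transform can be applied first.
x_std^(i) = (x^(i) - mu) / sigma
where mu is the feature mean and sigma is the feature standard deviation.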
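The z-score formula can be sketched in plain Python as follows, together with the log transform the question asks about. The function name `standardize` and the sample `income` values are invented for illustration, and using the population standard deviation (rather than the sample one) is an assumption of this sketch.

```python
import math
import statistics


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mu = statistics.fmean(values)
    # Assumption: population standard deviation (divide by n, not n - 1).
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # A constant feature has no spread; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical wide-range feature.
income = [10_000, 100_000, 1_000_000]

# A log transform compresses the range before standardizing.
income_log = [math.log10(v) for v in income]

income_z = standardize(income)
income_log_z = standardize(income_log)

print(income_z)
print(income_log_z)
```

Both outputs have mean 0 and standard deviation 1, but only the log-transformed version spaces the three values evenly, since they differ by constant factors of 10.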

Hope this helps.


Hi, you can undersample the majority (over-represented) class or oversample the minority (under-represented) class to balance the classes in the dataset. You can also explore options like SMOTE.
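To make the oversampling idea concrete, and to give one possible starting point for the class-weight question, here is a minimal sketch in plain Python. The helper names `balanced_class_weights` and `random_oversample` and the 90/10 toy data are assumptions made for this example. The weights use a common inverse-frequency heuristic, n_samples / (n_classes * class_count), which is only a default to begin from; in practice the weights are usually tuned against a validation metric. This is plain random oversampling with duplication, not SMOTE, which creates synthetic samples instead.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency heuristic: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, cnt in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - cnt)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data: 90 non-defaulters (0) and 10 defaulters (1).
labels = [0] * 90 + [1] * 10
rows = [[i] for i in range(100)]

weights = balanced_class_weights(labels)
new_rows, new_labels = random_oversample(rows, labels)

print(weights)
print(Counter(new_labels))
```

With this heuristic the rarer class gets the larger weight, which is the opposite of the 0.7 / 0.3 split in the question. Oversampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.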