Judging whether a dataset is balanced

machine_learning
data_science
python

#1

When can we say a dataset is unbalanced?
My dataset is quite small: 527 rows, with 354 in class 1 and 173 in class 0. Is this considered an unbalanced dataset?
Also, how can I tell whether I have an overfitting problem, and what visualization could I use to check the results for overfitting?


#2

Hi @mayona,

The dataset, as you mentioned, is very small, but if you convert your numbers to percentages, you’d get a better understanding. Here we have about 67% positive class and 33% negative, which looks like unbalanced classes. (What percentage you would classify as unbalanced depends on your problem statement.)
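To make the percentage check concrete, here is a minimal sketch using only the standard library. The `labels` list is rebuilt from the counts given in the question (it is not the real data), and the variable names are just illustrative.

```python
from collections import Counter

# Labels rebuilt from the counts in the question: 354 rows of class 1, 173 of class 0.
labels = [1] * 354 + [0] * 173

counts = Counter(labels)
total = sum(counts.values())

# Share of each class as a fraction of all rows.
proportions = {cls: n / total for cls, n in counts.items()}

for cls, share in sorted(proportions.items()):
    print(f"class {cls}: {counts[cls]} rows ({share:.1%})")

# Ratio of the largest class to the smallest; 1.0 would mean perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

With real data you would pass your own label column to `Counter` instead of the rebuilt list.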

Regarding overfitting, have you created a validation set to understand the model's performance? Is the model giving approximately the same accuracy on the train and validation sets, or is the difference very large?


#3

Thanks for your answer. Regarding the validation set, how do I create one? Do you mean cross-validation?


#4

Since you have a small number of rows, it is preferable to use cross-validation. Then you can see whether the performance is the same on both train and validation. If the training accuracy is extremely high and the validation accuracy is much lower, then your model is overfitting.
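A rough sketch of that comparison with scikit-learn is below. Everything in it is an assumption for illustration: the data is synthetic (`make_classification` with about the same size and class ratio as yours), the model is an unconstrained decision tree picked because it tends to overfit, and 5 stratified folds is just a common default. Swap in your own `X`, `y` and model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: 527 rows, roughly 33% class 0 and 67% class 1.
X, y = make_classification(
    n_samples=527, n_features=10, weights=[0.33, 0.67], random_state=0
)

# An unconstrained tree is used on purpose: it can memorise the training folds.
model = DecisionTreeClassifier(random_state=0)

# Stratified folds keep the class ratio roughly the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# return_train_score=True also reports accuracy on the training part of each fold.
scores = cross_validate(
    model, X, y, cv=cv, scoring="accuracy", return_train_score=True
)

train_acc = scores["train_score"].mean()
val_acc = scores["test_score"].mean()
gap = train_acc - val_acc

print(f"mean train accuracy:      {train_acc:.3f}")
print(f"mean validation accuracy: {val_acc:.3f}")
print(f"gap (train - validation): {gap:.3f}")
```

A large gap between the two means points towards overfitting; how large is "too large" depends on your problem, so treat any threshold as a judgment call.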


#5

Thank you so much. I plotted the learning curve, but I could not understand the curve or tell whether the performance of my model is good or not.