How to use data augmentation on an uneven multiclass dataset?



I have 12 classes (of images) with unevenly distributed data across them.

The image counts are as follows:

X1 = 16

X2 = 203

X3 = 192

X4 = 220

X5 = 172

X6 = 143

X7 = 22

X8 = 89

X9 = 31

X10 = 89

X11 = 10

X12 = 204

I am trying to train a CNN on this dataset. Should I apply data augmentation only to the classes with less data, or to all of the classes? Has anyone trained a similar model? Also, what CNN architecture should I use? I tried the model below (applying data augmentation to all classes), but I stopped partway through the first epoch since the accuracy was around 14%.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu',
                 input_shape=input_shape))  # e.g. input_shape = (150, 150, 3)
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(12, activation='softmax'))  # 12 classes
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])


Any help or tips would be appreciated; this has been giving me a hard time lately.


Hi @ZER-0-NE,

Accuracy doesn’t always depend on how many instances of each class are available. I recommend normalizing your images during preprocessing so the classes are more distinguishable from each other. I’d also advise getting a baseline score with a simple Dense architecture first; that will tell you whether the architecture itself needs tuning.
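For the normalization step, a minimal sketch with NumPy (the image array here is a random stand-in for your data; the per-channel standardization is optional):

```python
import numpy as np

# Hypothetical batch of images with pixel values in [0, 255]
images = np.random.randint(0, 256, size=(4, 150, 150, 3)).astype("float32")

# Scale pixel values to [0, 1] so training behaves better
images /= 255.0

# Optionally standardize per channel (zero mean, unit variance)
mean = images.mean(axis=(0, 1, 2), keepdims=True)
std = images.std(axis=(0, 1, 2), keepdims=True) + 1e-7
standardized = (images - mean) / std
```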

The other thing you can try is rotating the images of the classes with fewer samples by 90, 60 and 180 degrees, and adding the rotated copies as new images. That way you should end up with about 4x the data for those classes. Let me know if this helped.
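A minimal sketch of that rotation idea, assuming your class is stored as a `(n, height, width, channels)` NumPy array and using `scipy.ndimage.rotate` to handle the non-right-angle rotation:

```python
import numpy as np
from scipy.ndimage import rotate

def augment_with_rotations(images, angles=(90, 60, 180)):
    """Return the original images plus one rotated copy per angle.

    `images` is assumed to have shape (n, height, width, channels).
    """
    augmented = [images]
    for angle in angles:
        # axes=(1, 2) rotates in the image plane;
        # reshape=False keeps the output the same size as the input
        rotated = rotate(images, angle, axes=(1, 2), reshape=False, order=1)
        augmented.append(rotated)
    return np.concatenate(augmented, axis=0)

# Hypothetical minority class with only 10 images (like X11)
minority = np.random.rand(10, 150, 150, 3)
print(augment_with_rotations(minority).shape)  # (40, 150, 150, 3)
```

Note that rotating by 60 degrees introduces black corners unless you crop or fill, which is worth checking on a few samples before training.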



Hi @Shaz13
How do I get a baseline score using Keras? Can you point me somewhere where I can read more about it?
I applied augmentation to all my classes. Do I only need to apply it to the classes with fewer data?
Thanks a lot for your reply.


A baseline score is simply the score you get from a very simple model, which your real model should then beat. You could get one with an algorithm from sklearn, like Naive Bayes, or with a simple Dense-layer architecture in Keras.
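For example, a sklearn Naive Bayes baseline on flattened images might look like this (the random arrays here are stand-ins for your actual image data and labels):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 240 flattened 150x150 grayscale images, 12 classes
rng = np.random.default_rng(0)
X = rng.random((240, 150 * 150)).astype("float32")
y = rng.integers(0, 12, size=240)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A very simple model whose score is the bar the CNN should beat
clf = GaussianNB()
clf.fit(X_train, y_train)
print("baseline accuracy:", clf.score(X_test, y_test))
```

With 12 classes, random guessing sits around 8%, so any baseline clearly above that is a useful reference point.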

I wouldn’t recommend augmenting all classes, only the ones with fewer occurrences. Additionally, look into techniques like SMOTE.
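The real SMOTE implementation lives in the imbalanced-learn package (`imblearn.over_sampling.SMOTE`), but the core idea can be sketched in plain NumPy: synthesize new minority samples by interpolating between an existing sample and one of its nearest neighbours. This is a toy illustration, not a replacement for the library:

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=3, seed=0):
    """Toy SMOTE-style oversampling: make synthetic points by interpolating
    between a minority sample and one of its k nearest neighbours.

    X_minority: array of shape (n, d) of flattened minority-class samples.
    """
    rng = np.random.default_rng(seed)
    n = X_minority.shape[0]
    # Pairwise Euclidean distances among minority samples
    dists = np.linalg.norm(X_minority[:, None] - X_minority[None, :], axis=2)
    np.fill_diagonal(dists, np.inf)       # exclude each sample itself
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest per sample

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        j = neighbours[i, rng.integers(k)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# e.g. grow a 10-sample class (like X11) toward 50 samples
X11 = np.random.rand(10, 150 * 150)
new_samples = smote_like_oversample(X11, n_new=40)
print(new_samples.shape)  # (40, 22500)
```

One caveat: SMOTE interpolates in pixel space, which can produce blurry, unrealistic images, so for image data the rotation/flip style augmentation above is usually the safer first step.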