Best approach for text classification on an imbalanced dataset?



I am creating a text classification model. There are 50 classes in total; the majority classes have around ~1000 samples each, while the minority classes have only ~20-50 samples in the training data.

I have built a bag-of-n-grams based Naive Bayes classifier, but its accuracy is not good. For example:

If word “x” is a top feature of a majority class and a weak feature of a minority class, then most of the time any new test example containing “x” will be classified into the majority class, even when it does not belong there.
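For concreteness, this failure mode can be reproduced with a minimal bag-of-n-grams + Multinomial Naive Bayes sketch (assuming scikit-learn; the toy corpus and word names are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy imbalanced corpus: 6 majority-class docs vs. 2 minority-class docs.
# The word "x" appears in both classes but is far more frequent in the majority.
train_texts = [
    "x alpha beta", "x beta gamma", "x alpha gamma",   # majority
    "x beta alpha", "x gamma beta", "x alpha alpha",   # majority
    "x delta", "delta epsilon",                        # minority
]
train_labels = ["maj"] * 6 + ["min"] * 2

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),  # bag of unigrams + bigrams
    MultinomialNB(),                      # class priors learned from imbalanced data
)
clf.fit(train_texts, train_labels)

# A test doc whose only feature is the shared word "x" still lands in the
# majority class, driven by the 6/8 prior and the higher count of "x" there.
print(clf.predict(["x"]))                # -> ['maj']
print(clf.predict(["alpha beta gamma"]))  # -> ['maj']
```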

What are some good techniques for text classification in such cases? Are there any other improvements that can be made?


Hi @shivam5992, I’m not well versed in text analytics, but I suggest you go through these links: the paper “Addressing the problem of Unbalanced Data sets in Sentiment Analysis”, and a Stack Overflow answer on dealing with imbalanced text datasets.


Hi Shivam,

There are several methods you can try to handle this situation, but a lot depends on the distribution of the classes and their sample counts.

Various methods to handle imbalance can be classified in one of the following categories:

  1. Undersampling
  2. Oversampling
  3. Synthetic data generation
  4. Cost sensitive learning

You can refer to this article for more details on each of these categories:
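As a quick illustration, two of the four categories above — oversampling (2) and cost-sensitive learning (4) — can be sketched with scikit-learn (toy numeric data stands in for text features; names and numbers are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Toy imbalanced data: 20 majority samples vs. 4 minority samples.
rng = np.random.RandomState(0)
X_maj = rng.normal(0.0, 1.0, size=(20, 5))
X_min = rng.normal(2.0, 1.0, size=(4, 5))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 20 + [1] * 4)

# (2) Oversampling: resample the minority class with replacement
# until it matches the majority class size.
X_min_up, y_min_up = resample(
    X_min, y[y == 1], replace=True, n_samples=20, random_state=0
)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])

# (4) Cost-sensitive learning: leave the data as-is and weight
# errors on the minority class more heavily instead.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

Either route can be combined with any downstream classifier; `class_weight="balanced"` is the lighter-weight option since it needs no data duplication.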

Hope this helps


Thanks Kunal for the suggestion. Sampling techniques are good, but in my case the distribution looks like this:

Class 1: 1000 training examples
Class 2: 700 training examples
Class 3: 400 training examples
...
Class 50: 5 training examples

Which sampling procedure would work well here?
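One alternative to pure sampling for a long tail like this is cost-sensitive weighting; as a sketch (assuming scikit-learn), inverse-frequency class weights can be computed directly from counts mirroring the distribution above (classes 4 through 49 omitted):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels mirroring the distribution above (classes 4..49 omitted).
counts = {1: 1000, 2: 700, 3: 400, 50: 5}
y = np.concatenate([np.full(n, c) for c, n in counts.items()])
classes = np.array(sorted(counts))

# "balanced" weights: n_samples / (n_classes * class_count)
weights = compute_class_weight("balanced", classes=classes, y=y)
print({int(c): float(w) for c, w in zip(classes, np.round(weights, 2))})
# -> {1: 0.53, 2: 0.75, 3: 1.32, 50: 105.25}
```

Here Class 50 is weighted ~200x more than Class 1; these weights can be passed to a classifier's `class_weight` parameter instead of resampling the data.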


Hi Shivam,

I have the same issue with a text classification problem. Did you find any solutions?


Try using word embeddings. Dense representations tend to generalise better than sparse n-gram features, which helps the classes with few examples.
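To make this concrete, here is a minimal embedding-averaging sketch; the tiny hand-made 2-d vectors below are purely illustrative stand-ins for real pretrained embeddings (word2vec, GloVe, fastText):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 2-d word vectors; in practice, load pretrained embeddings instead.
embeddings = {
    "good":  np.array([1.0, 0.0]),
    "great": np.array([0.9, 0.1]),
    "bad":   np.array([0.0, 1.0]),
    "awful": np.array([0.1, 0.9]),
}

def doc_vector(text):
    # Average the vectors of in-vocabulary words; a zero vector
    # stands in for documents with no known words.
    vecs = [embeddings[w] for w in text.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

train_texts = ["good great", "great good", "bad awful", "awful bad"]
train_labels = [1, 1, 0, 0]

X = np.stack([doc_vector(t) for t in train_texts])
# class_weight="balanced" keeps the imbalance handling in the loop.
clf = LogisticRegression(class_weight="balanced").fit(X, train_labels)
print(clf.predict([doc_vector("good")]))  # -> [1]
```

Averaging is the simplest pooling choice; it loses word order, but shared embedding space means rare-class words still get meaningful vectors.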