Which algorithm is best for text classification into 400 categories?



I have around 13,000 rows of health care text data. Each row has a raw text column containing 1-10 sentences and a category column holding one of 400 categories (classes). Which classifier algorithms should I try on this data?

Some categories are independent, while others are somewhat related. The data is also not uniformly distributed across categories: around 40 of them have less data than the rest.
I am attaching the log probabilities of each class here.



You can try either of the following methods to build a classifier here:

  1. Naive Bayes classifier based on bag of words
  2. Support vector machines (SVM)
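
To make the first option concrete, here is a minimal sketch of a multinomial Naive Bayes classifier on bag-of-words counts, written in plain Python so the arithmetic is visible. The function names (`tokenize`, `train_nb`, `predict_nb`), the whitespace tokenizer, the add-one smoothing value, and the four toy rows are all assumptions made for illustration, not anything from the original data. On the real 13,000 rows a library implementation would be the more sensible choice.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase + whitespace split; real text needs better cleaning.
    return text.lower().split()


def train_nb(texts, labels, alpha=1.0):
    """Collect the counts needed for multinomial Naive Bayes with add-alpha smoothing."""
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "alpha": alpha,
        "vocab": vocab,
        "word_counts": word_counts,
        # Log prior of each class: log(documents in class / all documents).
        "log_prior": {c: math.log(n / len(labels)) for c, n in docs_per_class.items()},
        # Total number of word tokens seen in each class.
        "class_totals": {c: sum(word_counts[c].values()) for c in docs_per_class},
    }


def predict_nb(model, text):
    """Return the class with the highest log posterior for the given text."""
    # Words never seen in training are ignored.
    tokens = [t for t in tokenize(text) if t in model["vocab"]]
    alpha, vocab_size = model["alpha"], len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, log_prior in model["log_prior"].items():
        denom = model["class_totals"][label] + alpha * vocab_size
        score = log_prior
        for t in tokens:
            score += math.log((model["word_counts"][label][t] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy rows standing in for the real health care data.
texts = [
    "patient reports chest pain and shortness of breath",
    "chest pain radiating to left arm",
    "itchy red rash on both arms",
    "skin rash with mild itching",
]
labels = ["cardiology", "cardiology", "dermatology", "dermatology"]

model = train_nb(texts, labels)
print(predict_nb(model, "severe chest pain"))
print(predict_nb(model, "rash on arms"))
```

Working in log space avoids underflow when many word probabilities are multiplied, and the smoothing term keeps a word that is unseen in one class from zeroing out that class entirely, which matters for the categories with little data.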

You can also look up details on information retrieval systems. Here is a good article from @tavish_srivastava to get started:

Hope this helps.



I tried a Naive Bayes classifier based on bag of words. Can anyone suggest how many features (words) I should use?

Has anyone tried Gaussian Naive Bayes with TF-IDF features?
Which SVM would be good?
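
On the question of how many words to use: there is probably no single right number, so one reasonable approach is to treat the vocabulary size as a tunable setting and compare a few values on held-out rows. The sketch below only shows the mechanical part, keeping the k words with the highest document frequency and measuring how much of some held-out text that vocabulary covers. The helper names (`top_k_vocabulary`, `coverage`) and the toy rows are hypothetical, and coverage is only a cheap proxy; the final choice is better made by comparing classifier accuracy on held-out data for each k.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split.
    return text.lower().split()


def top_k_vocabulary(texts, k):
    """Keep the k words that appear in the most documents (ties broken alphabetically)."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(tokenize(text)))  # count each word once per document
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def coverage(vocab, texts):
    """Fraction of tokens in `texts` that belong to `vocab`."""
    vocab = set(vocab)
    tokens = [t for text in texts for t in tokenize(text)]
    if not tokens:
        return 0.0
    return sum(t in vocab for t in tokens) / len(tokens)


# Hypothetical toy rows; replace with a train / held-out split of the real data.
train_texts = [
    "patient reports chest pain and shortness of breath",
    "chest pain radiating to left arm",
    "itchy red rash on both arms",
    "skin rash with mild itching",
]
held_out_texts = ["chest pain and rash", "itching on arm"]

for k in (3, 10, 22):
    vocab = top_k_vocabulary(train_texts, k)
    print(k, round(coverage(vocab, held_out_texts), 3))
```

With 400 categories and some of them rare, cutting the vocabulary too hard can remove the few words that identify a small category, so it is worth checking per-category results as well as the overall score when comparing values of k.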