Multi-class text classification

Hi Team,
When we build a text classifier and compute correlations to check whether there is a signal,
we get output like the example below:

'Bank account or service':
  . Most correlated unigrams:
    . bank
    . overdraft
  . Most correlated bigrams:
    . overdraft fees
    . checking account

'Consumer Loan':
  . Most correlated unigrams:
    . car
    . vehicle
  . Most correlated bigrams:
    . vehicle xxxx
    . toyota financial

'Credit card':
  . Most correlated unigrams:
    . citi
    . card
  . Most correlated bigrams:
    . annual fee
    . credit card

'Credit reporting':
  . Most correlated unigrams:
    . experian
    . equifax
  . Most correlated bigrams:
    . trans union
    . credit report

Does this mean that, when we predict the category/label of new text, these words must be present in the new text? Please confirm.

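Largely, yes. A bag-of-words model can only use words it saw during training, so your new text should contain words that are present in your training data to classify well. It does not need the exact top-correlated terms listed above, since the classifier weighs every word in its vocabulary, but those terms are the strongest signals. Naive Bayes, SVM, Random Forest, etc. are some classic models you can look at for this task.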
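To make this concrete, here is a minimal sketch of a bag-of-words Naive Bayes classifier in plain Python. The training sentences and the helper names (`train_naive_bayes`, `predict`) are made up for illustration and are not from your data; a real project would more likely use a library implementation. The point is only to show that words outside the training vocabulary contribute nothing to the prediction.

```python
import math
from collections import Counter, defaultdict

# Toy training data, invented for illustration only.
TRAIN = [
    ("the bank charged overdraft fees on my checking account", "Bank account or service"),
    ("overdraft fees were taken from my bank account", "Bank account or service"),
    ("my car loan from toyota financial has the wrong balance", "Consumer Loan"),
    ("the vehicle loan payment was applied late", "Consumer Loan"),
    ("citi charged an annual fee on my credit card", "Credit card"),
    ("my credit card was charged a late fee", "Credit card"),
]

def train_naive_bayes(examples):
    """Count words per label and collect the training vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(text, word_counts, label_counts, vocab):
    """Return (best_label, known_words, unknown_words) for a new text."""
    tokens = text.lower().split()
    known = [t for t in tokens if t in vocab]
    unknown = [t for t in tokens if t not in vocab]  # ignored by the model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)  # log prior
        for t in known:
            # Add-one (Laplace) smoothing so a zero count does not zero out the label.
            score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get), known, unknown

model = train_naive_bayes(TRAIN)
label, known, unknown = predict("I was charged fees on my checking account", *model)
print(label)    # predicted category
print(unknown)  # words the model has never seen
```

The new sentence contains neither "bank" nor "overdraft", yet the remaining known words ("fees", "checking", "account") still carry the signal. If every word in the new text were unseen, `known` would be empty and the prediction would fall back on the class priors alone, which tells you nothing about the text.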

If you want the model to cope with new words too, you can use word embeddings pre-trained on a corpus that is not limited to your training data. Such embeddings carry information about words that may appear in future text even though they never occurred in your training data. But beware of the following:

  1. Models using word embeddings are much more complex than models such as Naive Bayes or SVM.
  2. Even with pre-trained word embeddings, the words in your future data must still be in the embedding vocabulary; a word with no pre-trained vector is just as unknown to the model.
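The second caveat can be sketched as follows. The `EMBEDDINGS` table below is a hand-made stand-in with invented 3-dimensional vectors, not a real pre-trained model, and `embed_text` and `cosine` are hypothetical helper names. It assumes the common approach of averaging word vectors to represent a text, which is only one of several options.

```python
import math

# Toy vectors standing in for pre-trained embeddings.
# The numbers are invented; real vectors would be loaded from a pre-trained model.
EMBEDDINGS = {
    "car": [0.9, 0.1, 0.0],
    "vehicle": [0.8, 0.2, 0.0],
    "truck": [0.85, 0.15, 0.0],  # assume "truck" never occurred in the training data
    "card": [0.0, 0.9, 0.1],
    "loan": [0.5, 0.4, 0.1],
}

def embed_text(text, embeddings):
    """Average the vectors of tokens that have an embedding; report those that do not."""
    tokens = text.lower().split()
    found = [embeddings[t] for t in tokens if t in embeddings]
    missing = [t for t in tokens if t not in embeddings]
    if not found:
        return None, missing  # nothing to average: the text is entirely out of vocabulary
    dim = len(found[0])
    average = [sum(vec[i] for vec in found) / len(found) for i in range(dim)]
    return average, missing

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# A word absent from training can still land near related training words.
sim_truck_car = cosine(EMBEDDINGS["truck"], EMBEDDINGS["car"])
sim_truck_card = cosine(EMBEDDINGS["truck"], EMBEDDINGS["card"])
print(sim_truck_car > sim_truck_card)

# A word with no pre-trained vector is dropped, just like an unseen word in bag-of-words.
vector, missing = embed_text("truck xxxx", EMBEDDINGS)
print(missing)
```

In this toy setup "truck" sits close to "car", so a classifier trained on the averaged vectors could plausibly generalize to it, while "xxxx" has no vector and is simply dropped. Checking how many tokens end up in `missing` on your own data is a quick way to see how much coverage a given set of pre-trained embeddings gives you.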
© Copyright 2013-2019 Analytics Vidhya