Short text classification - training data

text-classification

#1

I am trying to train a text classifier to identify stock-market-related news titles and am facing some issues with predictions on unseen data. It's a binary classifier (2 classes: stock market related or not related). My training set is roughly 400 stock market news titles and 600+ non-stock-market titles.

Problems I have noticed -

  • It's classifying any news title containing a number or currency symbol as stock market related.
    Though the training set has many other words like investment, market etc., it's still picking up news/article titles about sales/deals (e.g. text like: Walmart 16GB RAM $20.00)
  • It also picks up news titles with numbers but no currency symbol
    e.g.: Changes in year 2018.
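
One way to check whether the training set itself explains this behaviour is to compare how often digits or currency symbols appear in each class. Below is a minimal sketch of that check. The titles, the label names ("stock"/"other"), the set of currency symbols, and the helper name numeric_rate_by_label are all hypothetical and only for illustration, not my real data.

```python
import re
from collections import Counter

# Matches any digit or a common currency symbol (the symbol set is an assumption).
NUMERIC_OR_CURRENCY = re.compile(r"[0-9$€£¥₹]")


def numeric_rate_by_label(titles, labels):
    """Return, per label, the fraction of titles containing a digit or currency symbol."""
    totals = Counter(labels)
    hits = Counter(
        label
        for title, label in zip(titles, labels)
        if NUMERIC_OR_CURRENCY.search(title)
    )
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical toy data, standing in for the real training set.
titles = [
    "Sensex gains 300 points as banks rally",
    "Shares of XYZ fall 5% after earnings",
    "Investors wary as market opens flat",
    "Walmart 16GB RAM $20.00",
    "Local team wins championship",
    "New park opens downtown",
]
labels = ["stock", "stock", "stock", "other", "other", "other"]

rates = numeric_rate_by_label(titles, labels)
for label, rate in rates.items():
    print(f"{label}: {rate:.2f} of titles contain a number or currency symbol")
```

If the rate is much higher for the positive class than for the negative class, the classifier may simply have learned "contains a number" as a shortcut feature.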

Questions -

  • Is this because of the short length of the “title”? Most of the positive training data has words very specific to the stock market, but it's still picking up totally unrelated news titles.
  • Should I include more negative training data? Will that make it better? (400 positive cases and 2000+ negative cases: will this create any bias/imbalance?)
  • Will removing numbers and currency symbols from training data help?
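
To make the last two questions concrete, here is a rough sketch of what I have in mind. It is only an illustration under assumptions: the placeholder words "numtoken" and "moneytoken", the currency symbol set, and the function names are hypothetical. The weighting formula is the usual "balanced" heuristic, n_samples / (n_classes * class_count), and the counts are the 400/2000 split mentioned above.

```python
import re

# Currency amount such as "$20.00" or "€1,500" (the symbol set is an assumption).
CURRENCY_AMOUNT = re.compile(r"[$€£¥₹]\s?\d+(?:,\d{3})*(?:\.\d+)?")
# Plain number such as "2018", "16" or "1,500.25".
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def normalize_title(title):
    """Replace currency amounts and numbers with placeholder words.

    The placeholders are plain alphabetic words so that a typical
    word-level tokenizer keeps them as single tokens.
    """
    title = CURRENCY_AMOUNT.sub(" moneytoken ", title)
    title = NUMBER.sub(" numtoken ", title)
    return " ".join(title.split())  # collapse repeated whitespace


def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


print(normalize_title("Walmart 16GB RAM $20.00"))
print(normalize_title("Changes in year 2018."))
print(balanced_class_weights({"stock": 400, "other": 2000}))
```

Replacing numbers with a placeholder (instead of deleting them) keeps the information that a number was present while stopping the model from memorising specific values. With 400 positive and 2000 negative titles, the balanced weights come out to 3.0 for the positive class and 0.6 for the negative class, which is one way to offset the imbalance without discarding data.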