I am trying to train a text classifier to identify stock-market-related news titles, and I am facing some issues when predicting on unseen data. It's a binary classifier (2 classes: stock market related or not related). My training set is roughly 400 stock market news titles and 600+ non-stock-market titles.
Problems I have noticed:
- It's picking up any news with a number or currency symbol in it as stock market related.
Though the training set has many other words like investment, market etc., it's still picking up news/article titles about sales and deals (e.g. text like: Walmart 16GB RAM $20.00)
- It also picks up news with numbers but no currency symbol
e.g. Changes in year 2018.
- Is this because of the short length of the "title"? Most of the positive training data has words very specific to the stock market, but it's still picking up totally unrelated news titles.
- Should I include more negative training data? Will that make it better? (With 400 positive cases and 2000+ negative cases, will this create any bias/imbalance?)
- Will removing numbers and currency symbols from training data help?
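
To make the last question concrete, here is a minimal sketch of the kind of preprocessing I have in mind, assuming plain Python and regular expressions. The function name `normalize_title` and the placeholder tokens `<num>` and `<cur>` are hypothetical names chosen for illustration, and the list of currency symbols is only a small sample. It can either drop numbers and currency symbols entirely or replace them with a generic token, so that the specific digits stop acting as features.

```python
import re

# Hypothetical placeholder tokens (illustrative, not from any library).
CURRENCY_TOKEN = "<cur>"
NUMBER_TOKEN = "<num>"

# A small sample of currency symbols; extend as needed for the data.
_CURRENCY_RE = re.compile(r"[$€£¥₹]")
# Digits, optionally grouped with '.' or ',' (e.g. 20.00 or 1,000).
_NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def normalize_title(title, drop=False):
    """Replace (or drop) currency symbols and numbers in a title."""
    cur = " " if drop else f" {CURRENCY_TOKEN} "
    num = " " if drop else f" {NUMBER_TOKEN} "
    text = _CURRENCY_RE.sub(cur, title)
    text = _NUMBER_RE.sub(num, text)
    # Collapse the extra whitespace introduced by the substitutions.
    return " ".join(text.split())


if __name__ == "__main__":
    for t in ["Walmart 16GB RAM $20.00", "Changes in year 2018."]:
        print(normalize_title(t), "|", normalize_title(t, drop=True))
```

The same function would have to be applied to both the training titles and the unseen titles before vectorizing, otherwise the two would no longer match.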