Omit sentences that aren't related to either class (Binary Text Classification)

Hi,

I am building a binary text classifier in Python to classify sentences in research papers (cell culture medium research papers), testing out the common algorithms for binary classification such as linear SVC and logistic regression. The model shows high accuracy when trained with the collected data, but when I try it on a complete research paper, many common sentences are irrelevant (they do not belong to either class), and the model still assigns each of them one of the two classes. How should I handle those irrelevant sentences?

Thanks in advance

Hi,
Are you classifying sentences or words? Working with text data requires a lot of preprocessing.
If you are talking about irrelevant words, those might be stop words, which are typically the most common words in a language. Many text mining packages offer easy ways to remove them, or you can supply a custom list of stop words suited to your data and remove those.
Follow the same preprocessing steps for both training and test data.
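
The stop-word advice above could look roughly like this with scikit-learn's TfidfVectorizer. This is only a sketch: the example sentences and the extra stop words ("fig", "et", "al") are made up for illustration and are not from the original poster's data. The vectorizer is fitted on the training sentences only and then reused on the test sentences, so both go through the same preprocessing.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Hypothetical example sentences (not the original poster's data).
train_sentences = [
    "The medium was supplemented with glucose and glutamine.",
    "Cells were cultured in the serum-free medium for three days.",
]
test_sentences = ["The cells were grown in the medium with added glucose."]

# Built-in English stop words plus a hypothetical domain-specific list.
custom_stop_words = list(ENGLISH_STOP_WORDS.union({"fig", "et", "al"}))

vectorizer = TfidfVectorizer(lowercase=True, stop_words=custom_stop_words)

# Learn the vocabulary and IDF weights from the training data only ...
X_train = vectorizer.fit_transform(train_sentences)
# ... and apply exactly the same preprocessing to the test data.
X_test = vectorizer.transform(test_sentences)
```

Calling transform (not fit_transform) on the test sentences keeps the vocabulary fixed, so the training and test matrices have the same columns.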

Let us know if I answered your query.

Hey,
Thanks for the response. I am iterating through the sentences and classifying each one using its words via a TF-IDF generated vocabulary. I have done a lot of preprocessing and removed all the stop words as well. The problem is that some sentences in the research paper don't fall into either of the categories I'm training on, yet the model will always output one of the two classes I have trained. For those sentences the output should be nothing, or they should be skipped. Is there a workaround for this?

Thanks.
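
One possible workaround, offered as a sketch and not as something confirmed in this thread, is to skip a sentence whenever the classifier's top predicted probability falls below a cutoff. The helper predict_or_skip, the threshold of 0.8, the class names and the training sentences below are all hypothetical; the threshold in particular would need tuning on held-out data. Another option along the same lines is to collect irrelevant sentences and train them as a third "other" class.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def predict_or_skip(model, sentences, threshold=0.8):
    """Return one label per sentence, or None when the model is not confident.

    `threshold` is a hypothetical cutoff on the highest class probability.
    """
    probabilities = model.predict_proba(sentences)
    best_index = probabilities.argmax(axis=1)
    confidence = probabilities.max(axis=1)
    return [
        model.classes_[i] if c >= threshold else None
        for i, c in zip(best_index, confidence)
    ]


# Hypothetical training data with two made-up classes.
train_sentences = [
    "The medium was supplemented with glucose and glutamine.",
    "Insulin and transferrin were added to the basal medium.",
    "Cells were incubated at 37 degrees with five percent carbon dioxide.",
    "Cultures were maintained at pH 7.2 under constant agitation.",
]
train_labels = ["component", "component", "condition", "condition"]

# TF-IDF features feeding a logistic regression, which exposes predict_proba.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(train_sentences, train_labels)

paper_sentences = [
    "Glucose was added to the medium.",
    "We thank the reviewers for their helpful comments.",
]
results = predict_or_skip(model, paper_sentences, threshold=0.8)
```

Sentences whose entry in results is None would be skipped. Logistic regression is used here because it provides probabilities directly; LinearSVC has no predict_proba, so it would need its decision_function scores or a calibration wrapper instead.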

© Copyright 2013-2020 Analytics Vidhya