I am going through the term frequency (TF) and inverse document frequency (IDF) representations used with the bag-of-words technique in sklearn. There are two kinds of representation available: one tells you how frequent a word is within a phrase or document (TF), and the other reflects how frequent the word is across the whole collection of documents (IDF).
My question is: why do we weight rare words more heavily in the IDF representation? Aren’t they supposed to be some sort of outliers?
Thanks in advance
As per this article,
The idea behind making use of document frequency is that rare terms
are more informative than frequent terms. So if you remember earlier on
when we talked about stop words, which were words like “the” “and” “to”
and “of”, and so the idea was that these words were so common, so
semantically empty that we didn’t have to include them in our
information retrieval system at all. They had no effect on how good a
match a document was to a query.
IDF helps to bring out the unique aspects of each document, i.e. how well we can differentiate one article from another. It would be wrong to call rare words outliers.
For example: you have the task of collecting articles on Analytics Vidhya, and right now there are just two candidate documents. Documents 1 and 2 differ only in that Document 2 contains the word “vidhya”. Intuitively, you would want to recommend Document 2.
Focusing on “vidhya” is what IDF does.
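To make this concrete, here is a minimal sketch in plain Python. The two toy documents are made up for illustration, and the formula is the smoothed IDF that scikit-learn documents for `TfidfVectorizer` with its default `smooth_idf=True`, written out by hand rather than by calling the library.

```python
import math

# Two toy documents (made up for illustration); they differ only by "vidhya".
docs = [
    "articles on analytics",
    "articles on analytics vidhya",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Document frequency: the number of documents that contain each word.
df = {word: sum(word in tokens for tokens in tokenized) for word in vocabulary}

# Smoothed IDF: idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

for word in vocabulary:
    print(f"{word:10s} df={df[word]}  idf={idf[word]:.3f}")
```

The words shared by both documents get an IDF of 1.0, while “vidhya”, which appears in only one document, gets about 1.405. That higher weight is what lets the rare word separate Document 2 from Document 1.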
Hope this helps.