I am currently applying TF-IDF through Python's sklearn library. When I fit it on a dataset containing one million rows of news articles (averaging 400 words per row) plus titles (averaging 100 words), I get a 100 GB training file (and that is on 500k entries, not the full million), which seems huge. I have seen posts where people applied TF-IDF to many millions of articles. On top of that, it gives me a memory error since all my RAM and swap space gets consumed (32 GB RAM + 120 GB swap). Anyone with TF-IDF experience, kindly guide me: am I doing something wrong (which I suppose I am)? What are the possible issues and how can I resolve them?
This is happening because you are using all the unique words in your data to prepare the TF-IDF matrix. For example, if you have n documents and k unique words in those documents, then the shape of the TF-IDF matrix will be n x k.
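To see this n x k relationship concretely, here is a minimal sketch (toy documents, default settings) showing that the matrix width equals the number of unique words:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# 3 toy documents containing 5 unique words in total:
# alpha, beta, gamma, delta, epsilon
docs = [
    "alpha beta gamma",
    "beta gamma delta",
    "gamma delta epsilon",
]

X = TfidfVectorizer().fit_transform(docs)
print(X.shape)  # (3, 5): n documents x k unique words
```

With a million news articles, k easily reaches hundreds of thousands of unique terms, which is what blows up the matrix.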
A quick workaround for this is to limit the vocabulary using the max_df, min_df, and max_features parameters of sklearn's TfidfVectorizer: max_df drops terms that appear in more than the given fraction of documents (common, uninformative words), min_df drops terms that appear in fewer than the given number of documents (rare words and typos), and max_features caps the vocabulary at the top-k most frequent terms.