How to use tf-idf feature of scikit-learn?

tf-idf
python
scikit-learn

#1

I have a dataset where the entries in each variable's column span multiple values, so I figured the bag-of-words functionality in scikit-learn might help me convert them into usable features. However, I have never worked with it and need an example. I searched for blogs and articles on the topic but could not find anything useful. Please help.


#2

In scikit-learn, tf-idf weighting is implemented by the TfidfTransformer class in the sklearn.feature_extraction.text module.

Understanding tf-idf:
Typically, the tf-idf weight is composed of two terms: the first
computes the normalized Term Frequency (TF), i.e., the number of times a
word appears in a document, divided by the total number of words in that
document; the second term is the Inverse Document Frequency (IDF),
computed as the logarithm of the number of documents in the corpus
divided by the number of documents where the specific term appears.

TF:
Term Frequency measures how frequently a term occurs in a
document. Since documents differ in length, a term is likely to
appear many more times in long documents than in shorter
ones. Thus, the term frequency is often divided by the document length
(i.e., the total number of terms in the document) as a way of
normalization:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

IDF:
Inverse Document Frequency measures how important a term is.
While computing TF, all terms are considered equally important. However,
certain terms, such as “is”, “of”, and “that”, may
appear many times but carry little importance. Thus we need to weigh
down the frequent terms while scaling up the rare ones, by computing the
following:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
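The two formulas above can be sketched in plain Python. This is only an illustration of the textbook definitions, not scikit-learn's implementation; the function names and the three-document toy corpus are made up for the example.

```python
import math
from collections import Counter

def term_frequency(term, document_tokens):
    """TF(t): occurrences of the term divided by the total number of terms."""
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)

def inverse_document_frequency(term, corpus_tokens):
    """IDF(t): natural log of (total documents / documents containing the term).

    Raises ZeroDivisionError if the term appears in no document.
    """
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing)

# Hypothetical toy corpus, already split into tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

tf_cat = term_frequency("cat", corpus[0])             # 1 occurrence out of 6 tokens
idf_cat = inverse_document_frequency("cat", corpus)   # ln(3 / 2)
idf_the = inverse_document_frequency("the", corpus)   # ln(3 / 3) = 0
tfidf_cat = tf_cat * idf_cat
```

Note how "the", which appears in every document, gets an IDF of zero, while "cat" keeps a positive weight.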

See below for a simple example.

Example:

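Consider
a document containing 100 words wherein the word cat appears 3 times.
The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now,
assume we have 10 million documents and the word cat appears in one
thousand of these. Then, using a base-10 logarithm to keep the numbers
round, the inverse document frequency (i.e., idf) is calculated as
log10(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is
the product of these quantities: 0.03 * 4 = 0.12. (With the natural
logarithm from the formula above, the idf would be about 9.21 and the
weight about 0.28; the base only rescales all weights by a constant.)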
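The arithmetic in this example can be checked in a few lines. The idf comes out to 4 only with a base-10 logarithm, so the sketch below computes both bases for comparison; the variable names are just for illustration.

```python
import math

tf = 3 / 100                                   # "cat" appears 3 times in 100 words

idf_base10 = math.log10(10_000_000 / 1_000)    # base-10 log of 10,000
idf_natural = math.log(10_000_000 / 1_000)     # natural log of 10,000

weight_base10 = tf * idf_base10
weight_natural = tf * idf_natural

print(round(idf_base10, 2), round(weight_base10, 2))
print(round(idf_natural, 2), round(weight_natural, 2))
```

Changing the logarithm base multiplies every idf by the same constant, so the relative ranking of terms is unaffected.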

Implementing tf-idf in Python: (go to section 4.2.3.4. Tf–idf term weighting)
http://scikit-learn.org/stable/modules/feature_extraction.html

Hope this helps!


#3

Hey, thank you for this! The problem is that I already know this much, but I don’t know how to use it for classification on data with multiple columns of varied text. I mean, how do I set X and y for classification? Could you please help me with that? TIA