I have a dataset where each column for a variable contains multiple text entries, so I figured the bag-of-words functionality in scikit-learn might help me convert these into usable features. However, I have never worked with it and need an example of how to use it. I searched for blogs and articles on the topic but couldn't find anything useful. Please help.

# How to use tf-idf feature of scikit-learn?

This normalization is implemented by the TfidfTransformer class in scikit-learn's sklearn.feature_extraction.text module.

Understanding tf-idf:

Typically, the tf-idf weight is composed of two terms: the first computes the normalized Term Frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second term is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.

TF:

Term Frequency measures how frequently a term occurs in a document. Since every document differs in length, a term is likely to appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
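This definition can be sketched in a few lines of Python (the six-word document here is made up for illustration):

```python
from collections import Counter

def term_frequency(term, document_tokens):
    """TF(t) = (count of t in the document) / (total number of terms)."""
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)

# A made-up six-word document in which "the" occurs twice.
doc = "the cat sat on the mat".split()
print(term_frequency("the", doc))  # 2 / 6 ≈ 0.333
```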

IDF:

Inverse Document Frequency measures how important a term is. When computing TF, all terms are treated as equally important, but certain terms, such as "is", "of", and "that", may appear many times yet carry little information. We therefore weigh down the frequent terms and scale up the rare ones by computing the following:

IDF(t) = log(Total number of documents / Number of documents with term t in it). (The base of the logarithm only rescales the weights; scikit-learn's TfidfTransformer uses the natural logarithm.)
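This, too, can be sketched directly from the definition (the three-document corpus is invented; note the formula is undefined for a term that appears in no document):

```python
import math

def inverse_document_frequency(term, corpus):
    """IDF(t) = log(number of documents / documents containing t).

    Assumes the term occurs in at least one document; otherwise the
    division is undefined.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

# A made-up corpus of three tokenized documents; "cat" is in two of them.
corpus = [
    "the cat sat".split(),
    "the dog ran".split(),
    "a cat and a dog".split(),
]
print(inverse_document_frequency("cat", corpus))  # log(3/2) ≈ 0.405
```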

See below for a simple example.

Example:

Consider a document containing 100 words in which the word cat appears 3 times. The term frequency (tf) for cat is then 3 / 100 = 0.03. Now assume we have 10 million documents and the word cat appears in one thousand of them. The inverse document frequency (idf) is then calculated as log(10,000,000 / 1,000) = 4 (using a base-10 logarithm here). The tf-idf weight is the product of these quantities: 0.03 × 4 = 0.12.
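The arithmetic in this example can be checked directly:

```python
import math

# Checking the worked example above (it uses a base-10 logarithm).
tf = 3 / 100                          # cat appears 3 times in 100 words
idf = math.log10(10_000_000 / 1_000)  # cat is in 1,000 of 10M documents
print(tf, idf, round(tf * idf, 2))    # 0.03 4.0 0.12
```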

Implementing tf-idf in Python: (go to section 4.2.3.4, Tf–idf term weighting)

http://scikit-learn.org/stable/modules/feature_extraction.html

Hope this helps!


**rohanpota**#3

Hey, thank you for this! But the problem is that I already know this; what I don't know is how to use it for classification with multiple columns of varied text data. I mean, how do I set X and y for classification? Could you please help me with that? TIA