Lemmatizing a dataframe using NLTK

python

#1

I was trying to lemmatize a dataframe. It converts plural words into singular, but I also need to reduce words to their root form, e.g. blessing -> bless, ran -> run, reached -> reach.

Below is the sample program I tried.

import nltk
import pandas as pd

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Split on whitespace and lemmatize each token
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

df = pd.DataFrame(['this was cheesy', 'she likes these books', 'wow this is great blessing'], columns=['text'])
print(df)
df['text_lemmatized'] = df.text.apply(lemmatize_text)
print(df)


#2

Hi @prakash6654,

You could use the Snowball Stemmer for this, as lemmatization cannot always get to the root word. In my experience, the Snowball Stemmer gets this done.

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
print(stemmer.stem("Blessing"))  # bless
print(stemmer.stem("reached"))   # reach

Hope this helped. Thanks!


#3

Thanks for your reply. I tried this already and it works for a string value, but for a data frame it's not working.


#4

Hi @prakash6654,
Can you paste the complete code so that I can look into it and help? Thanks!


#5

import nltk
import pandas as pd

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

df = pd.DataFrame(['this was cheesy blessing', 'she likes these books', 'wow this is great amazing'], columns=['text'])
print(df)

df['text_lemmatized'] = df.text.apply(lemmatize_text)
print(df['text_lemmatized'])


#6

I found my error: in the function lemmatize_text, I missed passing the part-of-speech argument ('v' for verb) to lemmatize.

return [lemmatizer.lemmatize(w, 'v') for w in w_tokenizer.tokenize(text)]

Now 'blessing' comes out as 'bless' in the data frame.

Referred: https://rustyonrampage.github.io/text-mining/2017/11/23/stemming-and-lemmatization-with-python-and-nltk.html
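Putting it all together, here is a minimal sketch of the corrected script for anyone landing here later. It uses the same sample sentences as post #5 and assumes the WordNet data has already been downloaded once via nltk.download.

import nltk
import pandas as pd

# WordNetLemmatizer needs the WordNet corpus; uncomment on first run.
# nltk.download('wordnet')

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # pos='v' tells WordNet to treat each token as a verb,
    # so 'blessing' -> 'bless' and 'likes' -> 'like'.
    return [lemmatizer.lemmatize(w, pos='v') for w in w_tokenizer.tokenize(text)]

df = pd.DataFrame(['this was cheesy blessing',
                   'she likes these books',
                   'wow this is great amazing'],
                  columns=['text'])

df['text_lemmatized'] = df.text.apply(lemmatize_text)
print(df['text_lemmatized'])

Note that forcing pos='v' treats every token as a verb, which can give odd results for nouns; for mixed text you could tag each token first (e.g. with nltk.pos_tag) and map the tag to the corresponding WordNet part of speech before lemmatizing.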