NLP Tokenization

Hi Team

Architecture Experience - 5 years minimum across multiple disciplines (Agile Methodologies, Scrum Framework, Risk management, Technology planning, design, development, System Integration).

When I tokenise the above text, I get the results below:

Agile
Methodologies
Scrum
Framework
Risk
management,
Technology
planning,
design,
development,
System
Integration

**What is expected**

Agile Methodologies,
Scrum Framework,
Risk management,
Technology planning,
System Integration

Please help me achieve the above.

Regards,
tony

Hi @train.bi
It took a while to figure this out. Below is a code snippet that achieves this. I used the NLTK package for this task.

import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.util import ngrams

You need to add punctuation to the stop-words list, because NLTK's default tokenizer returns punctuation marks as separate tokens and the stop-words list doesn't include them.

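# `text` is assumed to hold the sentence from the question
stop_words = stopwords.words('english') + list(string.punctuation)
words = [w for w in nltk.word_tokenize(text) if w not in stop_words]  # drop stop words and punctuation
bigrams = [" ".join(pair) for pair in ngrams(words, 2)]  # every pair of adjacent words
bigrams[0::2]  # every second bigram, i.e. non-overlapping pairs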

The bigram result is:
[image: bigram output]
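
One caveat with taking every second bigram: it only lines up while every phrase is exactly two words, so single-word items such as "design" and "development" end up paired with a neighbour. As an alternative sketch (not the approach in the snippet above), if the phrases are always comma-separated inside the parentheses, splitting on commas with plain Python keeps each item whole. The function name `split_parenthesised_list` is made up for illustration.

```python
import re

def split_parenthesised_list(text):
    """Return the comma-separated items found inside the first (...) group."""
    match = re.search(r"\(([^)]*)\)", text)
    if match is None:
        return []
    # Split on commas and drop surrounding whitespace and empty items
    return [item.strip() for item in match.group(1).split(",") if item.strip()]

text = ("Architecture Experience - 5 years minimum across multiple disciplines "
        "(Agile Methodologies, Scrum Framework, Risk management, Technology planning, "
        "design, development, System Integration).")

phrases = split_parenthesised_list(text)
print(phrases)
```

This keeps "design" and "development" as separate single-word items; whether they belong in the final list is up to you, since the expected output in the question leaves them out.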

Hope this helps.

Thanks for your inputs.

I used phrase matching and solved the issue.

Hi, nice to know that you figured it out. Can you please share what this phrase matching is? I'm interested to know.

Answer inline

from collections import Counter
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: the post never shows how `nlp` was created; a blank English pipeline is enough for exact phrase matching
nlp = spacy.blank('en')
color_patterns = [nlp(text) for text in ('red', 'green', 'yellow')]
product_patterns = [nlp(text) for text in ('boots', 'coats', 'bag')]
material_patterns = [nlp(text) for text in ('bat', 'yellow ball')]

matcher = PhraseMatcher(nlp.vocab)
# spaCy v2 signature; on spaCy v3 pass the patterns as a list instead, e.g. matcher.add('COLOR', color_patterns)
matcher.add('COLOR', None, *color_patterns)
matcher.add('PRODUCT', None, *product_patterns)
matcher.add('MATERIAL', None, *material_patterns)
d = []
doc = nlp("yellow ball yellow lines")
matches = matcher(doc)
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # look up the rule name, e.g. 'COLOR'
    span = doc[start:end]  # get the matched slice of the doc
    d.append((rule_id, span.text))
# Count how often each (rule, phrase) pair was matched
print("\n".join(f'{i[0]} {i[1]} ({j})' for i, j in Counter(d).items()))

COLOR yellow (2)
MATERIAL yellow ball (1)
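
To connect this back to the original question, here is a dependency-free sketch of what phrase matching does: slide over the tokens and report any run that equals a known phrase. This is a plain-Python illustration of the idea, not how spaCy's PhraseMatcher is implemented. The function `find_phrases`, the label `SKILL`, and the phrase list are all made up for the example, and tokens are split on word characters rather than with a real tokenizer.

```python
import re

def find_phrases(text, patterns):
    """Return (label, phrase) for every occurrence of a known phrase in text.

    patterns maps a label to a list of phrases. Matching is case-insensitive
    and works on whole tokens; punctuation between tokens is ignored.
    """
    tokens = re.findall(r"\w+", text.lower())
    found = []
    for start in range(len(tokens)):
        for label, phrases in patterns.items():
            for phrase in phrases:
                phrase_tokens = phrase.lower().split()
                if not phrase_tokens:
                    continue  # skip empty patterns
                # Compare the token window starting here with the phrase
                if tokens[start:start + len(phrase_tokens)] == phrase_tokens:
                    found.append((label, phrase))
    return found

# Hypothetical phrase list taken from the expected output in the question
patterns = {
    "SKILL": ["Agile Methodologies", "Scrum Framework", "Risk management",
              "Technology planning", "System Integration"],
}
text = ("Architecture Experience - 5 years minimum across multiple disciplines "
        "(Agile Methodologies, Scrum Framework, Risk management, Technology planning, "
        "design, development, System Integration).")

for label, phrase in find_phrases(text, patterns):
    print(label, phrase)
```

The trade-off compared with tokenizing is that you have to know the phrases in advance, but multi-word terms then come back as single units regardless of how many words they contain.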

Thank you for sharing this.

© Copyright 2013-2019 Analytics Vidhya