Extract all words with characters 'NN' from tags obtained (in R)

text_mining
parsing
string

#1

I have entries of the form shown below in a file. I am working in R and want to extract every word tagged ‘NN’. Can someone help me with this?

> as.character(tags[1,3])
[1] "[[('Old', 'NNP'), ('seattle', 'VBP'), ('getaway', 'NN'), ('This', 'DT'), ('was', 'VBD'), ('Old', 'NNP'), ('World', 'NNP'), ('Excellence', 'NNP'), ('at', 'IN'), (\"it's\", 'NNP'), ('best', 'JJS')]]"

The output should be the word ‘getaway’.


#2

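Here is a hack/workaround.

Install ‘stringr’ if it is not already installed. The code removes all punctuation, splits the string on spaces, and returns the token immediately before each "NN".

library(stringr)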

gibberish <- "[[('Old', 'NNP'), ('seattle', 'VBP'), ('getaway', 'NN'), ('This', 'DT'), ('was', 'VBD'), ('Old', 'NNP'), ('World', 'NNP'), ('Excellence', 'NNP'), ('at', 'IN'), (\"it's\", 'NNP'), ('best', 'JJS')]]" 

sensible <- str_replace_all(gibberish, "[[:punct:]]","")

sensible <- unlist(strsplit(sensible," "))

index <- which(sensible=="NN")

sensible[index-1] 

Voilà!
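
The steps above could also be wrapped in a function that handles a whole column of such strings. This is only a sketch in base R, with `gsub` standing in for `str_replace_all`. The name `words_before_tag` and the sample `entries` are made up for illustration, and it assumes that no word contains a space and that no word is itself spelled like the tag.

```r
# Hypothetical helper: for each entry, return the tokens that sit
# immediately before the requested tag.
words_before_tag <- function(x, tag = "NN") {
  lapply(x, function(entry) {
    # Drop all punctuation, leaving "word TAG word TAG ..."
    tokens <- unlist(strsplit(gsub("[[:punct:]]", "", entry), " "))
    # Each word comes right before its tag, so step back one position
    tokens[which(tokens == tag) - 1]
  })
}

# Made-up sample entries in the same format as the question
entries <- c(
  "[[('Old', 'NNP'), ('getaway', 'NN')]]",
  "[[('best', 'JJS'), ('hotel', 'NN'), ('room', 'NN')]]"
)

words_before_tag(entries)
```

The comparison `tokens == tag` is an exact match, so ‘NNP’ tokens are not picked up when asking for ‘NN’.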


#3

Try this, with stringr loaded and x holding the string from the question:

str_extract_all(x, "(?<=\\(')\\w+(?=', 'NN'\\))")
[[1]]
[1] "getaway"

To explain what is going on, we use both lookbehinds and lookaheads.

  1. Lookbehind - "(?<=\\(')"

  2. Lookahead - "(?=', 'NN'\\))"

  3. Match - "\\w+"

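We match any run of word characters that comes immediately after an open parenthesis and single quote, "('", and immediately before "', 'NN')". Because the lookahead includes the closing quote after NN, words tagged 'NNP' are not matched.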

The general pattern is "(?<=string1)capture(?=string2)".
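
To make the tag a parameter, the same lookbehind/lookahead pattern can be built with `paste0`. Below is a minimal sketch in base R, using `regmatches` and `gregexpr` with `perl = TRUE` in place of `str_extract_all`. The helper name `extract_tagged` is hypothetical, and it assumes the tag contains no regex metacharacters.

```r
# Hypothetical helper: return the words carrying exactly the given tag.
extract_tagged <- function(x, tag = "NN") {
  # Lookbehind for (' and lookahead for ', 'TAG') so only the word is returned.
  # Assumes `tag` has no regex metacharacters (e.g. "PRP$" would need escaping).
  pattern <- paste0("(?<=\\(')\\w+(?=', '", tag, "'\\))")
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

x <- "[[('Old', 'NNP'), ('seattle', 'VBP'), ('getaway', 'NN'), ('This', 'DT'), ('was', 'VBD'), ('Old', 'NNP'), ('World', 'NNP'), ('Excellence', 'NNP'), ('at', 'IN'), (\"it's\", 'NNP'), ('best', 'JJS')]]"

extract_tagged(x)         # words tagged NN
extract_tagged(x, "NNP")  # words tagged NNP
```

Note that a token written with double quotes, such as "it's" in the example, is skipped, because the lookbehind requires a single quote after the parenthesis.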


#4

Where can I study and understand this?


#5

This site may help: http://www.regular-expressions.info/rlanguage.html