How to find trailing and leading words of a word using R?

text_mining
r

#1

Hello AVians,

I have a text document which has a million words. Now, I need to know how to find trailing and leading words of a word using R.

For example, suppose I want to find the words that come before and after the word “error”. The leading words could be anything, such as

“typo error”
“manual error”
“system error”

and with trailing words like

“error corrected”
“error found”
“error occurred”

Any idea how to do this? Thanks in advance for your inputs.


#2

Regex is your best bet. A couple of rules can solve this problem.

Matching leading words:

\w+(?= error)

Matching trailing words:

(?<=error )\w+

For R in particular I didn’t find an easy solution; working with regex in R feels messy to me. Personally, I would preprocess the text with some other tool and then use R just for the data analysis.
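
For what it’s worth, the two lookaround patterns above can be tried in base R by passing perl = TRUE to gregexpr(), since R’s default regex engine does not support lookaheads or lookbehinds. The following is only a minimal sketch: the sample sentence and the helper name extract_all are made up for illustration, and a real document would probably be read in with something like readLines() first.

```r
# Hypothetical sample text standing in for the real document
text <- "A typo error found today. The manual error corrected itself, but a system error occurred."

# Hypothetical helper: return every match of a Perl-style pattern in one string.
# perl = TRUE is needed because lookarounds are not supported by R's default engine.
extract_all <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# Words immediately before " error" (lookahead pattern from the post)
leading <- extract_all("\\w+(?= error)", text)

# Words immediately after "error " (lookbehind pattern from the post)
trailing <- extract_all("(?<=error )\\w+", text)

print(leading)   # "typo" "manual" "system"
print(trailing)  # "found" "corrected" "occurred"

# Count how often each neighbouring word occurs, most frequent first
leading_counts <- sort(table(leading), decreasing = TRUE)
trailing_counts <- sort(table(trailing), decreasing = TRUE)
```

Note that backslashes have to be doubled inside R string literals. As written, the patterns would also match inside longer words such as “terror”; adding a word boundary (\\b) before “error” is one way to avoid that, if it matters for your data.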