Extract words between symbols in R

r
data_wrangling

#1

Hello,

I have data like "Braund, Mr. Owen Harris" and I want to extract the part between the comma and the period (here, "Mr") in R.
How can this be done?


#2

Regex to the rescue! There’s a whole family of functions that you can play with.

https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html
http://www.regular-expressions.info/rlanguage.html


#3

BTW, after you master it, you can do stuff like this:


#4

Hahaha… this is a damn good one, @anon.
Yep, once I master it. :stuck_out_tongue:


#5

@shuvayan,
If I am not wrong, this data is from the Titanic problem on Kaggle! :stuck_out_tongue:
You can use something like:

strsplit(columnName, split = '[,.]')

Hope it helps!
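
As a minimal sketch of what that `strsplit` call does, here it is applied to the single name from the question. The variable names `full_name`, `parts`, and `title` are only illustrative, and `trimws` is an extra step that the post above does not mention.

```r
# Sample value taken from the question above.
full_name <- "Braund, Mr. Owen Harris"

# '[,.]' is a character class: split at every comma or period.
# strsplit() returns a list, so [[1]] pulls out the character vector.
parts <- strsplit(full_name, split = "[,.]")[[1]]
print(parts)   # "Braund" " Mr" " Owen Harris"

# The second piece is the text between the comma and the period;
# trimws() drops the leading space left behind by the split.
title <- trimws(parts[2])
print(title)   # "Mr"
```

Note that a name containing more than one period or comma would produce more pieces, so taking the second piece assumes the comma comes first.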


#6

@Aditya_Sharma, wouldn’t that require the string in each row to contain the exact string '.,' as a delimiter?


#7

@anon,

Yes it would. And I wrote columnName because I was sure that every row of that column in the Titanic data has ',' and '.' as delimiters. We can use this if we want to do it for just one row:

strsplit(columnName[rowNumber], split = '[,.]')


#8

Oh, okay, I just figured out how it works. (Hadn’t taken into consideration the square brackets, initially.) Yes, in this case that’s a simpler solution than using grep. Thanks. :smile:
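
For anyone else who missed the square brackets, here is a small sketch of the difference, using the name from the original question. The variable names are hypothetical, and `fixed = TRUE` is used here only to show what a literal two-character delimiter would do.

```r
full_name <- "Braund, Mr. Owen Harris"

# Regex character class: splits at each comma and at each period separately.
by_class <- strsplit(full_name, split = "[,.]")[[1]]

# fixed = TRUE treats the pattern as the literal two-character string ",.",
# which never occurs in this name, so the input comes back unsplit.
by_literal <- strsplit(full_name, split = ",.", fixed = TRUE)[[1]]

length(by_class)    # 3
length(by_literal)  # 1
```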


#9

Try:

test <- "AB,C 123.DE"
regmatches(test, regexec(',(.*?)\\.', test))[[1]][2]
[1] "C 123"

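In this example, we pull out the part of the test string between the comma and the period. The regular expression is ',(.*?)\\.'. The parentheses mark the group to capture, and the lazy quantifier .*? stops at the first period rather than the last. The period is a regex metacharacter, so it is escaped with a backslash; inside an R string literal that backslash must itself be doubled, giving \\. in the code. The trailing [[1]][2] selects the captured group, since element 1 holds the full match.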
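
The same pattern can be applied to a whole column at once. The sketch below assumes a character vector called `passenger_names`, which is a hypothetical name with made-up sample values in the style of the question, and adds `trimws` to drop the space after the comma.

```r
# Illustrative sample values; the last one has no comma/period pair.
passenger_names <- c("Braund, Mr. Owen Harris",
                     "Heikkinen, Miss. Laina",
                     "No delimiters here")

# Same pattern as above: capture the shortest run between a comma and a period.
matches <- regmatches(passenger_names, regexec(",(.*?)\\.", passenger_names))

# Each list element holds the full match followed by the captured group.
# Rows without a match yield character(0), so indexing [2] gives NA.
between <- trimws(vapply(matches, function(m) m[2], character(1)))
print(between)   # "Mr" "Miss" NA
```

Returning NA for rows that do not match keeps the result the same length as the input, so it can be assigned straight back as a new column.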