Extract words between symbols in R




I have data like "Braund, Mr. Owen Harris" and I want to extract the part between the comma and the period (here, "Mr") in R.
How can it be done?


Regex to the rescue! There’s a whole family of functions in base R that you can play with: grep, grepl, sub, gsub, regexpr, gregexpr, regexec and regmatches.
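As a rough sketch of how two of those functions could be applied to this question (the sample vector `passenger_names` is made up for illustration, and the pattern assumes each string has one comma followed later by a period):

```r
# Hypothetical sample values, in the style of the name from the question
passenger_names <- c("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina")

# grepl() tests whether a string contains a comma followed later by a period
has_delims <- grepl(",.*\\.", passenger_names)

# sub() with a capture group keeps only the text between the first comma
# and the first period; "\\1" refers to the parenthesised group
titles <- sub("^[^,]*, *([^.]*)\\..*$", "\\1", passenger_names)
titles
```

Here `[^,]*` and `[^.]*` mean "anything except a comma" and "anything except a period", which avoids relying on lazy quantifiers.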



BTW, after you master it, you can do stuff like this:


Hahaha… this is a damn good one, @anon.
Yep, once I master it. :stuck_out_tongue:


@shuvayan,
If I am not wrong, this data is from the Titanic problem on Kaggle! :stuck_out_tongue:
You can use something like:

strsplit(columnName, split = '[,.]')

Hope it helps!
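For anyone trying this out, here is a minimal sketch of how the result might be used. The vector `passenger_names` is a made-up stand-in for the real column, and the code assumes every entry has a comma before its first period:

```r
# Hypothetical sample values standing in for the Titanic name column
passenger_names <- c("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina")

# Split each name at every comma or period; the result is a list
# with one character vector per name
parts <- strsplit(passenger_names, split = "[,.]")

# The piece between the comma and the first period is element 2 of each
# vector; trimws() drops the space left over after the comma
titles <- trimws(sapply(parts, `[`, 2))
titles
```

Note that strsplit returns a list, so the piece you want still has to be picked out of each element.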


@Aditya_Sharma, wouldn’t that require the strings in each row to contain the exact string '.,' as a delimiter?



Yes, each row needs to contain both a comma and a period for this to work. I wrote columnName because I was sure that every row in that column of the Titanic data has ',' and '.' as delimiters. We can use this if we want to do it for just one row:

strsplit(columnName[rowNumber], split = '[,.]')


Oh, okay, I just figured out how it works. (Hadn’t taken into consideration the square brackets, initially.) Yes, in this case that’s a simpler solution than using grep. Thanks. :smile:



test <- "AB,C 123.DE"
regmatches(test, regexec(',(.*?)\\.', test))[[1]][2]
[1] "C 123"

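Here we pull out the part of the test string between the comma and the period. The regular expression is ',(.*?)\\.'. The parentheses mark the group that we would like to capture, and .*? is a lazy match, so it stops at the first period. The period is a regex metacharacter, so it has to be escaped with a backslash, and that backslash has to be doubled inside an R string. regmatches returns the full match followed by the captured group, which is why we take element [2].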
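The same idea can be stretched over a whole vector. The following is only a sketch: `passenger_names` is made-up sample data, and the ` *` after the comma is an addition to skip the space that follows it in names like the one in the question.

```r
# Hypothetical sample values; the last one has no comma or period
passenger_names <- c("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "No delimiters here")

# regexec() is vectorised, so regmatches() returns a list with one
# character vector (full match, captured group) per input string
matches <- regmatches(passenger_names, regexec(", *(.*?)\\.", passenger_names))

# Take the captured group from each element; strings without a match
# give an empty vector, so indexing [2] yields NA for them
titles <- sapply(matches, `[`, 2)
titles
```

Rows that do not match come back as NA instead of raising an error, which makes it easy to spot them afterwards.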