Need help with removing duplicate characters with Regex

nlp
nltk
python
regular_expression
regex
#1

Hello,

I understand the following code only in parts and need your help demystifying the whole block.
The code helps remove repeated characters in a word,
e.g. it can convert “wooooowwww” to “wow”, “Yesssss” to “Yes”, etc.

import re

old_word = 'finalllyyy'
repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
match_substitution = r'\1\2\3'
new_word = repeat_pattern.sub(match_substitution, old_word)

This is how I understand it.

repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)') 

Group 1 captures a set of zero or more characters. Group 2 captures a single character and is followed by a back reference to it. Finally, group 3 captures another set of characters. Is my understanding correct?

new_word = repeat_pattern.sub(match_substitution,old_word)

The above snippet is doing some sort of replacement but I am not sure how. Can someone explain?

Thanks

Mohit

#2

OK, let’s go:

  • (\w*) matches any kind of word character (letters, digits, underscore; what exactly counts depends on the flags, and for Python 3 str patterns it is Unicode-aware by default, so it can e.g. include French letters with accents), zero or more times (that is what the quantifier * does).

  • Next it tries to match just one single word character (\w) and then that same character again, using \2 , which is a back reference to the second capturing group in the expression, i.e. the \w character matched just before.

  • And after that, again zero or multiple word characters, same as at the beginning.

  • If that expression matches, then match_substitution = r'\1\2\3' replaces it – again, using back references – with the matches that were made capturing subpatterns using parentheses in the search pattern.

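  • So every captured part gets replaced by itself, except for the repeated character matched by the back reference \2 in the pattern. That back reference has no capturing parentheses of its own, so the character it matched is not written back.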
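To see what each group actually holds, it can help to inspect a single match. The sketch below uses only the standard `re` module and the pattern from the question; the variable names are just for illustration.

```python
import re

old_word = 'finalllyyy'
repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')

match = repeat_pattern.match(old_word)
# Group 1 is greedy, so it grabs as much as it can; the doubled
# character that ends up matched is the LAST adjacent pair in the word.
print(match.groups())   # ('finallly', 'y', '')

# The replacement writes back groups 1, 2 and 3. The second 'y' was
# consumed by the back reference \2 in the pattern, which is not a
# group, so it is not written back.
new_word = repeat_pattern.sub(r'\1\2\3', old_word)
print(new_word)         # finalllyy
```

Note that a single call removes exactly one duplicate character, not all of them.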

I did not get it to work using the leading and trailing (\w*) though – but since these also match zero word characters, I think they can be ditched altogether.
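For what it is worth, the original pattern does run, but each `sub` call drops only one duplicate character, so it has to be applied repeatedly. Below is a minimal sketch of that idea. The `KNOWN_WORDS` set and the `remove_repeats` helper are hypothetical, made up for this illustration; they stand in for whatever dictionary check you prefer. Without such a check the loop keeps going until no doubled letter is left.

```python
import re

repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')

# Hypothetical tiny vocabulary, only for this illustration.
KNOWN_WORDS = {'finally', 'wow', 'yes'}

def remove_repeats(word, known_words=KNOWN_WORDS):
    """Drop one duplicate per pass until the word is known or stops changing."""
    while word.lower() not in known_words:
        shorter = repeat_pattern.sub(r'\1\2\3', word)
        if shorter == word:      # nothing left to remove
            break
        word = shorter
    return word

print(remove_repeats('finalllyyy'))                      # finally
print(remove_repeats('finalllyyy', known_words=set()))   # finaly
```

The second call shows why some stopping condition is useful: with nothing to stop it, the legitimate double "l" is removed as well.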

So this should do what you want to achieve:

repeat_pattern = re.compile(r'(\w)\1*')
match_substitution = r'\1'

(Since I removed the leading capturing subpattern here, \2 was replaced by \1 , referencing the now first capturing subpattern.)

With the above changes, I was able to get the right corrections:

[Screenshot from 2019-05-08 18-36-30]

Notice the asterisk I added after the back reference: it means the character may be repeated any number of times, and the whole run is caught. As you can see in the above picture, it does not matter how many times a character is repeated, we still get the correct word.
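In case the screenshot is not available, here is a small sketch of the simplified pattern run on the words from this thread. The `collapse_repeats` helper name is my own, not part of any library.

```python
import re

repeat_pattern = re.compile(r'(\w)\1*')

def collapse_repeats(word):
    # Replace every run of one repeated character with a single copy.
    return repeat_pattern.sub(r'\1', word)

for word in ['wooooowwww', 'Yesssss', 'finalllyyy']:
    print(word, '->', collapse_repeats(word))
# wooooowwww -> wow
# Yesssss -> Yes
# finalllyyy -> finaly
```

One caveat: this collapses every run to a single character, so words with a legitimate double letter (like "finally") lose it too. Whether that is acceptable depends on your use case.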

Hope this helped you!
Sanad

#3

Thanks for your help.

I have a few doubts related to the above question. I tried clearing up my doubts on regex101 but it complicated things further.

old_word = "loooveee"

  • what is the difference between (\w*) , (\w)* ?
  • (\w*)(\w)\2(\w*) : (\w)\2 - the back reference will only work if the same letter appears twice in a row.
    Also, does this pattern only work on even sets? I mean, in the case of the word “loooveee”, does it capture the third “o” and the third “e”?
  • If that expression matches, then match_substitution = r'\1\2\3' replaces it – again, using backreferences – with the matches that were made capturing subpatterns using parentheses in the search pattern.
  • So every matched part gets replaced by itself – except for the repeated character match \2 , which does not have grouping parentheses.

I am still not completely sure I understand the above correctly. Can you explain in a little more detail?

Thanks,

Mohit

#4
  • what is the difference between (\w*) , (\w)* ?

Parentheses denote a group in the regex.


There could have been multiple elements apart from just \w in the group, so if you put the * outside a group, you are basically saying: this whole group can repeat zero or more times.

If you put the * directly after a pattern like \w, it means that this particular pattern or character can repeat zero or more times.

Simply put,

A group is different from a single character

For example, the above group could just as easily have been (\w@)* , which matches zero or more repetitions of a word character followed by an @ sign.
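A quick way to see the difference is to look at what the group captures in each case. This is a small sketch with the standard `re` module; the sample strings and variable names are arbitrary.

```python
import re

text = 'love'

# (\w*): the quantifier is inside the group, so the group captures
# the whole run of word characters.
inner = re.match(r'(\w*)', text)
print(inner.group(1))    # love

# (\w)*: the group matches one character and is repeated; after the
# match, the group holds only its last repetition.
outer = re.match(r'(\w)*', text)
print(outer.group(0))    # love
print(outer.group(1))    # e

# A group with more than one element, repeated as a unit.
pairs = re.match(r'(\w@)*', 'a@b@c')
print(pairs.group(0))    # a@b@
print(pairs.group(1))    # b@
```

Both patterns match the same overall text here; what differs is what ends up in group 1, and that matters as soon as you use \1 in a replacement.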

You should read the regex documentation.

#5

I don’t know about the given expression. If you have tried it and it works, do share the code and output for better clarity.

The expression that I wrote works even for “loooveee”.
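On the question of odd versus even repeats: with \1* the length of the run does not matter. The sketch below lists the runs that the pattern finds in “loooveee” and then collapses them; the variable names are illustrative only.

```python
import re

repeat_pattern = re.compile(r'(\w)\1*')
word = 'loooveee'

# Each match is one character followed by zero or more copies of itself.
runs = [m.group(0) for m in repeat_pattern.finditer(word)]
print(runs)                              # ['l', 'ooo', 'v', 'eee']

collapsed = repeat_pattern.sub(r'\1', word)
print(collapsed)                         # love
```

Single characters such as "l" and "v" also match (as a run of length one) and are simply replaced by themselves.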

#6
  • If that expression matches, then match_substitution = r'\1\2\3' replaces it – again, using backreferences – with the matches that were made capturing subpatterns using parentheses in the search pattern.
  • So every matched part gets replaced by itself – except for the repeated character match \2 , which does not have grouping parentheses.

I think if you read a bit of the regex documentation and the previous answers, and then experiment with groups on some examples, this will become much clearer.