
Expected Behavior With Regular Expressions With Capturing-groups In Pandas' `str.extract()`

I'm trying to get a grasp on regular expressions and came across the one passed to the str.extract method: movies['year']=movies['title'].str.extract('.*\((.*)\).*',ex

Solution 1:

First of all, the behavior of Pandas .str.extract() is expected: it returns only the contents of the capturing groups, not the whole match. The pattern passed to extract must contain at least one capturing group:

pat : string
Regular expression pattern with capturing groups

If you use a named capturing group, the new column will be named after the named group.
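
To illustrate the column naming described above, here is a minimal sketch. The movies DataFrame and its titles are made-up sample data, not taken from the question.

```python
import pandas as pd

# Hypothetical sample data; the titles are invented for illustration.
movies = pd.DataFrame({"title": ["Toy Story (1995)", "Heat (1995)", "No Year Here"]})

# Unnamed group: with expand=True the resulting column is labelled 0.
unnamed = movies["title"].str.extract(r"\((.*)\)", expand=True)

# Named group: the resulting column takes the group's name.
named = movies["title"].str.extract(r"\((?P<year>.*)\)", expand=True)

print(list(unnamed.columns))  # [0]
print(list(named.columns))    # ['year']
print(named)                  # rows without a match hold NaN
```

Only the text inside the parentheses ends up in the column; the rest of each title is discarded.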

The grep command you provided can be reduced to

grep '\((.*)\)'

as grep matches anywhere in a line (it does not require a full-line match) and works line by line: once a match is found, the whole line is printed. To print only the matched part instead, use the -o switch.

Plain grep cannot return only the capturing group contents. A workaround is a PCRE pattern enabled with the -P option (GNU grep), but that option is not available in the grep shipped with macOS, for example. sed or awk can help in those situations, too.
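
The three behaviors discussed above (whole line, matched part only, capturing group only) can be imitated in Python with the standard re module. This is only a rough analogy to grep, and the input lines are hypothetical.

```python
import re

# Hypothetical input lines standing in for a file passed to grep.
lines = ["Toy Story (1995)", "No Year Here", "Heat (1995) [remastered]"]

pattern = re.compile(r"\((.*)\)")

# Like plain grep: keep every line that contains a match anywhere.
whole_lines = [line for line in lines if pattern.search(line)]

# Like grep -o: keep only the matched portion (group 0), parentheses included.
matched_parts = [m.group(0) for line in lines if (m := pattern.search(line))]

# Like str.extract: keep only the contents of the first capturing group.
captured = [m.group(1) for line in lines if (m := pattern.search(line))]

print(whole_lines)    # ['Toy Story (1995)', 'Heat (1995) [remastered]']
print(matched_parts)  # ['(1995)', '(1995)']
print(captured)       # ['1995', '1995']
```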


Solution 2:

Try using this:

movies['year'] = movies['title'].str.extract(r'.*\((\d{4})\).*', expand=False)

  • Set expand=True if you want a DataFrame returned; with expand=False and a single capturing group you get a Series. A pattern with multiple capturing groups always returns a DataFrame.
  • A year here is always four digits, so the regex \((\d{4})\) matches only a four-digit number between parentheses.
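
As a small sketch of why the stricter pattern helps, compare both regexes on a title that contains more than one pair of parentheses. The titles are made-up examples.

```python
import pandas as pd

# Hypothetical titles; the second one has two parenthesised parts.
titles = pd.Series(["Toy Story (1995)", "Leon (The Professional) (1994)", "Untitled"])

# Greedy .* runs from the first "(" to the last ")".
loose = titles.str.extract(r"\((.*)\)", expand=False)

# \d{4} accepts only four digits between parentheses.
strict = titles.str.extract(r"\((\d{4})\)", expand=False)

# expand=True wraps the same result in a one-column DataFrame.
as_frame = titles.str.extract(r"\((\d{4})\)", expand=True)

print(loose.tolist()[:2])   # ['1995', 'The Professional) (1994']
print(strict.tolist()[:2])  # ['1995', '1994']
print(type(strict).__name__, type(as_frame).__name__)  # Series DataFrame
```

The extracted values are strings. If numeric years are needed, a conversion such as pd.to_numeric(strict) is one option; rows without a match stay NaN.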
