Expected Behavior With Regular Expressions With Capturing-groups In Pandas' `str.extract()`
Solution 1:
First of all, the behavior of Pandas .str.extract() is expected: it returns only the contents of the capturing groups. The pattern used with extract requires at least one capturing group:
pat : string
Regular expression pattern with capturing groups
If you use a named capturing group, the new column will be named after the named group.
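The column naming described above can be checked with a minimal sketch. The sample titles and the group name year are assumptions made for illustration; they are not taken from the original question.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the behavior.
titles = pd.Series(["Toy Story (1995)", "Heat (1995)", "No Year Here"])

# Unnamed capturing group: the single result column is labelled 0.
unnamed = titles.str.extract(r"\((\d{4})\)")

# Named capturing group: the result column takes the group's name.
named = titles.str.extract(r"\((?P<year>\d{4})\)")

print(unnamed.columns.tolist())
print(named.columns.tolist())
print(named["year"].tolist())
```

Rows where the pattern does not match yield NaN rather than raising an error.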
The grep command you provided can be reduced to
grep '\((.*)\)'
because grep matches any part of a line (it does not require a full-line match) and works line by line: once a match is found, the whole line is returned. To print only the matched text instead, use the -o switch.
With plain grep, you cannot return the capturing group contents. This can be worked around with PCRE regexps enabled by the -P option, but that option is not available on Mac, for example. sed or awk may help in those situations, too.
Solution 2:
Try using this:
movies['year'] = movies['title'].str.extract(r'.*\((\d{4})\).*', expand=False)
- With expand=False and a single capturing group, the result is a Series. Set expand=True if you want a DataFrame instead; with multiple capturing groups the result is always a DataFrame.
- A year is always composed of 4 digits, so the regex \((\d{4})\) matches any four-digit year between parentheses.
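To make the effect of expand concrete, here is a small sketch. The movies frame and its titles are made-up sample data, and the two-group pattern is an assumption added only to show the multi-group case.

```python
import pandas as pd

# Hypothetical sample frame; only the 'title' column name follows the answer.
movies = pd.DataFrame({"title": ["Toy Story (1995)", "Jumanji (1995)", "Untitled"]})

# One capturing group, expand=False: the result is a Series.
year_series = movies["title"].str.extract(r"\((\d{4})\)", expand=False)

# One capturing group, expand=True: the result is a one-column DataFrame.
year_frame = movies["title"].str.extract(r"\((\d{4})\)", expand=True)

# Two capturing groups: the result is a DataFrame even with expand=False.
both = movies["title"].str.extract(r"^(.*?)\s*\((\d{4})\)", expand=False)

print(type(year_series).__name__)
print(type(year_frame).__name__)
print(both.shape)
```

Titles without a parenthesized year, such as the last row here, produce NaN in the extracted column.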