Expected Behavior With Regular Expressions With Capturing-groups In Pandas' `str.extract()`
Solution 1:
First of all, the behavior of pandas' .str.extract() is as expected: it returns only the contents of the capturing groups. The pattern passed to extract must contain at least one capturing group, as the documentation states:
pat : string
Regular expression pattern with capturing groups
If you use a named capturing group, the new column will be named after the named group.
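To make the points above concrete, here is a minimal sketch. The sample titles and the group name `year` are made up for illustration and are not from the original question. It shows that only the captured text comes back, that a named group becomes the column name, and that a pattern with no capturing group is rejected.

```python
import pandas as pd

# Hypothetical sample data, only for illustration.
titles = pd.Series(["Toy Story (1995)", "Heat (1995)", "No Year Here"])

# Unnamed group: the resulting column is labelled with the group index, 0.
unnamed = titles.str.extract(r"\((\d{4})\)")

# Named group: the resulting column is labelled with the group name.
named = titles.str.extract(r"\((?P<year>\d{4})\)")

# A pattern with no capturing group at all is rejected.
try:
    titles.str.extract(r"\(\d{4}\)")
    no_group_rejected = False
except ValueError:
    no_group_rejected = True

print(unnamed)
print(named)
print(no_group_rejected)
```

Rows where the pattern does not match come back as NaN rather than being dropped.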
The grep command you provided can be reduced to
grep '\((.*)\)'
because grep matches anywhere in a line (it does not require a full-line match) and works line by line: once a match is found, the whole line is printed. To print only the matched text instead, use the -o switch.
grep cannot print the contents of a capturing group on its own. A PCRE pattern enabled with the -P option (for example, using \K or lookarounds together with -o) can work around this, but -P is a GNU grep feature and is not available in the grep shipped with macOS, for example. sed or awk can help in those situations, too.
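The three kinds of output discussed here (the whole line, the matched text, and the captured group) can be told apart with Python's re module. This is only an analogy, not a description of how grep is implemented, and the sample line is made up.

```python
import re

# Hypothetical input line, only for illustration.
line = "Toy Story (1995)"

m = re.search(r"\((.*)\)", line)

# Roughly what plain grep prints: the whole line, if it matches anywhere.
whole_line = line if m else None

# Roughly what grep -o prints: only the matched text, parentheses included.
matched_text = m.group(0)

# What str.extract() returns: only the contents of the capturing group.
captured = m.group(1)

print(whole_line, matched_text, captured, sep="\n")
```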
Solution 2:
Try using this:
movies['year'] = movies['title'].str.extract(r'.*\((\d{4})\).*', expand=False)
- With a single capturing group, expand=False returns a Series; set expand=True if you want a DataFrame instead. With multiple capturing groups, a DataFrame is returned either way.
- A year is always composed of 4 digits, so the regex \((\d{4})\) matches any four-digit number between parentheses and captures just the digits.
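The effect of expand can be checked with a small sketch. The movies frame below is a made-up stand-in for the one in the question, and the two-group pattern is only an illustrative assumption.

```python
import pandas as pd

# Hypothetical stand-in for the movies DataFrame from the question.
movies = pd.DataFrame({"title": ["Toy Story (1995)", "Jumanji (1995)", "Untitled"]})

# One capturing group, expand=False: the result is a Series.
year_series = movies["title"].str.extract(r"\((\d{4})\)", expand=False)

# One capturing group, expand=True: the result is a one-column DataFrame.
year_frame = movies["title"].str.extract(r"\((\d{4})\)", expand=True)

# Two capturing groups: the result is a DataFrame even with expand=False.
parts = movies["title"].str.extract(r"^(.*?)\s*\((\d{4})\)", expand=False)

print(type(year_series).__name__)
print(type(year_frame).__name__)
print(type(parts).__name__, parts.shape)
```

Note that the extracted years are strings; titles without a parenthesized year yield NaN.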