
Python Regular Expressions, How To Extract Longest Of Overlapping Groups

How can I extract the longest of several groups that start the same way? For example, from a given string, I want to extract the longest match to either CS or CSI. I tried this '(CS|CSI).

Solution 1:

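No, that's just how it works, at least in Perl-derived regex flavors like Python, JavaScript, .NET, etc. Alternatives are tried left to right and the engine stops at the first one that matches, so with CS listed first, CSI is never tried once CS has matched.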

http://www.regular-expressions.info/alternation.html
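To make the ordering rule concrete, here is a minimal sketch using the CS/CSI example from the question. The sample string is made up for illustration, and the optional-suffix variant is just one possible workaround, not something the answer above prescribes.

```python
import re

text = "xxCSIxx"  # hypothetical sample input

# Alternatives are tried left to right; the first one that matches wins.
short_first = re.search(r"(CS|CSI)", text).group(1)
long_first = re.search(r"(CSI|CS)", text).group(1)

# Making the extra character optional also prefers the longer form,
# because ? is greedy.
optional_suffix = re.search(r"(CSI?)", text).group(1)

print(short_first)      # CS
print(long_first)       # CSI
print(optional_suffix)  # CSI
```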


Solution 2:

I'm intrigued to know the right way of doing this. If it helps, you can always build up the regex from every prefix of the string you want to match, longest first:

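import re

string_to_look_in = "AUHDASOHDCSIAAOSLINDASOI"
string_to_match = "CSIABC"

# Alternation of every prefix of string_to_match, longest first:
# (CSIABC|CSIAB|CSIA|CSI|CS|C)
# re.escape keeps any regex metacharacters in the prefixes literal.
re_to_use = "(" + "|".join(re.escape(string_to_match[:i]) for i in range(len(string_to_match), 0, -1)) + ")"

re_result = re.search(re_to_use, string_to_look_in)

print(re_result.group())  # CSIA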
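One caveat worth sketching: re.search returns the leftmost match, which is not necessarily the longest one in the whole string. The variant below is an assumption on my part, not part of the answer above. It wraps the same prefix alternation in a lookahead so that every start position is examined, then keeps the longest hit. The function names and the second test string are hypothetical.

```python
import re

def prefix_pattern(target):
    """Lookahead around every non-empty prefix of target, longest first."""
    prefixes = [target[:i] for i in range(len(target), 0, -1)]
    alternation = "|".join(re.escape(p) for p in prefixes)
    # The lookahead consumes nothing, so overlapping candidates are all seen.
    return re.compile("(?=(" + alternation + "))")

def longest_prefix_match(target, text):
    """Longest prefix of target found anywhere in text, or None."""
    matches = prefix_pattern(target).findall(text)
    return max(matches, key=len) if matches else None

print(longest_prefix_match("CSIABC", "AUHDASOHDCSIAAOSLINDASOI"))  # CSIA
# A plain re.search would stop at the lone "C" here:
print(longest_prefix_match("CSIABC", "xCxCSIAByy"))  # CSIAB
```

When several hits share the maximum length, max returns the first of them.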

Solution 3:

Similar functionality is present in the Vim editor ("sequence of optionally matched atoms"), where e.g. col\%[umn] matches col in color, colum in columbus and the full column.

I am not aware of similar functionality in Python's re module, but you can use nested non-capturing groups, each one followed by a ? quantifier, for that:

>>> import re
>>> words = ['color', 'columbus', 'column']
>>> rex = re.compile(r'col(?:u(?:m(?:n)?)?)?')
>>> for w in words:
...     print(rex.findall(w))
...
['col']
['colum']
['column']
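Writing the nested groups by hand gets tedious for longer words, so the pattern can be generated instead. This is a rough sketch under the assumption that the word splits into a required head and an optional tail, as in the Vim example. optional_tail_pattern is a hypothetical helper, not something provided by re.

```python
import re

def optional_tail_pattern(required, optional):
    """Build required + nested optional groups, e.g. col(?:u(?:m(?:n)?)?)?"""
    pattern = ""
    # Wrap from the last optional character outwards.
    for ch in reversed(optional):
        pattern = "(?:" + re.escape(ch) + pattern + ")?"
    return re.escape(required) + pattern

pattern = optional_tail_pattern("col", "umn")
print(pattern)  # col(?:u(?:m(?:n)?)?)?

rex = re.compile(pattern)
results = [rex.search(w).group() for w in ["color", "columbus", "column"]]
print(results)  # ['col', 'colum', 'column']
```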

Solution 4:

As Alan says, the patterns will be matched in the order you specified them.

If you want to match on the longest of overlapping literal strings, you need the longest one to appear first. But you can organize your strings longest-to-shortest automatically, if you like:

>>> '|'.join(sorted('cs csi miami vice'.split(), key=len, reverse=True))
'miami|vice|csi|cs'
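Putting that sorted alternation to use might look like the sketch below. longest_first_regex is a hypothetical helper name, and the re.escape call is an addition of mine so that literals containing regex metacharacters stay literal.

```python
import re

def longest_first_regex(literals):
    """Compile an alternation of literal strings, longest first."""
    ordered = sorted(literals, key=len, reverse=True)
    return re.compile("|".join(re.escape(s) for s in ordered))

rex = longest_first_regex(["cs", "csi", "miami", "vice"])
print(rex.pattern)                      # miami|vice|csi|cs
print(rex.findall("csi miami and cs"))  # ['csi', 'miami', 'cs']
```

Because csi now precedes cs in the alternation, the longer literal wins wherever both could match at the same position.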
