
Python Split On Multiple Delimiters Bug?

I was looking at the responses to this earlier-asked question: Split Strings with Multiple Delimiters? For my variant of this problem, I wanted to split on everything that wasn't

Solution 1:

The '-/ inside the character class creates a range (from ' to /) that includes the comma, so the comma is not treated as a delimiter.

When you need to put a literal hyphen in a Python re pattern, put it:

  • at the start: [-A-Z] (matches an uppercase ASCII letter and -)
  • at the end: [A-Z()-] (matches an uppercase ASCII letter, (, ) or -)
  • after a valid range: [A-Z-+] (matches an uppercase ASCII letter, - or +)
  • or just escape it.

You cannot put it after a shorthand class and right before a standalone symbol: [\w-+] causes a "bad character range" error. This is valid in .NET and some other regex flavors, but not in Python re.
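A quick way to check these placement rules is to try compiling each pattern. This is only a small sketch; the helper name compiles is made up for illustration and is not part of the re module.

```python
import re


def compiles(pattern):
    """Return True if `pattern` is accepted by re.compile (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Hyphen at the start, at the end, after a valid range, and escaped.
valid = [r"[-A-Z]", r"[A-Z()-]", r"[A-Z-+]", r"[A-Z\-+]"]

# In each valid pattern the hyphen is a literal, so "-" itself matches.
hyphen_matches = [bool(re.fullmatch(p, "-")) for p in valid]

# Hyphen between a shorthand class and a standalone symbol is rejected.
shorthand_ok = compiles(r"[\w-+]")

print(hyphen_matches, shorthand_ok)
```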

Here, put the hyphen at the end of the character class, or escape it.

Use

re.split(r"[^a-zA-Z0-9_'/-]+", b)

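In Python 2.7, where \w matches only [a-zA-Z0-9_] by default, you may even contract it to the following (in Python 3, pass flags=re.ASCII to keep \w ASCII-only):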

re.split(r"[^\w'/-]+", b)
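To see the difference, here is a minimal sketch comparing the fixed pattern with a version that has the hyphen between ' and /. The misplaced-hyphen pattern and the sample string b are assumptions for illustration, since the question's original pattern and input are not shown here.

```python
import re

# Hypothetical sample input; the original question's string is not shown.
b = "and/or, well-known: don't"

# Assumed buggy form: '-/ is read as a range, which covers the comma.
buggy = re.split(r"[^a-zA-Z0-9_'-/]+", b)

# Fixed form: the hyphen sits at the end, so it is a literal.
fixed = re.split(r"[^a-zA-Z0-9_'/-]+", b)

print(buggy)  # ['and/or,', 'well-known', "don't"]
print(fixed)  # ['and/or', 'well-known', "don't"]
```

With the hyphen at the end, the comma is no longer part of the allowed set, so it acts as a delimiter.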

Solution 2:

The '-/ is interpreted as a range from ASCII value 39 (') to 47 (/), which includes the comma (ASCII value 44).

You will have to put the - either at the beginning or at the end of the character class.
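The range can be listed directly to confirm which characters it covers. This is just an illustrative sketch; the variable names are arbitrary.

```python
# Code points of the two endpoints of the accidental range '-/
start, end = ord("'"), ord("/")

# Every character the range '-/ matches inside a character class.
covered = [chr(c) for c in range(start, end + 1)]

print(start, end)  # 39 47
print(covered)     # ["'", '(', ')', '*', '+', ',', '-', '.', '/']
```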
