
Regex [a-z] Do Not Recognize Local Characters

I've checked other questions and read their solutions; they do not work. I've tested the regular expression and it works on non-locale characters. Code is simply to find any capita

Solution 1:

The problem is that Ş is not in the range [A-Z]. That range is the class of all characters whose codepoints lie between U+0041 and U+005A (inclusive). (If you were using bytes-mode, it would be all bytes between 0x41 and 0x5A.) And Ş is U+015E (or, e.g., 0xAA in bytes, assuming latin2), which isn't in that range.
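A quick way to check the codepoint argument above for yourself. This is only a sketch: the variable name `s` is mine, and the latin2 line assumes the text would be encoded as ISO-8859-2.

```python
import re

s = "\u015e"  # Ş, LATIN CAPITAL LETTER S WITH CEDILLA

print(hex(ord("A")), hex(ord("Z")))  # the two ends of the [A-Z] range
print(hex(ord(s)))                   # codepoint of Ş, well outside that range
print(s.encode("latin2"))            # the single byte it becomes in latin2
print(re.search(r"[A-Z]", s))        # None: the range does not match it
print(s.isupper())                   # True: Unicode still treats it as uppercase
```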

And using a locale won't change that. As re.LOCALE explains, all it does is:

Make \w, \W, \b, \B and case-insensitive matching dependent on the current locale.

Also, you almost never want to use re.LOCALE. As the docs say:

The use of this flag is discouraged as the locale mechanism is very unreliable, it only handles one “culture” at a time, and it only works with 8-bit locales.

If you only care about a single script, you can build a class of the appropriate ranges for that script.
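For example, assuming the text is Turkish (my guess from the sample string, not something the question states), a hand-built class could look roughly like this. The name `TURKISH_UPPER` is made up for illustration.

```python
import re

# Assumption: Turkish text. ASCII A-Z plus the extra Turkish capital letters.
TURKISH_UPPER = "[A-ZÇĞİÖŞÜ]"

# A word character followed directly by an uppercase letter from the class.
pattern = re.compile(r"(\w)(" + TURKISH_UPPER + ")")

corp = "minikŞeker bir kedi"
result = pattern.sub(r"\1 \2", corp)
print(result)
```

The obvious drawback is that the class has to be extended by hand for every additional alphabet you want to support.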

If you want to work with all scripts, you need to build a class out of a Unicode character class like Lu for "all uppercase letters". Unfortunately, Python's re doesn't have a mechanism for doing this directly. You can build a giant class out of the information in unicodedata, but that's pretty annoying:

import re
import unicodedata
Lu = '[' + ''.join(chr(c) for c in range(0x110000)
                   if unicodedata.category(chr(c)) == 'Lu') + ']'

And then:

pattern = re.compile(r"([\w]{1})()(" + Lu + r"{1})", re.U)

… or maybe:

pattern = re.compile(rf"([\w]{{1}})()({Lu}{{1}})", re.U)

Part of the reason re doesn't have any way to specify Unicode classes is that, for a long time, the plan was to replace re with a new module, so many suggested new features for re were rejected. The good news is that the intended new module is available as a third-party library, regex. It works just fine, and is a near drop-in replacement for re; it was just improving too quickly to lock it down to the slower Python release schedule. If you install it, then you can write your code this way:

import regex
corp = "minikŞeker bir kedi"
pattern = regex.compile(r"([\w]{1})()(\p{Lu}{1})", regex.U)
corp = regex.sub(pattern, r"\1 \3", corp)
print(corp)

The only change I made was to replace re with regex, and then use \p{Lu} instead of [A-Z].

There are, of course, lots of other regex engines out there, and many of them also support Unicode character classes. Most of those that do follow some variation on the same \p syntax. (They all copied it from Perl, but the details differ—e.g., regex's idea of Unicode classes comes from the unicodedata module, while PCRE and PCRE2 attempt to be as close to Perl as possible, and so on.)

Solution 2:

abarnet's answer is great, but if all you want to do is find upper case characters, str.isupper() works without the need for an extra module.

>>> foo = "minikŞeker bir kedi"
>>> for i, c in enumerate(foo):
...     if c.isupper():
...         print(foo[i-1:i+2])
...         break
...
kŞe

or perhaps

>>> foo = "minikŞeker bir kedi"
>>> ''.join((' ' if c.isupper() else '') + c for c in foo)
'minik Şeker bir kedi'
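One caveat with the join version: it puts a space before every uppercase character, including one at the start of the string or one that already follows a space. If that matters, a small helper along these lines might do. The function name `split_before_upper` is hypothetical, just a sketch of the same str.isupper() idea.

```python
def split_before_upper(text):
    """Insert a space before each uppercase character that directly
    follows a non-whitespace character."""
    out = []
    # Pair every character with its predecessor; the first one gets a
    # dummy space as predecessor so it is never split.
    for prev, cur in zip(" " + text, text):
        if cur.isupper() and not prev.isspace():
            out.append(" ")
        out.append(cur)
    return "".join(out)


print(split_before_upper("minikŞeker bir kedi"))
print(split_before_upper("Minik Şeker"))
```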
