Find Out The Unicode Script Of A Character
Solution 1:
I was hoping someone had done it before, but apparently not, so here's what I've ended up with. The module below (I call it unicodedata2) extends unicodedata and provides script_cat(chr), which returns a tuple (Script name, Category) for a Unicode character. Example:
# coding=utf8
import unicodedata2
print unicodedata2.script_cat(u'Ф') #('Cyrillic', 'L')
print unicodedata2.script_cat(u'の') #('Hiragana', 'Lo')
print unicodedata2.script_cat(u'★') #('Common', 'So')
The module: https://gist.github.com/2204527
Solution 2:
It seems to me that the Python unicodedata module contains tools for accessing the main file in the Unicode database but nothing for the other files: “The data in this database is based on the UnicodeData.txt file”
The script information is in the Scripts.txt file. It is of relatively simple format (described in UAX #44) and not horribly large (131 kilobytes), so you might consider parsing it in your program. Note that in the Unicode classification, there’s the “Common” script that contains characters used in different scripts, like punctuation marks.
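As a rough sketch of what such parsing could look like: the function names, the regular expression and the SAMPLE excerpt below are illustrative assumptions, not part of any library. The sample lines follow the Scripts.txt layout with trailing comments shortened; a real program would read the downloaded file instead.

```python
import re

# Data lines look like "0041..005A    ; Latin # ..." (a range) or
# "00AA          ; Latin # ..." (a single code point). Everything else
# in the file is a comment or a blank line.
_LINE = re.compile(r'^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)')

def parse_scripts(lines):
    """Return a list of (first, last, script) tuples from Scripts.txt-style lines."""
    ranges = []
    for line in lines:
        m = _LINE.match(line)
        if m is None:
            continue  # comment or blank line
        first = int(m.group(1), 16)
        last = int(m.group(2) or m.group(1), 16)  # single code point: last == first
        ranges.append((first, last, m.group(3)))
    return ranges

def script_of(char, ranges, default='Unknown'):
    """Linear scan over the parsed ranges; simple, not optimized."""
    cp = ord(char)
    for first, last, script in ranges:
        if first <= cp <= last:
            return script
    return default

# Hypothetical excerpt for demonstration only.
SAMPLE = """\
# Excerpt in the Scripts.txt layout
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0400..0481    ; Cyrillic # L& [130]
3041..3096    ; Hiragana # Lo  [86]
""".splitlines()

ranges = parse_scripts(SAMPLE)
print(script_of('Ф', ranges))
print(script_of('の', ranges))
```

With the full file, the parsed list has a couple of thousand entries, so a linear scan per character is usually acceptable; sorting the ranges and using binary search is an option if it is not.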
Solution 3:
You can use ord to retrieve the numeric value of a character (it works on both unicode and byte strings of length 1).
The next step, unfortunately, will involve you then testing against the ranges. Possibly the data here will be of assistance: http://cldr.unicode.org/index/downloads
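Testing against ranges might look something like the sketch below. The RANGES table is a hand-written, deliberately incomplete illustration (the names RANGES, STARTS and lookup are made up for this example); real data would come from the Unicode or CLDR files.

```python
import bisect

# Illustrative table of (first, last, label) entries.
# It must be sorted by first code point and contain no overlaps.
RANGES = [
    (0x0041, 0x005A, 'Latin'),
    (0x0061, 0x007A, 'Latin'),
    (0x0400, 0x0481, 'Cyrillic'),
    (0x3041, 0x3096, 'Hiragana'),
]
STARTS = [first for first, _, _ in RANGES]

def lookup(char):
    """Return the label of the range containing char, or None if no range matches."""
    cp = ord(char)
    # Index of the last range whose start is <= cp.
    i = bisect.bisect_right(STARTS, cp) - 1
    if i >= 0:
        first, last, label = RANGES[i]
        if cp <= last:
            return label
    return None

print(lookup('a'))
print(lookup('Ф'))
```

Binary search keeps each lookup logarithmic in the number of ranges, which matters more once the table covers the whole code space.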
Solution 4:
The only way I know of is, unfortunately, to get the Unicode code point with ord() and then use your own table (built from http://en.wikipedia.org/wiki/Unicode#Standardized_subsets and other sources). A preliminary conversion to a normalization form such as NFC may be in order, so as to handle the fact that a single "written" character can be expressed with different sequences of code points (the unicodedata module helps here).
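To illustrate the normalization point, here is a small sketch using unicodedata.normalize from the standard library. The helper code_points is a made-up name for this example.

```python
import unicodedata

def code_points(text):
    """List the code points in text as integers."""
    return [ord(ch) for ch in text]

# 'e' followed by COMBINING ACUTE ACCENT: one visible character, two code points.
decomposed = 'e\u0301'
# NFC composes the pair into the single precomposed code point U+00E9.
composed = unicodedata.normalize('NFC', decomposed)

print([hex(cp) for cp in code_points(decomposed)])
print([hex(cp) for cp in code_points(composed)])
```

After NFC, ord() can be applied to the single composed character. Sequences without a precomposed form stay as several code points, so code that calls ord() on "one character" still has to cope with that case.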
Solution 5:
Oftentimes it is enough to detect whether a certain script is used, and then you can use unicodedata.name with prefix matching. For example, to find out whether a letter is Cyrillic, you can use:
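import unicodedata

class CharacterNamePrefixTester(dict):
    """Maps a character to whether its Unicode name starts with the given prefix."""
    def __init__(self, prefix):
        self.prefix = prefix
    def __missing__(self, key):
        # Called only on a cache miss; the result is stored for later lookups.
        self[key] = unicodedata.name(key, '').startswith(self.prefix)
        return self[key]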
>>> cyrillic = CharacterNamePrefixTester('CYRILLIC ')
>>> cyrillic['й']
True
>>> cyrillic['a']
False
The dictionary is built lazily but the truth values are memoized so that future lookups of the same letter will be faster.