
Unable to Process Accented Words Using the NLTK Tokeniser

I'm trying to compute the frequencies of words in a UTF-8 encoded text file with the following code. Having successfully tokenized the file content and then looped through the words, I find that the accented words are destroyed when written to the output file.

Solution 1:

If you're using Python 2.x, reset the default encoding to 'utf8':

import sys
reload(sys)  # reload re-exposes setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('utf8')

Alternatively, you can use the ucsv module; see General Unicode/UTF-8 support for csv files in Python 2.6.
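For illustration, a minimal sketch of the ucsv route, assuming the ucsv package is installed (pip install ucsv) and a hypothetical someutf8.csv input file; ucsv presents itself as a drop-in replacement for the stdlib csv module that returns unicode cells:

import ucsv as csv  # unicode-aware drop-in for the stdlib csv module

with open('someutf8.csv', 'rb') as f:  # py2 csv readers expect binary mode
    for row in csv.reader(f):
        # each cell arrives as a unicode object, so accents survive intact
        print u', '.join(row)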

Or use io.open():

$ echo """rt annesorose envie crêpes
> envoyé jerrylee bonjour monde dimanche crepes dimanche
> The output written in a file is destroying certain words.
> bonnes crepes tour nouveau vélo
> aime crepe soleil ça fera bien recharger batteries vu jours hard annoncent""" > someutf8.txt
$ python
>>> import io, csv
>>> text = io.open('someutf8.txt', 'r', encoding='utf8').read().split('\n')
>>> for row in text:
... print row
... 
rt annesorose envie crêpes
envoyé jerrylee bonjour monde dimanche crepes dimanche
The output written in a file is destroying certain words.
bonnes crepes tour nouveau vélo
aime crepe soleil ça fera bien recharger batteries vu jours hard annoncent

Lastly, rather than writing such complex reading and counting code yourself, simply use FreqDist from NLTK; see section 3.1 of http://www.nltk.org/book/ch01.html
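For instance, a minimal sketch reusing someutf8.txt from above; the indexing works on any NLTK version, while most_common() assumes NLTK 3, where FreqDist subclasses collections.Counter (the order among tied counts may vary):

$ python
>>> import io
>>> from nltk import word_tokenize, FreqDist
>>> text = io.open('someutf8.txt', 'r', encoding='utf8').read()
>>> fdist = FreqDist(word_tokenize(text))
>>> fdist[u'crepes']
2
>>> fdist.most_common(2)
[(u'crepes', 2), (u'dimanche', 2)]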

Or, personally, I prefer collections.Counter:

$ python
>>> import io
>>> text = io.open('someutf8.txt', 'r', encoding='utf8').read()
>>> from collections import Counter
>>> from nltk import word_tokenize
>>> Counter(word_tokenize(text))
Counter({u'crepes': 2, u'dimanche': 2, u'fera': 1, u'certain': 1, u'is': 1, u'bonnes': 1, u'v\xe9lo': 1, u'batteries': 1, u'envoy\xe9': 1, u'vu': 1, u'file': 1, u'in': 1, u'The': 1, u'rt': 1, u'jerrylee': 1, u'destroying': 1, u'bien': 1, u'jours': 1, u'.': 1, u'written': 1, u'annesorose': 1, u'annoncent': 1, u'nouveau': 1, u'envie': 1, u'hard': 1, u'cr\xeapes': 1, u'\xe7a': 1, u'monde': 1, u'words': 1, u'bonjour': 1, u'a': 1, u'crepe': 1, u'soleil': 1, u'tour': 1, u'aime': 1, u'output': 1, u'recharger': 1})
>>> myFreqDist = Counter(word_tokenize(text))
>>> for word, freq in myFreqDist.items():
... print word, freq
... 
fera 1
crepes 2
certain 1
is 1
bonnes 1
vélo 1
batteries 1
envoyé 1
vu 1
file 1
in 1
The 1
rt 1
jerrylee 1
destroying 1
bien 1
jours 1
. 1
written 1
dimanche 2
annesorose 1
annoncent 1
nouveau 1
envie 1
hard 1
crêpes 1
ça 1
monde 1
words 1
bonjour 1
a 1
crepe 1
soleil 1
tour 1
aime 1
output 1
recharger 1
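
Finally, to avoid the mangled output that prompted the question, write the counts back out through io.open with an explicit encoding as well; a minimal sketch, with freqs.txt as a hypothetical output filename:

import io
from collections import Counter
from nltk import word_tokenize

text = io.open('someutf8.txt', 'r', encoding='utf8').read()
myFreqDist = Counter(word_tokenize(text))

# io.open takes unicode directly, so words like u'vélo' and u'crêpes' are written intact
with io.open('freqs.txt', 'w', encoding='utf8') as fout:
    for word, freq in myFreqDist.items():
        fout.write(u'%s %d\n' % (word, freq))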
