Reading Russian Language Data From Csv

I have some data in CSV file that are in Russian: 2-комнатная квартира РДТ', мкр Тастак-3, Аносова — Толе би;Алматы 2-комна

Solution 1:

\xea is the windows-1251 (cp1251) byte for к. Therefore, you need to decode the file as windows-1251, not UTF-8.
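As a quick check of that claim, here is a small Python 3 sketch that decodes the same bytes both ways. The sample byte string is made up for illustration and is not taken from the question's file.

```python
# Python 3 sketch: the same bytes are valid windows-1251 but invalid UTF-8.
# Hypothetical sample: the word "квартира" encoded as windows-1251.
raw = b"\xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0"

# "cp1251" is Python's codec name for windows-1251.
decoded = raw.decode("cp1251")
print(decoded)

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    # 0xEA opens a multi-byte UTF-8 sequence, but the next byte is not a
    # valid continuation byte, so decoding fails.
    utf8_ok = False
    print("UTF-8 decoding failed:", exc)
```

If the UTF-8 attempt fails like this on your own file, that is a strong hint the file was saved in a legacy single-byte encoding.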

In Python 2.7, the csv module does not handle Unicode input directly. See "Unicode" in https://docs.python.org/2/library/csv.html

The documentation proposes a simple workaround using:

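import csv, codecs  # Python 2 only: relies on unicode() and .next()

class UTF8Recoder:
    """Iterator that reads an encoded stream and re-encodes the input to UTF-8."""
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self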

This would allow you to do:

def loadCsv(filename):
    lines = UnicodeReader(open(filename, "rb"), delimiter=";", encoding="windows-1251")
    # if you really need lists then uncomment the next line
    # this will let you call exact lines by doing `line_12 = lines[12]`
    # return list(lines)
    # this will return an "iterator", so that the file is read on each call
    # use this if you'll do a `for x in x`
    return lines

If you try to print dataset, you'll get the representation of a list within a list, where the outer list holds the rows and each inner list holds the columns of one row. Any encoded bytes or non-ASCII literals will be shown as \x or \u escapes. To print the actual values, do:

for csv_line in loadCsv("myfile.csv"):
    print u", ".join(csv_line)

If you need to write your results to another file (fairly typical), you could do:

import io

with io.open("my_output.txt", "w", encoding="utf-8") as my_output:
    for csv_line in loadCsv("myfile.csv"):
        my_output.write(u", ".join(csv_line) + u"\n")

This will automatically convert/encode your output to UTF-8.
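On Python 3 the UnicodeReader wrapper is not needed, because open() decodes text itself. Below is a minimal sketch of the same read-then-write flow, assuming a semicolon-delimited windows-1251 input as in the question. The function name convert_csv_to_utf8 and its parameters are hypothetical, chosen only for this example.

```python
import csv


def convert_csv_to_utf8(src_path, dst_path, src_encoding="cp1251", delimiter=";"):
    """Read a delimited text file in src_encoding and rewrite its rows as UTF-8."""
    rows_written = 0
    # newline="" lets the csv module handle line endings itself,
    # as the csv documentation recommends.
    with open(src_path, encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for row in csv.reader(src, delimiter=delimiter):
            # Join the columns of each row, one output line per input row.
            dst.write(", ".join(row) + "\n")
            rows_written += 1
    return rows_written
```

It could be called as convert_csv_to_utf8("myfile.csv", "my_output.txt"); pass a different src_encoding or delimiter if your file differs from these assumptions.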

Solution 2:

You can try:

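import pandas as pd
# pass the encoding by keyword; a second positional argument would not be read as the encoding
pd.read_csv(path_file, encoding="cp1251")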

or

import csv
with open(path_file, encoding="cp1251", errors='ignore') as source_file:
    reader = csv.reader(source_file, delimiter=",")

Solution 3:

Could your .csv be in an encoding other than UTF-8? Judging by the error message, it must be. Try other Cyrillic encodings such as Windows-1251, CP866 or KOI8-R.
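One way to act on this is to decode a sample of the file with each candidate and look at which result reads as Russian. Here is a rough Python 3 sketch. The helper name decode_with_candidates and the candidate list are assumptions for illustration, and the final choice is left to a human reader rather than automated.

```python
def decode_with_candidates(raw, candidates=("utf-8", "cp1251", "cp866", "koi8_r")):
    """Map each candidate codec name to the decoded text, or None if decoding fails."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Hypothetical sample bytes; in practice, read the first few KB of the file in "rb" mode.
sample = "Алматы".encode("cp1251")
for name, text in decode_with_candidates(sample).items():
    print(name, "->", repr(text))
```

Several single-byte codecs will decode the same bytes without raising an error, so a successful decode alone does not prove the encoding is right. Only the candidate that produces readable words is the correct one.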

Solution 4:

In Python 3:

import csv

path = 'C:/Users/me/Downloads/sv.csv'
with open(path, encoding="UTF8") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
