
Read A File And Try To Remove All Non Utf-8 Chars

I am trying to read a file and convert its contents to a UTF-8 string, in order to remove some non-UTF-8 characters from it:

file_str = open(file_path, 'r').read()
file_str = f

Solution 1:

Remove the .decode('utf8') call. Your file data has already been decoded: in Python 3, the open() call in text mode (the default) returns a file object that decodes the data to Unicode strings for you.

You probably do want to add the encoding to the open() call to make this explicit. Otherwise Python uses a system default, and that may not be UTF-8:

file_str = open(file_path, 'r', encoding='utf8').read()

For example, on Windows the default codec is almost certainly going to be wrong for UTF-8 data, but you won't see the problem until you try to read the text; you'd find you have mojibake, as the UTF-8 data is decoded using CP1252 or a similar 8-bit codec.

See the open() function documentation for further details.
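To make the difference concrete, here is a small self-contained sketch (the file is a temporary one created for the demonstration) showing the same UTF-8 bytes read back with an explicit encoding versus misread as CP1252:

```python
import os
import tempfile

# Write UTF-8 text containing a non-ASCII character to a temporary file.
text = "café"
with tempfile.NamedTemporaryFile("w", encoding="utf8",
                                 suffix=".txt", delete=False) as f:
    f.write(text)
    path = f.name

# With an explicit encoding, the bytes decode correctly
# regardless of the system default.
with open(path, "r", encoding="utf8") as f:
    assert f.read() == "café"

# Decoding the same bytes as CP1252 produces mojibake: the two
# UTF-8 bytes of "é" (0xC3 0xA9) appear as two separate characters.
with open(path, "r", encoding="cp1252") as f:
    print(f.read())  # cafÃ©

os.remove(path)
```

The data on disk is identical in both reads; only the codec used to interpret the bytes differs.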

Solution 2:

If you use

file_str = open(file_path, 'r', encoding='utf8', errors='ignore').read()

then byte sequences that are not valid UTF-8 are essentially ignored. See the open() function documentation for details; it has a section on the possible values for the errors parameter.
