Read A File And Try To Remove All Non Utf-8 Chars
Solution 1:
Remove the .decode('utf8') call. Your file data has already been decoded, because in Python 3 the open() call in text mode (the default) returns a file object that decodes the data to Unicode strings for you.
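A minimal sketch of the point above, assuming a small temporary file with made-up sample content (the file and the variable names other than file_str are illustrative, not from the question): text mode already returns str, which has no .decode() method, while binary mode returns bytes that you decode yourself.

```python
import os
import tempfile

# Write a small UTF-8 file to a temporary location (hypothetical sample data).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("café".encode("utf8"))
    file_path = tmp.name

# Text mode: Python decodes the bytes for you and returns a str.
with open(file_path, "r", encoding="utf8") as f:
    file_str = f.read()

is_text = isinstance(file_str, str)
has_decode = hasattr(file_str, "decode")  # str has no .decode() in Python 3

# Binary mode: you get raw bytes and decode them explicitly.
with open(file_path, "rb") as f:
    raw = f.read()
decoded = raw.decode("utf8")

os.remove(file_path)
```

Calling file_str.decode('utf8') here would raise AttributeError, which is why the call has to go.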
You probably do want to add the encoding to the open() call to make this explicit. Otherwise Python uses a system default, and that may not be UTF-8:
file_str = open(file_path, 'r', encoding='utf8').read()
For example, on Windows, the default codec is almost certainly going to be wrong for UTF-8 data, but you won't see the problem until you try to read the text; you'd find you have mojibake, because the UTF-8 data was decoded using CP1252 or a similar 8-bit codec.
See the open() function documentation for further details.
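To make the mojibake concrete, here is a small sketch that decodes the same UTF-8 bytes with two codecs. The sample string is made up for illustration; CP1252 stands in for whatever 8-bit system default might apply.

```python
# UTF-8 encodes "é" as the two bytes 0xC3 0xA9.
data = "café".encode("utf8")

# Decoding with the wrong 8-bit codec does not fail; it silently
# maps each byte to a separate character.
wrong = data.decode("cp1252")

# Decoding with the codec the data was written in gives the original text.
right = data.decode("utf8")

print(wrong)  # cafÃ©
print(right)  # café
```

No exception is raised in the wrong case, which is why the problem only shows up when someone reads the text.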
Solution 2:
If you use
file_str = open(file_path, 'r', encoding='utf8', errors='ignore').read()
then any byte sequences that are not valid UTF-8 are silently dropped. Read the open() function documentation for details; it has a section on the possible values for the errors parameter.
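As a rough illustration of how the errors parameter changes the result, the sketch below writes a temporary file containing one invalid byte (0xFF, which never appears in valid UTF-8) and reads it back with three handlers. The file contents and variable names are assumptions made for the example.

```python
import os
import tempfile

# Hypothetical sample: valid UTF-8 text with a stray 0xFF byte in the middle.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"caf\xc3\xa9 \xff ok")
    file_path = tmp.name

# errors='ignore' drops the invalid byte without any trace.
with open(file_path, "r", encoding="utf8", errors="ignore") as f:
    ignored = f.read()

# errors='replace' substitutes U+FFFD so the damage stays visible.
with open(file_path, "r", encoding="utf8", errors="replace") as f:
    replaced = f.read()

# The default, errors='strict', raises instead of guessing.
try:
    with open(file_path, "r", encoding="utf8") as f:
        f.read()
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

os.remove(file_path)
```

Whether 'ignore' or 'replace' fits better depends on whether you need to know that data was lost; 'ignore' gives cleaner output, 'replace' leaves a marker at each bad byte.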