
Pyarrow Find Bad Lines In Csv To Parquet Conversion

I'm getting "CSV column #10: CSV conversion error to string: invalid UTF8 data" while converting a large CSV to Parquet. From the looks of the error, it seems like appropriate column

Solution 1:

There is no option today to report the line number or the failing line. There is some ongoing work to improve error handling but even that work does not yet reveal the line number of decode errors. I'd recommend creating a JIRA issue.

As @0x26res correctly stated, you can specify the column as binary and then inspect it manually in memory. You can use the cast compute function to cast from binary to string, which performs UTF8 validation, but unfortunately it does not report the failing index today either.
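The binary-column approach above can be sketched roughly as follows. This is only an illustration, not an official PyArrow recipe: the helper names find_invalid_utf8 and bad_rows_in_column are hypothetical, and the sketch assumes the column fits in memory as a Python list and that the header row itself is valid UTF-8.

```python
def find_invalid_utf8(values):
    """Return (row_index, raw_bytes) pairs for values that are not valid UTF-8."""
    bad = []
    for i, raw in enumerate(values):
        if raw is None:  # nulls carry no bytes to validate
            continue
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((i, raw))
    return bad


def bad_rows_in_column(path, column):
    """Read `column` as binary with pyarrow and list rows that fail UTF-8 decoding."""
    # Imported here so find_invalid_utf8 stays usable without pyarrow installed.
    import pyarrow as pa
    from pyarrow import csv

    # Reading as binary skips UTF-8 validation for this column only.
    opts = csv.ConvertOptions(column_types={column: pa.binary()})
    table = csv.read_csv(path, convert_options=opts)
    return find_invalid_utf8(table.column(column).to_pylist())
```

The returned index is a zero-based row index into the table, not a line number in the file. With a single header line and no quoted multi-line fields, the file line would be the index plus 2, but that is an assumption about the input.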

As a workaround, you can use the pandas CSV parser, which should give you the byte offset of the failure:

>>> import pandas
>>> pandas.read_csv("/tmp/blah.csv")
Traceback (most recent call last):
  ... # Omitted for brevity
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 29: invalid start byte
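If you would rather get line numbers directly, a different technique from the pandas workaround is to scan the file yourself in binary mode and try to decode each physical line. The sketch below uses only the standard library; find_bad_lines is a hypothetical helper name, and it assumes lines end in "\n" and that no quoted field spans several lines, so a physical line may not match a CSV record in every file.

```python
def find_bad_lines(path, encoding="utf-8"):
    """Yield (line_number, byte_offset, error) for each line that fails to decode."""
    offset = 0  # byte offset of the start of the current line within the file
    with open(path, "rb") as f:
        for line_number, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start is the position of the offending byte within this line
                yield line_number, offset + exc.start, exc
            offset += len(raw)
```

Used as `for n, off, err in find_bad_lines("/tmp/blah.csv"): print(n, off, err)`, it reports only the first invalid byte on each bad line, which is usually enough to locate and repair the record.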
