Pyarrow Find Bad Lines In Csv To Parquet Conversion

March 21, 2024

I'm getting "CSV column #10: CSV conversion error to string: invalid UTF8 data" while converting a large CSV to Parquet. From the looks of the error, it seems like appropriate column

Solution 1:

There is no option today to report the line number or the failing line. There is some ongoing work to improve error handling but even that work does not yet reveal the line number of decode errors. I'd recommend creating a JIRA issue.

As @0x26res correctly stated, you can specify the column as binary and then inspect it manually in memory. You can use the cast compute function to cast from binary to string, which performs UTF8 validation, but unfortunately it does not report the failed index today either.

As a workaround, you can use the pandas CSV parser, which should give you the byte offset of the failure:


>>> import pandas
>>> pandas.read_csv("/tmp/blah.csv")
Traceback (most recent call last):
  ... # Omitted for brevity
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 29: invalid start byte
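If you would rather get a line number than a byte position, one option that needs no CSV parser at all is to scan the raw file and try to decode each physical line. The sketch below is an assumption-laden illustration rather than anything from the original answer: find_invalid_utf8_lines is a hypothetical helper, and it treats every newline as a line boundary, so a quoted field containing a newline is counted as more than one line.

```python
def find_invalid_utf8_lines(csv_path, encoding="utf-8"):
    """Return (line_number, file_byte_offset, raw_line) for undecodable lines.

    Hypothetical helper: line_number is 1-based, and file_byte_offset is the
    position of the first offending byte measured from the start of the file.
    """
    bad_lines = []
    line_start = 0  # byte offset of the current line within the file
    with open(csv_path, "rb") as handle:
        # Iterating a binary file yields one physical line (ending in b"\n") at a time.
        for line_number, raw_line in enumerate(handle, start=1):
            try:
                raw_line.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start is the index of the first bad byte within this line.
                bad_lines.append((line_number, line_start + exc.start, raw_line))
            line_start += len(raw_line)
    return bad_lines
```

Because it reads one line at a time, this scan keeps memory use low even on a large file, and the returned raw bytes can be inspected to guess the real encoding of the bad values.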
