Reading CSV With Separator In Python Dask

I am trying to create a DataFrame by reading a CSV file whose fields are separated by '#####' (five hashes). The code is: import dask.dataframe as dd df = dd.read_csv('D:\temp.csv',sep='#####',engine='p

Solution 1:

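Read the entire file in with dtype=object, so that every column is interpreted as type object. This should read in correctly and split each row on the ##### separator. From there you can turn the result into a pandas frame using the compute() method. Once the data is in a pandas frame, you can use the pandas infer_objects method to update the types without having to hard-code them. Note that infer_objects only performs a soft conversion: values that were read in as strings stay strings, so numeric columns may still need an explicit conversion such as pandas.to_numeric.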

import dask.dataframe as dd
# Raw string, so the '\t' in the Windows path is not read as a tab character
df = dd.read_csv(r'D:\temp.csv', sep='#####', engine='python', dtype='object').compute()
res = df.infer_objects()
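
As a minimal sketch of the explicit conversion step, the snippet below uses plain pandas on a small in-memory sample, so it runs without Dask or the real file. The sample data and the helper name to_numeric_where_possible are made up for illustration; the sep, engine and dtype arguments are the same ones that dd.read_csv would forward to pandas.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for the real '#####'-separated file.
raw = "id#####name#####score\n1#####alice#####3.5\n2#####bob#####4.0\n"

# Separators longer than one character are treated as regular expressions
# and require the Python parsing engine.
df = pd.read_csv(io.StringIO(raw), sep="#####", engine="python", dtype=object)

def to_numeric_where_possible(frame):
    """Return a copy in which every fully numeric column is converted."""
    out = frame.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            pass  # column contains non-numeric text, leave it unchanged
    return out

res = to_numeric_where_possible(df)
print(res.dtypes)
```

With a real Dask workflow, the same helper could be applied to the pandas frame returned by compute().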

Solution 2:

If you want to keep the entire file as a dask dataframe, I had some success with a dataset with a large number of columns simply by increasing the number of bytes sampled in read_csv (the sample argument, which dask uses to infer column types from the start of the file).

For example:

import dask.dataframe as dd
# Sample 1e6 bytes for dtype inference instead of the default
df = dd.read_csv(r'D:\temp.csv', sep='#####', engine='python', sample=1000000)
df.head()

This can resolve some type inference issues, although unlike Benjamin Cohen's answer, you would need to find a suitable value for sample.
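
To see why the sample size matters, here is a rough illustration in plain pandas rather than Dask: inferring types from only the first rows can give a different answer than inferring from the whole file. The data and the helper infer_dtypes are hypothetical, and nrows is only a stand-in for Dask's byte-based sample.

```python
import io
import pandas as pd

# Hypothetical data: 'code' looks numeric at the top but holds text further down.
raw = "id#####code\n1#####100\n2#####200\n3#####X17\n"

def infer_dtypes(text, nrows=None):
    """Infer column dtypes from the first `nrows` data rows (all rows if None)."""
    frame = pd.read_csv(io.StringIO(text), sep="#####", engine="python", nrows=nrows)
    return {col: str(dtype) for col, dtype in frame.dtypes.items()}

small_sample = infer_dtypes(raw, nrows=2)  # sees only numeric values in 'code'
full_sample = infer_dtypes(raw)            # also sees the text value

print(small_sample)
print(full_sample)
```

If a larger sample is impractical, passing an explicit dtype for the affected columns to dd.read_csv is another option that avoids tuning sample.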

