Reading Csv With Separator In Python Dask
I am trying to create a DataFrame by reading a csv file separated by '#####' (5 hashes). The code is:
import dask.dataframe as dd
df = dd.read_csv('D:\temp.csv',sep='#####',engine='p
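The call is cut off above; presumably it ends with engine='python', since pandas uses the Python parser for multi-character separators. A rough reconstruction of the attempt:
import dask.dataframe as dd

# Reconstruction of the truncated call; engine='python' is an assumption.
# A raw string (r'...') keeps the '\t' in the path from being read as a tab.
df = dd.read_csv(r'D:\temp.csv', sep='#####', engine='python')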
Solution 1:
Read the entire file in as dtype=object, meaning all columns will be interpreted as type object. This should read in correctly, getting rid of the ##### in each row. From there you can turn it into a pandas frame using the compute() method. Once the data is in a pandas frame, you can use the pandas infer_objects() method to update the types without having to hard-code them.
import dask.dataframe as dd
df = dd.read_csv(r'D:\temp.csv', sep='#####', dtype='object').compute()
res = df.infer_objects()
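As a quick sanity check, here is a minimal, self-contained sketch (using a small throwaway file instead of the D:\temp.csv from the question) that reads everything as object, computes, and then lets infer_objects report what it can convert:
import dask.dataframe as dd

# Hypothetical sample file with a '#####' separator.
with open('temp_sample.csv', 'w') as f:
    f.write('a#####b#####c\n')
    f.write('1#####2.5#####x\n')
    f.write('3#####4.5#####y\n')

# Everything is read as object, so no dtype inference can fail mid-read.
df = dd.read_csv('temp_sample.csv', sep='#####', dtype='object').compute()
print(df.dtypes)   # all columns show as object at this point

res = df.infer_objects()
print(res.dtypes)  # shows which columns infer_objects managed to convert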
Solution 2:
If you want to keep the entire file as a dask dataframe, I had some success with a dataset with a large number of columns simply by increasing the number of bytes sampled in read_csv.
For example:
import dask.dataframe as dd
df = dd.read_csv(r'D:\temp.csv', sep='#####', sample=1000000)  # increase sample to 1e6 bytes
df.head()
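The larger sample simply gives dask more of the file to look at when it guesses column dtypes up front, so a quick way to see whether it helped is to inspect df.dtypes before computing anything. A small sketch, reusing the throwaway temp_sample.csv from above:
import dask.dataframe as dd

# sample is the number of bytes dask reads when inferring column dtypes;
# a bigger sample generally means a more reliable guess.
df = dd.read_csv('temp_sample.csv', sep='#####', sample=1000000)

print(df.dtypes)   # dtypes inferred from the sample, no compute() needed
print(df.head())   # head() only parses the first partition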
This can resolve some type-inference issues, although unlike Benjamin Cohen's answer, you would need to find the right value to choose for sample.