I Want To Convert Very Large CSV Data To HDF5 In Python
Solution 1:
There are A LOT of ways to approach this problem. Before I show some code, a suggestion: Consider your data schema carefully. It is important. It will affect how easily you access and use the data. For example, your proposed schema makes it easy to access the data for one Firm for one Date. What if you want all the data for one Firm across a range of dates? Or all the data for all Firms for one Date? Both will require you to manipulate multiple arrays after you access the data.
Although counterintuitive, you may want to store the CSV data as a single Group/Dataset. I will show an example of each schema in the 2 methods below. Both methods use np.genfromtxt to read the CSV data. The optional parameter names=True will read the headers from row one in your CSV file if you have them. Omit names= if you don't have a header row and you will get default field names (f0, f1, f2, etc). My sample data is included at the end.
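To see what names=True changes, here is a small sketch that parses a two-row excerpt of the sample data from an in-memory string. The io.StringIO wrapper, the CSV_TEXT name, and the skip_header=1 argument are my additions for illustration; they are not part of the code in the two methods.

```python
import io

import numpy as np

# Two-row excerpt of the sample data, trimmed to two value columns
CSV_TEXT = """Date,Firm,value1,value2
2019-07-01,Firm1,7.634758e-01,5.781637e-01
2019-07-01,Firm2,6.457887e-01,6.150677e-01
"""

# names=True: field names are taken from the first row of the file
with_header = np.genfromtxt(io.StringIO(CSV_TEXT), delimiter=",",
                            dtype=None, names=True, encoding=None)
print(with_header.dtype.names)

# No names=: skip the header row by hand and accept the default field names
no_header = np.genfromtxt(io.StringIO(CSV_TEXT), delimiter=",",
                          dtype=None, skip_header=1, encoding=None)
print(no_header.dtype.names)

# Fields of the structured array are accessed by name
print(with_header["Firm"])
```

Because dtype=None lets NumPy pick a type per column, the result is a structured array with string fields for Date and Firm and float fields for the values.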
Method 1: using h5py
Group Names: Date
Dataset Names: Firms
import numpy as np
import h5py

csv_recarr = np.genfromtxt('SO_57120995.csv', delimiter=',', dtype=None, names=True, encoding=None)
print(csv_recarr.dtype)

with h5py.File('SO_57120995.h5', 'w') as h5f:
    for row in csv_recarr:
        date = row[0]
        # one group per Date; created on first use, reused afterwards
        grp = h5f.require_group(date)
        firm = row[1]
        # convert the row to a Python tuple and keep all the valueN entries
        row_data = row.item()[2:]
        grp.create_dataset(firm, data=row_data)
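To illustrate the schema trade-off with this layout: getting one Firm across a range of dates means visiting every Date group and stacking the pieces yourself. Below is a minimal sketch under assumptions. The sample dictionary, the temporary file, and the helper firm_over_dates are hypothetical names of mine, and the data is cut down to two values per row.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical miniature of the Method 1 layout: /<Date>/<Firm> -> 1-D array
sample = {
    "2019-07-01": {"Firm1": [0.76, 0.57], "Firm2": [0.64, 0.61]},
    "2019-07-02": {"Firm1": [0.41, 0.13], "Firm2": [0.10, 0.37]},
    "2019-07-03": {"Firm1": [0.22, 0.02], "Firm2": [0.24, 0.14]},
}


def firm_over_dates(h5f, firm, start, end):
    """Stack one firm's values for every Date group in [start, end]."""
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings
    dates = [d for d in sorted(h5f.keys())
             if start <= d <= end and firm in h5f[d]]
    # One read per date, then one row per date in the stacked result
    values = np.vstack([h5f[d][firm][()] for d in dates])
    return dates, values


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "by_date.h5")
    # Write the miniature file the same way Method 1 does
    with h5py.File(path, "w") as h5f:
        for date, firms in sample.items():
            grp = h5f.require_group(date)
            for firm, values in firms.items():
                grp.create_dataset(firm, data=values)
    # Read one firm back across a date range
    with h5py.File(path, "r") as h5f:
        dates, values = firm_over_dates(h5f, "Firm1", "2019-07-02", "2019-07-03")

print(dates)
print(values)
```

The loop over groups is the cost of this schema. With many dates it turns into many small reads, which is one reason to consider a single table instead.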
Method 2: using PyTables
All data stored in Dataset: /CSV_Data
import numpy as np
import tables as tb

csv_recarr = np.genfromtxt('SO_57120995.csv', delimiter=',', dtype=None, names=True, encoding=None)
print(csv_recarr.dtype)

with tb.open_file('SO_57120995_2.h5', 'w') as h5f:
    # this should work, but only first string character is loaded:
    #dset = h5f.create_table('/','CSV_Data',obj=csv_recarr)

    # create empty table
    dset = h5f.create_table('/', 'CSV_Data', description=csv_recarr.dtype)

    # workaround to add CSV data one line at a time
    for row in csv_recarr:
        append_list = []
        append_list.append(row.item()[:])
        dset.append(append_list)

    # Example to extract array of data based on field name
    firm_arr = dset.read_where('Firm==b"Firm1"')
    print(firm_arr)
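Both methods read the entire CSV into memory with np.genfromtxt, which may not be practical for a very large file. One possible variation, not part of the code above, is to read the file in fixed-size chunks with the standard csv module and append each chunk to the same kind of single table. Treat this as a sketch under assumptions: ROW_DTYPE, the string widths S10 and S16, the two value columns, the chunk_rows parameter, and the helper csv_to_table are all hypothetical choices of mine.

```python
import csv
import itertools
import os
import tempfile

import numpy as np
import tables as tb

# Assumed row layout with only two value columns; the string widths are
# guesses and must be wide enough for the real Date and Firm values.
ROW_DTYPE = np.dtype([("Date", "S10"), ("Firm", "S16"),
                      ("value1", "f8"), ("value2", "f8")])

SAMPLE_LINES = [
    "Date,Firm,value1,value2",
    "2019-07-01,Firm1,0.76,0.57",
    "2019-07-01,Firm2,0.64,0.61",
    "2019-07-02,Firm1,0.41,0.13",
    "2019-07-02,Firm2,0.10,0.37",
    "2019-07-03,Firm1,0.22,0.02",
]


def csv_to_table(csv_path, h5_path, chunk_rows=100_000):
    """Append the CSV to /CSV_Data chunk_rows at a time; return the row count."""
    with open(csv_path, newline="") as src, tb.open_file(h5_path, "w") as h5f:
        reader = csv.reader(src)
        next(reader)  # skip the header row
        table = h5f.create_table("/", "CSV_Data", description=ROW_DTYPE)
        while True:
            # Pull at most chunk_rows lines; an empty list means end of file
            chunk = list(itertools.islice(reader, chunk_rows))
            if not chunk:
                break
            rows = [(r[0].encode(), r[1].encode(), float(r[2]), float(r[3]))
                    for r in chunk]
            table.append(np.array(rows, dtype=ROW_DTYPE))
        table.flush()
        return table.nrows


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    h5_path = os.path.join(tmp, "sample.h5")
    with open(csv_path, "w", newline="") as f:
        f.write("\n".join(SAMPLE_LINES) + "\n")
    # A tiny chunk size here only to exercise the chunking loop
    n_rows = csv_to_table(csv_path, h5_path, chunk_rows=2)
    with tb.open_file(h5_path, "r") as h5f:
        firm1 = h5f.root.CSV_Data.read_where('Firm == b"Firm1"')

print(n_rows)
print(firm1["Date"])
```

Only one chunk of rows is held in memory at a time, so peak memory is set by chunk_rows rather than by the size of the file. Storing Date and Firm as fixed-width byte strings also matches the b"Firm1" form used in the read_where query.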
Example data:
Date,Firm,value1,value2,value3,value4,value5,value6,value7,value8,value9,value10
2019-07-01,Firm1,7.634758e-01,5.781637e-01,8.531480e-01,8.823769e-01,5.780567e-01,3.587480e-01,4.065076e-01,8.520372e-02,3.392133e-01,1.104916e-01
2019-07-01,Firm2,6.457887e-01,6.150677e-01,3.501075e-01,8.886556e-01,5.379832e-01,4.561159e-01,4.773242e-01,7.302280e-01,6.018719e-01,3.835672e-01
2019-07-01,Firm3,3.641129e-01,8.356681e-01,7.783146e-01,1.735361e-01,8.610319e-01,1.360989e-01,5.025533e-01,5.292365e-01,4.964461e-01,7.309130e-01
2019-07-02,Firm1,4.128258e-01,1.339008e-01,3.530394e-02,5.293509e-01,3.608783e-01,6.647519e-01,2.898612e-01,5.632466e-01,5.981161e-01,9.149318e-01
2019-07-02,Firm2,1.037654e-01,3.717925e-01,4.876283e-01,5.852448e-01,4.689806e-01,2.508458e-01,7.243468e-02,3.510882e-01,8.290331e-01,7.808357e-01
2019-07-02,Firm3,8.443163e-01,5.408783e-01,8.278920e-01,8.454836e-01,7.331165e-02,4.167235e-01,6.187155e-01,6.114338e-01,2.299935e-01,5.206390e-01
2019-07-03,Firm1,2.281612e-01,2.660087e-02,3.809895e-01,8.032823e-01,2.492683e-03,9.600432e-02,5.059484e-01,1.795972e-01,2.174838e-01,3.578077e-01
2019-07-03,Firm2,2.403236e-01,1.497736e-01,7.357259e-01,2.501746e-01,2.826287e-01,3.335158e-01,7.742914e-01,1.773830e-01,8.407694e-01,7.466135e-01
2019-07-03,Firm3,8.806318e-01,1.164414e-01,6.791358e-01,4.752967e-01,3.695451e-01,9.728813e-01,3.553896e-01,2.559315e-01,6.942147e-01,2.701471e-01
2019-07-04,Firm1,2.153168e-01,5.169252e-01,5.136280e-01,7.517068e-01,1.977217e-01,7.221689e-01,5.877799e-01,9.099813e-02,9.073012e-03,5.946624e-01
2019-07-04,Firm2,8.275230e-01,9.725115e-01,5.218725e-03,7.728741e-01,4.371698e-01,3.593862e-02,3.448388e-01,7.443235e-01,2.606604e-01,9.888835e-02
2019-07-04,Firm3,8.599242e-01,8.336458e-01,1.451350e-01,9.777518e-02,3.335788e-01,1.117006e-01,9.105203e-01,3.478112e-01,8.948065e-01,3.105299e-01