Skip to content Skip to sidebar Skip to footer

I Want To Convert Very Large Csv Data To Hdf5 In Python

I have a very large csv data. It looks like this. [Date, Firm name, value 1, value 2, ..., value 60] I want to convert this to a hdf5 file. For example, let's say I have two dates

Solution 1:

There are A LOT of ways to approach this problem. Before I show some code, a suggestion: Consider your data schema carefully. It is important. It will affect how easily you access and use the data. For example, your proposed schema makes it easy to access the data for one Firm for one Date. What if you want all the data for one Firm for across a range of dates? Or you want all the data for all firms for one date? Both will require you to manipulate multiple arrays after you access the data.

Although counter intuitive, you may want to store the CSV data as a single Group/Dataset. I will show an example of each in the 2 methods below. Both methods below use np.genfromtxt to read the CSV data. The optional parameter names=True will read the headers from row one in your CSV file if you have them. Omit names= if you don't have a header row and you will get default field names (f1, f2, f3, etc). My sample data is included at the end.

Method 1: using h5py Group Names: Date Dataset Names: Firms

import numpy as np
import h5py

csv_recarr = np.genfromtxt('SO_57120995.csv',delimiter=',',dtype=None, names=True, encoding=None)
print (csv_recarr.dtype)

with h5py.File('SO_57120995.h5','w') as h5f :

    forrowin csv_recarr:   
        date=row[0]
        grp = h5f.require_group(date)

        firm=row[1]
    # convertrow data toget list ofall valuei entries
        row_data=row.item()[2:]
        h5f[date].create_dataset(firm,data=row_data)

Method 2: using PyTables All data stored in Dataset: /CSV_Data

import numpy as np
import tables as tb

csv_recarr = np.genfromtxt('SO_57120995.csv',delimiter=',',dtype=None, names=True, encoding=None)
print (csv_recarr.dtype)

with tb.File('SO_57120995_2.h5','w') as h5f :
    # this should work, but only first string character is loaded:#dset = h5f.create_table('/','CSV_Data',obj=csv_recarr)# create empty table
    dset = h5f.create_table('/','CSV_Data',description=csv_recarr.dtype)

    #workaround to add CSV data one line at a timefor row in csv_recarr:
        append_list=[]
        append_list.append(row.item()[:])
        dset.append(append_list)

# Example to extract array of data based on field name
    firm_arr = dset.read_where('Firm==b"Firm1"')
    print (firm_arr)

Example data:

Date,Firm,value1,value2,value3,value4,value5,value6,value7,value8,value9,value102019-07-01,Firm1,7.634758e-01,5.781637e-01,8.531480e-01,8.823769e-01,5.780567e-01,3.587480e-01,4.065076e-01,8.520372e-02,3.392133e-01,1.104916e-012019-07-01,Firm2,6.457887e-01,6.150677e-01,3.501075e-01,8.886556e-01,5.379832e-01,4.561159e-01,4.773242e-01,7.302280e-01,6.018719e-01,3.835672e-012019-07-01,Firm3,3.641129e-01,8.356681e-01,7.783146e-01,1.735361e-01,8.610319e-01,1.360989e-01,5.025533e-01,5.292365e-01,4.964461e-01,7.309130e-012019-07-02,Firm1,4.128258e-01,1.339008e-01,3.530394e-02,5.293509e-01,3.608783e-01,6.647519e-01,2.898612e-01,5.632466e-01,5.981161e-01,9.149318e-012019-07-02,Firm2,1.037654e-01,3.717925e-01,4.876283e-01,5.852448e-01,4.689806e-01,2.508458e-01,7.243468e-02,3.510882e-01,8.290331e-01,7.808357e-012019-07-02,Firm3,8.443163e-01,5.408783e-01,8.278920e-01,8.454836e-01,7.331165e-02,4.167235e-01,6.187155e-01,6.114338e-01,2.299935e-01,5.206390e-012019-07-03,Firm1,2.281612e-01,2.660087e-02,3.809895e-01,8.032823e-01,2.492683e-03,9.600432e-02,5.059484e-01,1.795972e-01,2.174838e-01,3.578077e-012019-07-03,Firm2,2.403236e-01,1.497736e-01,7.357259e-01,2.501746e-01,2.826287e-01,3.335158e-01,7.742914e-01,1.773830e-01,8.407694e-01,7.466135e-012019-07-03,Firm3,8.806318e-01,1.164414e-01,6.791358e-01,4.752967e-01,3.695451e-01,9.728813e-01,3.553896e-01,2.559315e-01,6.942147e-01,2.701471e-012019-07-04,Firm1,2.153168e-01,5.169252e-01,5.136280e-01,7.517068e-01,1.977217e-01,7.221689e-01,5.877799e-01,9.099813e-02,9.073012e-03,5.946624e-012019-07-04,Firm2,8.275230e-01,9.725115e-01,5.218725e-03,7.728741e-01,4.371698e-01,3.593862e-02,3.448388e-01,7.443235e-01,2.606604e-01,9.888835e-022019-07-04,Firm3,8.599242e-01,8.336458e-01,1.451350e-01,9.777518e-02,3.335788e-01,1.117006e-01,9.105203e-01,3.478112e-01,8.948065e-01,3.105299e-01

Post a Comment for "I Want To Convert Very Large Csv Data To Hdf5 In Python"