Pytables Duplicates 2.5 Giga Rows
I currently have a .h5 file with a table in it consisting of three columns: a text column of 64 chars, a UInt32 column relating to the source of the text, and a UInt32 column whi
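For reference, here is a hedged sketch of what that table description might look like in PyTables; the class name and column names are my own assumptions (the answer below refers to the hash column as 'xhashx'):

import tables as tb

# Hypothetical reconstruction of the schema described above
class TextRow(tb.IsDescription):
    text = tb.StringCol(64)       # 64-character text field
    source = tb.UInt32Col()       # UInt32 identifying the source of the text
    xhashx = tb.UInt32Col()       # UInt32 hash column (assumed to be the 'xhashx' field used below)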
Solution 1:
Your dataset is 5x larger than the referenced solution (2.5e9 vs 500e6 rows). Have you done any testing to identify where the time is spent? The table.itersorted() method may not be linear and might be resource intensive. (I don't have any experience with itersorted.)
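For comparison, here is a rough, untested sketch of what an itersorted()-based pass could look like; the node name 'my_table' and the grouping logic are my assumptions, not part of the question. Note that itersorted() requires a completely sorted index (CSI) on the sort column, and building that index on 2.5e9 rows is itself expensive:

import tables as tb

with tb.open_file('original.h5', mode='a') as h5f:
    table = h5f.root.my_table                  # assumed location of the source table
    if not table.cols.xhashx.is_indexed:
        table.cols.xhashx.create_csindex()     # itersorted() needs a full (CSI) index

    prev_hash = None
    group = []                                 # rows sharing the current hash value
    for row in table.itersorted('xhashx'):
        if group and row['xhashx'] != prev_hash:
            # ... test 'group' for duplicates and write the unique rows ...
            group = []
        group.append(row.fetch_all_fields())
        prev_hash = row['xhashx']
    # the final group still needs the same duplicate test after the loop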
Here is a process that might be faster:
- Extract a NumPy array of the hash field (column 'xhashx')
- Find the unique hash values
- Loop through the unique hash values and extract a NumPy array of rows that match each value
- Do your uniqueness tests against the rows in this extracted array
- Write the unique rows to your new file
Code for this process is below. Note: this has not been tested, so it may have small syntax or logic gaps.
import numpy as np

# 'table' is assumed to be the (already open) source table in your .h5 file

# Step 1: Get a NumPy array of the 'xhashx' field (column) only:
hash_arr = table.read(field='xhashx')

# Step 2: Get a new array with the unique values only:
hash_arr_u = np.unique(hash_arr)

# Alternately, combine the first 2 steps in a single step:
hash_arr_u = np.unique(table.read(field='xhashx'))

# Step 3a: Loop on the unique hash values
for hash_test in hash_arr_u:
    # Step 3b: Get an array with all rows that match this unique hash value
    match_row_arr = table.read_where('xhashx == hash_test')
    # Step 4: Check for rows with unique values.
    # Check the hash row count: if there is only 1 row, a uniqueness test is not required
    if match_row_arr.shape[0] == 1:
        pass  # only one row, so write it to the new table
    else:
        pass  # check for unique rows, then write the unique rows to the new table

##################################################
# np.unique has an option to also return the hash counts;
# these can be used as the test in the loop
hash_arr_u, hash_cnts = np.unique(table.read(field='xhashx'), return_counts=True)

# Loop on the array of unique hash values
for cnt in range(hash_arr_u.shape[0]):
    # Get an array with all rows that match this unique hash value
    hash_test = hash_arr_u[cnt]
    match_row_arr = table.read_where('xhashx == hash_test')
    # Check the hash row count: if there is only 1 row, a uniqueness test is not required
    if hash_cnts[cnt] == 1:
        pass  # only one row, so write it to the new table
    else:
        pass  # check for unique rows, then write the unique rows to the new table
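To show how the placeholders above could be filled in, here is a minimal end-to-end sketch under my own assumptions: the file and node names, and the use of np.unique() on the matched record array to drop exact duplicate rows, are mine and not from the question.

import numpy as np
import tables as tb

with tb.open_file('original.h5', mode='r') as h5in, \
     tb.open_file('deduped.h5', mode='w') as h5out:
    table = h5in.root.my_table                 # assumed location of the source table
    # create an empty output table with the same description as the source
    new_table = h5out.create_table('/', 'my_table', description=table.description)

    hash_arr_u, hash_cnts = np.unique(table.read(field='xhashx'), return_counts=True)
    for hash_val, cnt in zip(hash_arr_u, hash_cnts):
        match_row_arr = table.read_where('xhashx == hash_val',
                                         condvars={'hash_val': hash_val})
        if cnt == 1:
            # only one row with this hash value, so it is unique by definition
            new_table.append(match_row_arr)
        else:
            # np.unique() on the structured array drops exact duplicate records
            new_table.append(np.unique(match_row_arr))
    new_table.flush()

If the repeated read_where() calls turn out to be the bottleneck, creating an index on the 'xhashx' column first (table.cols.xhashx.create_csindex()) may speed up those queries.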