Pytables Duplicates 2.5 Giga Rows
I currently have a .h5 file with a table in it consisting of three columns: a text column of 64 chars, a UInt32 column relating to the source of the text, and a UInt32 column whi
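For reference, here is a hedged sketch of what that table description might look like in PyTables; the class name and column names are my own assumptions (the answer below refers to the hash column as 'xhashx'):

import tables as tb

# Hypothetical reconstruction of the schema described above
class TextRow(tb.IsDescription):
    text = tb.StringCol(64)       # 64-character text field
    source = tb.UInt32Col()       # UInt32 identifying the source of the text
    xhashx = tb.UInt32Col()       # UInt32 hash column (assumed to be the 'xhashx' field used below)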
Solution 1:
Your dataset is 5x larger than the referenced solution (2.5e9 vs 500e6 rows). Have you done any testing to identify where the time is spent? The table.itersorted() method may not be linear and might be resource intensive. (I don't have any experience with itersorted.)
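For comparison, here is a rough, untested sketch of what an itersorted()-based pass could look like; the node name 'my_table' and the grouping logic are my assumptions, not part of the question. Note that itersorted() requires a completely sorted index (CSI) on the sort column, and building that index on 2.5e9 rows is itself expensive:

import tables as tb

with tb.open_file('original.h5', mode='a') as h5f:
    table = h5f.root.my_table                  # assumed location of the source table
    if not table.cols.xhashx.is_indexed:
        table.cols.xhashx.create_csindex()     # itersorted() needs a full (CSI) index

    prev_hash = None
    group = []                                 # rows sharing the current hash value
    for row in table.itersorted('xhashx'):
        if group and row['xhashx'] != prev_hash:
            # ... test 'group' for duplicates and write the unique rows ...
            group = []
        group.append(row.fetch_all_fields())
        prev_hash = row['xhashx']
    # the final group still needs the same duplicate test after the loop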
Here is a process that might be faster:
- Extract a NumPy array of the hash field (column 'xhashx')
- Find the unique hash values
- Loop through the unique hash values and extract a NumPy array of rows that match each value
- Do your uniqueness tests against the rows in this extracted array
- Write the unique rows to your new file
Code for this process is below. Note: this has not been tested, so it may have small syntax or logic gaps.
import numpy as np

# 'table' is assumed to be the (already open) source table in your .h5 file

# Step 1: Get a NumPy array of the 'xhashx' field (column) only:
hash_arr = table.read(field='xhashx')

# Step 2: Get a new array with the unique values only:
hash_arr_u = np.unique(hash_arr)

# Alternately, combine the first 2 steps in a single step:
hash_arr_u = np.unique(table.read(field='xhashx'))

# Step 3a: Loop on the unique hash values
for hash_test in hash_arr_u:
    # Step 3b: Get an array with all rows that match this unique hash value
    match_row_arr = table.read_where('xhashx == hash_test')
    # Step 4: Check for rows with unique values.
    # Check the hash row count: if there is only 1 row, a uniqueness test is not required
    if match_row_arr.shape[0] == 1:
        pass  # only one row, so write it to the new table
    else:
        pass  # check for unique rows, then write the unique rows to the new table

##################################################
# np.unique has an option to also return the hash counts;
# these can be used as the test in the loop
hash_arr_u, hash_cnts = np.unique(table.read(field='xhashx'), return_counts=True)

# Loop on the array of unique hash values
for cnt in range(hash_arr_u.shape[0]):
    # Get an array with all rows that match this unique hash value
    hash_test = hash_arr_u[cnt]
    match_row_arr = table.read_where('xhashx == hash_test')
    # Check the hash row count: if there is only 1 row, a uniqueness test is not required
    if hash_cnts[cnt] == 1:
        pass  # only one row, so write it to the new table
    else:
        pass  # check for unique rows, then write the unique rows to the new table
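To show how the placeholders above could be filled in, here is a minimal end-to-end sketch under my own assumptions: the file and node names, and the use of np.unique() on the matched record array to drop exact duplicate rows, are mine and not from the question.

import numpy as np
import tables as tb

with tb.open_file('original.h5', mode='r') as h5in, \
     tb.open_file('deduped.h5', mode='w') as h5out:
    table = h5in.root.my_table                 # assumed location of the source table
    # create an empty output table with the same description as the source
    new_table = h5out.create_table('/', 'my_table', description=table.description)

    hash_arr_u, hash_cnts = np.unique(table.read(field='xhashx'), return_counts=True)
    for hash_val, cnt in zip(hash_arr_u, hash_cnts):
        match_row_arr = table.read_where('xhashx == hash_val',
                                         condvars={'hash_val': hash_val})
        if cnt == 1:
            # only one row with this hash value, so it is unique by definition
            new_table.append(match_row_arr)
        else:
            # np.unique() on the structured array drops exact duplicate records
            new_table.append(np.unique(match_row_arr))
    new_table.flush()

If the repeated read_where() calls turn out to be the bottleneck, creating an index on the 'xhashx' column first (table.cols.xhashx.create_csindex()) may speed up those queries.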