
Efficiently Identify Duplicates In Large List (500,000+)

I have a large list of DOIs and I need the most efficient way to identify the DOIs which are repeated (i.e. print out the index and the DOI for every value that is repeated).

Solution 1:

Try storing the values you've already seen in a set instead. Membership tests on a set are effectively constant time, so the loop stays fast even for 500,000+ items, and you can append the indexes of any duplicates to a list as you go:

seen = set()   # every distinct DOI encountered so far
dupes = []     # indexes of repeated occurrences

for i, doi in enumerate(doiList):
    if doi not in seen:
        seen.add(doi)      # first occurrence: remember it
    else:
        dupes.append(i)    # repeat: record its index

At this point, seen contains all the distinct DOI values, while dupes contains the indexes of the 2nd, 3rd, etc. occurrences of each duplicated value. You can look those indexes up in doiList to see which DOI each one corresponds to.
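Since the goal is to print both the index and the DOI, a small follow-up sketch (assuming doiList is your list of DOI strings) could report the duplicates directly:

for i in dupes:
    print(i, doiList[i])  # index and DOI of every repeated occurrence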

To squeeze out a bit more performance, you can cache the bound methods so the attribute lookups happen once instead of on every iteration:

seen = set()
seen_add = seen.add
dupes = []
dupes_append = dupes.append

for i, doi in enumerate(doiList):
    if doi not in seen:
        seen_add(doi)
    else:
        dupes_append(i)
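Whether the cached version actually helps depends on your Python version and data, so it is worth measuring rather than assuming. A rough timing sketch with timeit (the generated doiList below is hypothetical test data, not your real list) might look like:

import random
import timeit

# hypothetical test data: 500,000 DOI-like strings with some repeats
doiList = [f"10.1016/j.example.{random.randrange(400_000)}" for _ in range(500_000)]

def plain_loop(data):
    seen = set()
    dupes = []
    for i, doi in enumerate(data):
        if doi not in seen:
            seen.add(doi)
        else:
            dupes.append(i)
    return dupes

def cached_loop(data):
    seen = set()
    seen_add = seen.add
    dupes = []
    dupes_append = dupes.append
    for i, doi in enumerate(data):
        if doi not in seen:
            seen_add(doi)
        else:
            dupes_append(i)
    return dupes

print("plain :", timeit.timeit(lambda: plain_loop(doiList), number=10))
print("cached:", timeit.timeit(lambda: cached_loop(doiList), number=10))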

Solution 2:

Here's a full solution that returns the same data set as your example, just more than twice as fast (at the expense of some memory):

def identify_duplicates(data):
    lookup = {}  # store our quick lookup here
    result = {}  # store for our final result

    for i, v in enumerate(data):
        if v in lookup:  # already in the lookup table, so it's a duplicate
            if v not in result:  # add it to the result set
                result[v] = lookup[v]
            lookup[v][1] += 1  # increase the duplicate count
        else:
            lookup[v] = [i, 0]  # default state for non-duplicates

    return result

print(identify_duplicates(doiList))
# prints: {'10.1016/j.ijnurstu.2017.05.011 [doi]': [0, 1]}

The stored index is the first occurrence of the found duplicate, as in your example. If you want to store all the duplicate indexes, you can add lookup[v].append(i) after the lookup[v][1] += 1 line, but then the data looks a bit odd (the structure would be [first_index, duplicate_count, second_index, third_index, ...]).

Instead, just flip the stored parameters in the lookup[v] modifications: use lookup[v] = [0, i] instead of lookup[v] = [i, 0] and lookup[v][0] += 1 instead of lookup[v][1] += 1. Adding lookup[v].append(i) after that then gives you a nicer result in the form [duplicate_count, first_index, second_index, third_index, ...], as sketched below.
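Putting those tweaks together, a minimal sketch of the modified function (the name identify_duplicates_with_indexes is just illustrative) could look like:

def identify_duplicates_with_indexes(data):
    lookup = {}
    result = {}

    for i, v in enumerate(data):
        if v in lookup:  # already seen, so it's a duplicate
            if v not in result:
                result[v] = lookup[v]
            lookup[v][0] += 1    # duplicate count now comes first
            lookup[v].append(i)  # record this repeat's index
        else:
            lookup[v] = [0, i]   # [duplicate_count, first_index]

    return result

print(identify_duplicates_with_indexes(doiList))
# each value has the form [duplicate_count, first_index, second_index, ...]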
