Efficiently Identify Duplicates In Large List (500,000+)
Solution 1:
Try storing the values you've already seen in a set instead, and append the duplicates to a single list; this should speed things up:
seen = set()
dupes = []

for i, doi in enumerate(doiList):
    if doi not in seen:
        seen.add(doi)
    else:
        dupes.append(i)
At this point, seen contains all the distinct doi values, while dupes contains the indexes of the 2nd, 3rd, etc. occurrences of each duplicate value. You can look those indexes up in doiList to determine which index corresponds to which value.
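For example, a minimal sketch of that lookup step, using a small made-up doiList (the data below is purely illustrative, not from the original question):

# Hypothetical sample data for illustration
doiList = [
    '10.1000/a', '10.1000/b', '10.1000/a',
    '10.1000/c', '10.1000/b', '10.1000/a',
]

seen = set()
dupes = []
for i, doi in enumerate(doiList):
    if doi not in seen:
        seen.add(doi)
    else:
        dupes.append(i)

# Map each duplicate index back to its value
for i in dupes:
    print(i, doiList[i])
# 2 10.1000/a
# 4 10.1000/b
# 5 10.1000/a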
To get some more performance out of this, you can cache the bound methods (seen.add and dupes.append) so the attribute lookup isn't repeated on every iteration:
seen = set()
seen_add = seen.add
dupes = []
dupes_append = dupes.append

for i, doi in enumerate(doiList):
    if doi not in seen:
        seen_add(doi)
    else:
        dupes_append(i)
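If you want to check how much the method caching actually buys you, a rough sketch using timeit and synthetic data (the list below is made up for illustration; nothing here comes from the original question) could look like this:

import random
import timeit

# Synthetic data: 500,000 strings with plenty of duplicates (assumption for illustration)
doiList = ['10.1000/%d' % random.randrange(100000) for _ in range(500000)]

def find_dupes_plain(data):
    seen = set()
    dupes = []
    for i, doi in enumerate(data):
        if doi not in seen:
            seen.add(doi)
        else:
            dupes.append(i)
    return dupes

def find_dupes_cached(data):
    seen = set()
    seen_add = seen.add
    dupes = []
    dupes_append = dupes.append
    for i, doi in enumerate(data):
        if doi not in seen:
            seen_add(doi)
        else:
            dupes_append(i)
    return dupes

print(timeit.timeit(lambda: find_dupes_plain(doiList), number=10))
print(timeit.timeit(lambda: find_dupes_cached(doiList), number=10))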
Solution 2:
Here's a full solution that returns the same data set as your example, just over twice as fast (at the expense of some memory):
def identify_duplicates(data):
    lookup = {}  # store our quick lookup here
    result = {}  # store our final result here

    for i, v in enumerate(data):
        if v in lookup:  # if it's already in the lookup table, it's a duplicate
            if v not in result:  # add it to the result set
                result[v] = lookup[v]
            lookup[v][1] += 1  # increase the duplicate count
        else:
            lookup[v] = [i, 0]  # default state for non-duplicates

    return result
print(identify_duplicates(doiList))
# prints: {'10.1016/j.ijnurstu.2017.05.011 [doi]': [0, 1]}
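For a clearer picture of the structure it returns, here is the same function applied to a small made-up list (purely illustrative):

sample = ['a', 'b', 'a', 'c', 'b', 'a']
print(identify_duplicates(sample))
# prints: {'a': [0, 2], 'b': [1, 1]}  -- each value maps to [first_index, duplicate_count]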
The stored index is the first occurrence of the found duplicate, as in your example. If you want to store all the duplicate indexes, you can add lookup[v].append(i) after the lookup[v][1] += 1 line, but then the data might look weird (the structure would be [first_index, number_of_occurrences, second_index, third_index...]).

Instead, flip the stored parameters in the lookup[v] assignments - use lookup[v] = [0, i] instead of lookup[v] = [i, 0], and lookup[v][0] += 1 instead of lookup[v][1] += 1 - then adding lookup[v].append(i) after it will give you a nicer result in the form of [number_of_occurrences, first_index, second_index, third_index...].
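Putting those modifications together, a sketch of that variant (the function name here is made up) could look like this:

def identify_duplicates_all_indexes(data):
    lookup = {}  # value -> [duplicate_count, first_index, second_index, ...]
    result = {}

    for i, v in enumerate(data):
        if v in lookup:          # already seen, so it's a duplicate
            if v not in result:  # add it to the result set
                result[v] = lookup[v]
            lookup[v][0] += 1    # increase the duplicate count
            lookup[v].append(i)  # record this occurrence's index
        else:
            lookup[v] = [0, i]   # [duplicate_count, first_index]

    return result

print(identify_duplicates_all_indexes(['a', 'b', 'a', 'c', 'b', 'a']))
# prints: {'a': [2, 0, 2, 5], 'b': [1, 1, 4]}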