
How To Remove Duplicate Lines From A Text File And The Unique Related To This Duplicate

How might I remove duplicate lines from a file, and also the unique line related to each duplicate?

Example input file:

line 1 : Messi , 1
line 2 : Messi , 2
line 3 : CR7

Solution 1:

Have you tried collections.Counter? For example, this works:

import collections

a = [1, 1, 2]
out = [k for k, v in collections.Counter(a).items() if v == 1]
print(out)

Output: [2]

Or with a longer example:

import collections

a = [1, 1, 1, 2, 4, 4, 4, 5, 3]
out = [k for k, v in collections.Counter(a).items() if v == 1]
print(out)

Output: [2, 5, 3]

EDIT:

Since you start with a file rather than a list, there are two ways to go, depending on the file size: use the first for small enough files (otherwise you might run into memory problems), and the second for large files.

Read file as list and use previous answer:

import collections

lines = [line for line in open(infilename)]
out = [k for k, v in collections.Counter(lines).items() if v == 1]
with open(outfilename, 'w') as outfile:
    for o in out:
        outfile.write(o)

The first line reads your file completely into a list, which means a really large file would be loaded into memory. If your files are too large for that, you can instead use a sort of "blacklist":

Using blacklist:

lines_seen = set()  # holds lines already seen
blacklist = set()
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    if line not in lines_seen and line not in blacklist:  # not a duplicate
        lines_seen.add(line)
    else:
        lines_seen.discard(line)
        blacklist.add(line)
for l in lines_seen:
    outfile.write(l)
outfile.close()

Here you add every line to the set and only write the set to the file at the end. The blacklist remembers all lines that occurred more than once, so those lines are not written even once. You can't read and write in one go, since you don't know whether the same line will come up again later. If you have further information (for example, that duplicate lines always come consecutively), you could do it differently.
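For that consecutive-duplicates case, a single streaming pass with itertools.groupby would be enough, since equal neighbouring lines form one group. This is only a sketch under that assumption; the function name and the in_path/out_path parameters are placeholders, not from the original question:

```python
import itertools

def write_unique_consecutive(in_path, out_path):
    # Assumes duplicate lines always appear consecutively (e.g. a sorted file).
    # groupby collects runs of equal lines; only runs of length 1 are written,
    # so duplicates are dropped together with their first occurrence --
    # in one streaming pass, with O(1) extra memory.
    with open(in_path) as infile, open(out_path, "w") as outfile:
        for _, group in itertools.groupby(infile):
            first = next(group)
            if next(group, None) is None:  # the run had exactly one line
                outfile.write(first)
```

This avoids holding any set of seen lines in memory, at the price of requiring the input to be grouped already.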

EDIT 2

If you want to do it depending on the first part of each line:

firsts_seen = set()
lines_seen = {}  # maps first part -> its line, so a duplicate's earlier line can be removed
blacklist = set()
outfile = open(outfilename, "w")
for line in open(infilename, "r"):
    first = line.split(',')[0]
    if first not in firsts_seen and first not in blacklist:  # not a duplicate
        lines_seen[first] = line
        firsts_seen.add(first)
    else:
        lines_seen.pop(first, None)  # drop the earlier line with the same first part
        firsts_seen.discard(first)
        blacklist.add(first)
print(len(lines_seen))
for l in lines_seen.values():
    outfile.write(l)
outfile.close()

P.S.: So far I have just been adding code; there might be a better way.

For example with a dict:

lines_dict = {}
for line in open(infilename, 'r'):
    first = line.split(',')[0]
    if first not in lines_dict:
        lines_dict[first] = [line]
    else:
        lines_dict[first].append(line)
with open(outfilename, 'w') as outfile:
    for key, value in lines_dict.items():
        if len(value) == 1:
            outfile.write(value[0])
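The same dict-of-lists idea can be written a bit more compactly with collections.defaultdict, which makes the membership test unnecessary. A sketch, with the function name and filename parameters as hypothetical placeholders:

```python
import collections

def write_unique_by_first(infilename, outfilename):
    # Group lines by the part before the first comma.
    groups = collections.defaultdict(list)
    with open(infilename) as infile:
        for line in infile:
            groups[line.split(',')[0]].append(line)
    # Write back only the groups that contain exactly one line.
    with open(outfilename, 'w') as outfile:
        for lines in groups.values():
            if len(lines) == 1:
                outfile.write(lines[0])
```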

Solution 2:

Given your input you can do something like this:

seen = {}  # key maps to the line it first appeared in
double_seen = set()

with open('input.txt') as f:
    for line in f:
        _, key = line.split(':')
        key = key.strip()
        if key not in seen:  # have not seen this yet?
            seen[key] = line  # then add it to the dictionary
        else:
            double_seen.add(key)  # else we have seen this more than once

# Now we can just write back to a different file
with open('output.txt', 'w') as f2:
    for key in set(seen.keys()) - double_seen:
        f2.write(seen[key])

Input I used:

line 1 : Messi
line 2 : Messi
line 3 : CR7

Output:

line 3 : CR7

Note this solution assumes Python 3.7+, since it relies on dictionaries being in insertion order. (Strictly speaking, the final loop iterates over a set, so the write order is still not guaranteed; iterate over seen itself if you need to preserve input order.)
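An order-preserving variant of the same two-pass idea would iterate over seen itself in the write loop, which keeps insertion order on Python 3.7+, instead of over a set. A self-contained sketch, with the function name and path parameters as hypothetical placeholders:

```python
def write_unique_by_key(in_path, out_path):
    seen = {}          # key -> the line it first appeared in
    double_seen = set()
    with open(in_path) as f:
        for line in f:
            _, key = line.split(':')   # assumes exactly one ':' per line
            key = key.strip()
            if key not in seen:
                seen[key] = line
            else:
                double_seen.add(key)
    with open(out_path, 'w') as f2:
        for key, line in seen.items():  # dict preserves input order (3.7+)
            if key not in double_seen:
                f2.write(line)
```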
