
How To Filter Overlap Rows In A Big File In Python

I am trying to filter overlapping rows in a big file in Python. The overlap threshold is set to 25%. In other words, the number of elements in the intersection between any two rows is less than

Solution 1:

The tricky part is that you have to modify the list you're iterating over and still keep track of two indices. One way to do that is to go backwards, since deleting an item with index equal to or larger than the indices you keep track of will not influence them.
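To see why walking backwards keeps both indices valid, here is a minimal sketch on a plain list. The function name `drop_later_duplicates` and the equality test are illustrative assumptions, not part of the question; the test simply stands in for the overlap check.

```python
def drop_later_duplicates(items):
    """Delete every later item that equals an earlier one."""
    items = list(items)  # work on a copy so the caller's list is untouched
    # Both loops run from high indices to low ones. A deletion at
    # `second` only shifts items above `second`, which were already visited.
    for first in range(len(items) - 2, -1, -1):
        for second in range(len(items) - 1, first, -1):
            if items[second] == items[first]:
                del items[second]
    return items


print(drop_later_duplicates(["a", "b", "a", "c", "b"]))
```

Because `first` is always smaller than `second`, no deletion ever moves the item that `first` points at.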

The code below is untested, but you get the idea:

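with open("file.txt") as fileobj:
    # one set of elements per line; blank lines are skipped so that
    # the union below is never empty (which would divide by zero)
    sets = [set(line.split()) for line in fileobj if line.strip()]

# Walk both indices backwards so that deleting sets[second_index]
# never shifts an index that still has to be visited.
for first_index in range(len(sets) - 2, -1, -1):
    for second_index in range(len(sets) - 1, first_index, -1):
        union = sets[first_index] | sets[second_index]
        intersection = sets[first_index] & sets[second_index]
        # overlap is measured as |intersection| / |union|
        if len(intersection) / float(len(union)) > 0.25:
            del sets[second_index]

with open("output.txt", "w") as fileobj:
    for set_ in sets:
        # order of the set is undefined, so we need to sort each set;
        # the key assumes each element is one character followed by an integer
        output = " ".join(sorted(set_, key=lambda x: int(x[1:])))
        fileobj.write("{0}\n".format(output))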

Since it's obvious how to sort the elements of each line, we can regenerate each line from its set like this. If the order were somehow custom, we'd have to store the original line alongside each set, so that at the end we could write back exactly the line that was read instead of regenerating it.
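One way to keep the original lines might look like the sketch below. It pairs every line with its set and filters the pairs, so no sorting is needed when writing back. The function name `filter_overlapping` and the `threshold` parameter are assumptions made for illustration, and the overlap measure is the same intersection-over-union ratio used in the code above.

```python
def filter_overlapping(lines, threshold=0.25):
    """Return original lines such that no two remaining lines
    overlap by more than `threshold` (intersection / union)."""
    # Keep each original line next to the set built from it.
    # Blank lines are skipped so the union is never empty.
    pairs = [(line, set(line.split())) for line in lines if line.strip()]
    for first in range(len(pairs) - 2, -1, -1):
        for second in range(len(pairs) - 1, first, -1):
            a, b = pairs[first][1], pairs[second][1]
            if len(a & b) / float(len(a | b)) > threshold:
                del pairs[second]
    # Return the lines exactly as they were read, in their original order.
    return [line for line, _ in pairs]


sample = ["a1 a2 a3 a4\n", "a4 a3 a2 a9\n", "a7 a8\n"]
print(filter_overlapping(sample))
```

With a file, the same function could be fed `fileobj` directly and its result passed to `writelines`, since the returned lines still carry their trailing newlines.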
