What Is The Best Way To Sort A Sequence In Python?

I am trying to sort the table based on certain conditions that need to happen in a row. Simplified version of a table: Number Time 1 23 2 45 3 67 4 23 5

Solution 1:

You don't have to sort the data at all. A simple solution might be:

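import csv
from itertools import islice

def row_grouper(reader, size=5):
    # Yield each run of `size` consecutive rows (a sliding window).
    iterrows = iter(reader)
    current = list(islice(iterrows, size))
    if len(current) < size:
        return  # fewer than `size` rows: no group at all
    for next_row in iterrows:
        yield current
        current.pop(0)
        current.append(next_row)
    yield current  # last group; the loop ends before yielding it


# filename is the path to your CSV file; rows are assumed to hold only data.
with open(filename, newline='') as datafile:
    reader = csv.reader(datafile)
    for i, row_group in enumerate(row_grouper(reader)):
        if all(float(row[1]) < 40 for row in row_group):
            print(i, i + 5)  # i is the index of the first row in the group
            break  # stop processing other rows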

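The row_grouper function is a generator that yields 5-element lists of consecutive rows. Each time it moves on, it removes the first row of the group and appends the next row at the end. Note that it reuses the same list object for every group, so copy a group if you need to keep it around.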


Instead of a plain list you can use a deque and replace the pop(0) in row_grouper with a popleft() call which is more efficient, although this doesn't matter much if the list has only 5 elements.

Alternatively, you can follow martineau's suggestion and create the deque with the maxlen keyword argument, which avoids popping altogether: appending to a full deque discards the oldest item automatically. This is about twice as fast as using a deque's popleft, which is about twice as fast as using the list's pop(0).
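As a rough sketch of the maxlen variant (the name row_grouper_deque and the size parameter are made up for illustration, they are not part of the answer above): the deque keeps only the last size rows, so no explicit pop is needed. Each group is yielded as a tuple copy so that groups already handed out do not change when the window moves on.

```python
from collections import deque


def row_grouper_deque(rows, size=5):
    """Yield every run of `size` consecutive rows as a tuple.

    Hypothetical variant of row_grouper; `size` is an assumed parameter.
    """
    window = deque(maxlen=size)  # a full deque drops its oldest item on append
    for row in rows:
        window.append(row)
        if len(window) == size:
            yield tuple(window)  # copy, so later appends don't alter it


groups = list(row_grouper_deque(range(1, 7), size=3))
print(groups)
```

With fewer than size rows the generator simply yields nothing.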


Edit: To check more than one condition, you can use more than one row_grouper (each taking its group size as a second argument) and use itertools.tee to obtain copies of the iterable.

For example:

import itertools as it

def check_condition(group, row_index, limit, found):
    # A group is None once its grouper is exhausted (see zip_longest below).
    if group is None or found:
        return False
    return all(float(row[row_index]) < limit for row in group)


f_iter, s_iter, t_iter = it.tee(iter(reader), 3)

groups = row_grouper(f_iter, 10), row_grouper(s_iter, 5), row_grouper(t_iter, 25)

found_first = found_second = found_third = False
# zip_longest is called izip_longest in Python 2.
for index, (first, second, third) in enumerate(it.zip_longest(*groups)):
    if check_condition(first, 1, 40, found_first):
        # stuff
        found_first = True
    if check_condition(second, 3, 40, found_second):
        # stuff
        found_second = True
    if check_condition(third, 3, 40, found_third):
        # stuff
        found_third = True
    if found_first and found_second and found_third:
        # stop the code if we matched all the conditions once.
        break

The first part simply imports itertools (and assigns it the alias "it" to avoid typing itertools every time).

I've defined the check_condition function because the conditions are getting more complicated and you don't want to repeat them over and over. As you can see, the last line of check_condition is the same as the condition used before: it checks whether the current "row group" satisfies the property. Since we plan to iterate over the file only once, we cannot stop the loop when only one condition is met (we'd miss the other conditions), so we need one flag per condition that tells us whether that condition (e.g. the one on time) was already met. As you can see in the for loop, we break out of the loop once all the conditions have been met.

Now, the line:

f_iter, s_iter, t_iter = it.tee(iter(reader), 3)

Creates an iterable over the rows of reader and makes 3 copies of it. This means that the loop:

for row in f_iter:
    print(row)

Will print all the rows of the file, just like doing for row in reader. Note however that itertools.tee allows us to obtain copies of the rows without reading the file more than once.
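To illustrate that last point, here is a small sketch with an in-memory stand-in for the file (counting_source and the row values are invented for the demonstration). It records every time the underlying source is advanced, and shows that both tee copies see all the rows while the source is read only once. Keep in mind that after calling tee you should only use the copies, not the original iterator.

```python
import itertools


def counting_source(items, log):
    """Yield items, recording each time the source itself is advanced."""
    for item in items:
        log.append(item)
        yield item


reads = []
source = counting_source(["row1", "row2", "row3"], reads)
first, second = itertools.tee(source, 2)

rows_a = list(first)   # pulls every item from the source; tee buffers them
rows_b = list(second)  # served from tee's buffer, the source is not re-read

print(rows_a == rows_b)  # both copies see the same rows
print(reads)             # each row was read from the source only once
```

The price is memory: tee has to buffer every item that one copy has consumed and another has not yet reached.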

Afterwards, we must pass these rows to the row_grouper in order to verify the conditions:

groups = row_grouper(f_iter, 10), row_grouper(s_iter, 5), row_grouper(t_iter, 25)

Finally, we have to loop over the "row groups". To do this simultaneously we use itertools.zip_longest (named itertools.izip_longest in Python 2). It works like zip, which pairs up elements (e.g. zip([1, 2, 3], ["a", "b", "c"]) -> [(1, "a"), (2, "b"), (3, "c")]). The difference is that zip stops at the shortest iterable, whereas zip_longest pads the shorter iterables with None. This ensures that we check the conditions on all the possible groups (and that's also why check_condition has to check whether group is None).
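A short comparison with made-up inputs of different lengths shows the padding. The same thing happens with the row groupers: a grouper with a larger group size produces fewer groups, so it runs out first.

```python
from itertools import zip_longest

short = [1, 2]
long_ = ["a", "b", "c", "d"]

zipped = list(zip(short, long_))
padded = list(zip_longest(short, long_))

print(zipped)  # stops at the shorter input
print(padded)  # shorter input is padded with None
```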

To obtain the current row index we wrap everything in enumerate, just like before. The code inside the for loop is pretty simple: you check each condition using check_condition and, if it is met, you do what you have to do and set the flag for that condition (so that check_condition returns False for it in all later iterations).
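Putting the pieces together, the following is a self-contained sketch on a few invented rows, not the tested answer to the original file. The helper names windows and all_below, the column numbers, group sizes and limits are all assumptions chosen to keep the example small, and a dict that records the matching index stands in for the boolean flags.

```python
from collections import deque
from itertools import tee, zip_longest


def windows(rows, size):
    """Yield each run of `size` consecutive rows (hypothetical helper)."""
    window = deque(maxlen=size)
    for row in rows:
        window.append(row)
        if len(window) == size:
            yield tuple(window)


def all_below(group, column, limit):
    """True if the group exists and every row has row[column] < limit."""
    return group is not None and all(float(row[column]) < limit for row in group)


# Made-up data: (number, time, other) as strings, like csv.reader would give.
rows = [("1", "50", "9"), ("2", "30", "9"), ("3", "20", "5"),
        ("4", "10", "4"), ("5", "60", "3"), ("6", "70", "2")]

a_iter, b_iter = tee(rows, 2)
first_hit = {}  # condition name -> index of the first row of the matching group
for index, (a, b) in enumerate(zip_longest(windows(a_iter, 3), windows(b_iter, 2))):
    if "time" not in first_hit and all_below(a, 1, 40):
        first_hit["time"] = index
    if "other" not in first_hit and all_below(b, 2, 6):
        first_hit["other"] = index
    if len(first_hit) == 2:
        break  # every condition matched once

print(first_hit)
```

Here the first 3-row group with every time below 40 starts at index 1, and the first 2-row group with every "other" value below 6 starts at index 2.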

(Note: I must say I did not test the code. I'll test it when I have a bit of time, anyway I hope I gave you some ideas. And check out the documentation for itertools).

Solution 2:

You don't really need to sort your data, just keep track of whether the condition you're looking for has occurred in the last N rows of data. Fixed-size collections.deques are good for this sort of thing.

import csv
from collections import deque

filename = 'table.csv'
GROUP_SIZE = 5
THRESHOLD = 40
cond_deque = deque(maxlen=GROUP_SIZE)  # results of the test for the last GROUP_SIZE rows

with open(filename, newline='') as datafile:
    reader = csv.reader(datafile)  # assume delimiter=','
    next(reader)  # skip header row
    for linenum, row in enumerate(reader, start=1):  # process rows of file
        col0, col1, col4, col5, col6, col23, col24, col25 = (
            float(row[i]) for i in (0, 1, 4, 5, 6, 23, 24, 25))
        cond_deque.append(col1 < THRESHOLD)
        if cond_deque.count(True) == GROUP_SIZE:
            print('lines {}-{} had {} consecutive rows with col1 < {}'.format(
                linenum-GROUP_SIZE+1, linenum, GROUP_SIZE, THRESHOLD))
            break  # found, so stop looking
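The same idea can be tried without a file. The sketch below wraps it in a function that works on any iterable of numbers; the function name first_run_below and the sample values are invented for illustration.

```python
from collections import deque


def first_run_below(values, group_size, threshold):
    """Return (start, end) 1-based positions of the first run of `group_size`
    consecutive values below `threshold`, or None if there is no such run."""
    recent = deque(maxlen=group_size)  # outcome of the test for the last N values
    for position, value in enumerate(values, start=1):
        recent.append(value < threshold)
        if recent.count(True) == group_size:
            return position - group_size + 1, position
    return None


times = [23, 45, 12, 30, 39, 41, 5]  # made-up sample values
print(first_run_below(times, 3, 40))
```

Because the deque only ever holds group_size booleans, memory use stays constant however long the input is.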
