What Is The Best Way To Sort A Sequence In Python?
Solution 1:
You don't have to sort the data at all. A simple solution might be:
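import csv
from itertools import islice

def row_grouper(reader, size=5):
    iterrows = iter(reader)
    current = list(islice(iterrows, size))  # the first group of rows
    if len(current) < size:
        return  # not enough rows for even one group
    yield current
    for next_row in iterrows:
        current.pop(0)
        current.append(next_row)
        yield current  # the same list object every time; copy it if you store it

reader = csv.reader(open(filename))
for i, row_group in enumerate(row_grouper(reader)):
    if all(float(row[1]) < 40 for row in row_group):
        print(i, i + 5)  # i is the index of the first row in the group.
        break  # stop processing other rows.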
The row_grouper function is a generator that yields 5-element lists of consecutive rows. On each step it removes the first row of the group and appends the new row at the end.
Instead of a plain list you can use a deque and replace the pop(0) in row_grouper with a popleft() call, which is more efficient, although this doesn't matter much when the list has only 5 elements.
Alternatively, you can follow martineau's suggestion and pass the maxlen keyword argument to deque, which avoids popping altogether. This is about twice as fast as using a deque's popleft, which is about twice as fast as using the list's pop(0).
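To make the maxlen idea concrete, here is a minimal sketch of how the grouper might look with a bounded deque. The name row_grouper_deque and the size parameter are my own for illustration, and each window is copied into a list before being yielded, which is a design choice rather than something the suggestion requires.

```python
from collections import deque
from itertools import islice


def row_grouper_deque(rows, size=5):
    """Yield each run of `size` consecutive rows as a list."""
    iterrows = iter(rows)
    # Fill the first window; maxlen makes the deque drop its oldest
    # item automatically on every later append.
    window = deque(islice(iterrows, size), maxlen=size)
    if len(window) < size:
        return  # fewer than `size` rows: nothing to yield
    yield list(window)
    for row in iterrows:
        window.append(row)  # no explicit pop needed
        yield list(window)


# Stand-in data instead of a csv.reader.
windows = list(row_grouper_deque(range(6), size=3))
```

Copying with list(window) costs a little on every step; if you only inspect each group inside the loop, you could yield the deque itself instead.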
Edit: To check more than one condition, you can use more than one row_grouper (each called with its own group size as a second argument) and use itertools.tee to obtain copies of the iterator.
For example:
import itertools as it

def check_condition(group, row_index, limit, found):
    if group is None or found:
        return False
    return all(float(row[row_index]) < limit for row in group)

f_iter, s_iter, t_iter = it.tee(iter(reader), 3)
groups = row_grouper(f_iter, 10), row_grouper(s_iter, 5), row_grouper(t_iter, 25)
found_first = found_second = found_third = False
# zip_longest is called izip_longest in Python 2
for index, (first, second, third) in enumerate(it.zip_longest(*groups)):
    if check_condition(first, 1, 40, found_first):
        # stuff
        found_first = True
    if check_condition(second, 3, 40, found_second):
        # stuff
        found_second = True
    if check_condition(third, 3, 40, found_third):
        # stuff
        found_third = True
    if found_first and found_second and found_third:
        # stop the code if we matched all the conditions once.
        break
The first part simply imports itertools (and assigns the alias it to avoid typing itertools every time).
I've defined the check_condition function because the conditions are getting more complicated and you don't want to repeat them over and over. As you can see, the last line of check_condition is the same as the condition before: it checks whether the current "row group" satisfies the property. Since we plan to iterate over the file only once, and we cannot stop the loop as soon as one condition is met (we'd miss the other conditions), we need a flag for each condition (e.g. the one on time) that tells us whether it was already met. As you can see in the for loop, we break out of the loop once all the conditions have been met.
Now, the line:
f_iter, s_iter, t_iter = it.tee(iter(reader), 3)
creates an iterator over the rows of reader and makes 3 independent copies of it.
This means that the loop:
for row in f_iter:
    print(row)
will print all the rows of the file, just like for row in reader.
Note, however, that itertools.tee lets us obtain copies of the rows without reading the file more than once.
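As a quick illustration of that point, the sketch below runs tee on a small in-memory iterator standing in for the csv reader (the sample rows are made up). Both copies see every row even though the source is consumed only once.

```python
import itertools as it

# Stand-in for csv.reader(...): a one-shot iterator over rows.
source = iter([["a", "1"], ["b", "2"], ["c", "3"]])

first, second = it.tee(source, 2)

first_rows = list(first)    # consumes the source once
second_rows = list(second)  # served from tee's internal buffer

# The original iterator is now exhausted.
leftover = next(source, None)
```

Keep in mind that tee buffers whatever one copy has consumed and another has not, so memory use grows if the copies drift far apart. In the loop above they advance in lockstep, which keeps that buffer small.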
Afterwards, we must pass these rows to row_grouper in order to verify the conditions:
groups = row_grouper(f_iter, 10), row_grouper(s_iter, 5), row_grouper(t_iter, 25)
Finally, we have to loop over the "row groups". To do this simultaneously we use itertools.zip_longest (called itertools.izip_longest, with a leading i, in Python 2).
It works just like zip, which pairs up elements (e.g. list(zip([1, 2, 3], ["a", "b", "c"])) gives [(1, "a"), (2, "b"), (3, "c")]). The difference is that zip_longest pads the shorter iterables with None. This ensures that we check the conditions on all the possible groups (and that's also why check_condition has to check whether group is None).
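The padding behaviour is easy to see on two short lists of different lengths. This is only a toy comparison with made-up values, written for Python 3.

```python
from itertools import zip_longest

short = [1, 2]
longer = ["a", "b", "c"]

# zip stops as soon as the shortest input runs out.
zipped = list(zip(short, longer))

# zip_longest keeps going and fills the gaps with None.
padded = list(zip_longest(short, longer))
```

Here zipped has two pairs, while padded has a third pair whose first element is None. That None is exactly what check_condition guards against.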
To obtain the current row index we wrap everything in enumerate, just like before.
Inside the for loop the code is pretty simple: you check each condition using check_condition and, if it is met, you do what you have to do and set the flag for that condition (so that in the following iterations check_condition always returns False for it).
(Note: I have not tested this code. I'll test it when I have a bit of time; in the meantime I hope it gives you some ideas. Also check out the documentation for itertools.)
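For what it's worth, here is a small self-contained sketch of the multi-condition idea, run on made-up in-memory rows instead of a CSV file. The first_matches helper, the (name, size, column, limit) condition tuples, and the sample data are all hypothetical, and row_grouper is redefined here so that it copies each window. Treat it as an illustration of the approach under those assumptions, not as a tested drop-in.

```python
import itertools as it


def row_grouper(rows, size):
    """Yield each run of `size` consecutive rows as a new list."""
    iterrows = iter(rows)
    window = list(it.islice(iterrows, size))
    if len(window) < size:
        return
    yield list(window)
    for row in iterrows:
        window.pop(0)
        window.append(row)
        yield list(window)


def first_matches(rows, conditions):
    """Map each condition name to the index of its first matching window.

    Each condition is (name, size, column, limit): it matches a window of
    `size` rows when every value in `column` is below `limit`.
    """
    copies = it.tee(iter(rows), len(conditions))
    groups = [row_grouper(copy, cond[1]) for copy, cond in zip(copies, conditions)]
    found = {}
    for index, windows in enumerate(it.zip_longest(*groups)):
        for (name, size, column, limit), window in zip(conditions, windows):
            if name in found or window is None:
                continue  # already matched, or this grouper ran out
            if all(float(row[column]) < limit for row in window):
                found[name] = index
        if len(found) == len(conditions):
            break  # every condition matched once
    return found


# Made-up rows: column 0 is a label, column 1 a reading.
rows = [["r0", "50"], ["r1", "30"], ["r2", "20"],
        ["r3", "10"], ["r4", "5"], ["r5", "60"]]
conditions = [("two_below_40", 2, 1, 40), ("three_below_25", 3, 1, 25)]
result = first_matches(rows, conditions)
```

Storing the results in a dict replaces the separate found_first, found_second, found_third flags; a name being present in the dict plays the role of the flag.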
Solution 2:
You don't really need to sort your data; just keep track of whether the condition you're looking for has occurred in each of the last N rows of data. Fixed-size collections.deques are good for this sort of thing.
import csv
from collections import deque

filename = 'table.csv'
GROUP_SIZE = 5
THRESHOLD = 40

cond_deque = deque(maxlen=GROUP_SIZE)
with open(filename) as datafile:
    reader = csv.reader(datafile)  # assume delimiter=','
    next(reader)  # skip header row
    for linenum, row in enumerate(reader, start=1):  # process rows of file
        col0, col1, col4, col5, col6, col23, col24, col25 = (
            float(row[i]) for i in (0, 1, 4, 5, 6, 23, 24, 25))
        cond_deque.append(col1 < THRESHOLD)
        if cond_deque.count(True) == GROUP_SIZE:
            print('lines {}-{} had {} consecutive rows with col1 < {}'.format(
                linenum-GROUP_SIZE+1, linenum, GROUP_SIZE, THRESHOLD))
            break  # found, so stop looking
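The same bookkeeping can be pulled out into a small function and tried on a plain list of numbers, which makes the logic easier to check without a CSV file. The function name first_run_below and the sample values are hypothetical; this is a sketch of the deque-of-booleans idea, not a replacement for the script above.

```python
from collections import deque


def first_run_below(values, group_size, threshold):
    """Return the 0-based start index of the first run of `group_size`
    consecutive values below `threshold`, or None if there is none."""
    recent = deque(maxlen=group_size)  # remembers only the last N results
    for index, value in enumerate(values):
        recent.append(value < threshold)
        # count(True) can only reach group_size once the deque is full.
        if recent.count(True) == group_size:
            return index - group_size + 1
    return None


start = first_run_below([50, 30, 20, 10, 60], group_size=3, threshold=40)
```

Because the deque stores only booleans, the memory used stays fixed at group_size entries no matter how long the input is.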