
Looking for a More Efficient Way to Reorganize a Massive CSV in Python

I've been working on a problem where I have data from a large output .txt file, and now have to parse and reorganize certain values into a .csv. I've already written …

Solution 1:

You can try the UNIX sort utility:

sort -n -s -t, -k1,1 infile.csv > outfile.csv

-t sets the field delimiter (a comma here), -k1,1 sorts on the first field only, -s makes the sort stable, and -n compares the keys numerically.
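
If you want to drive this from the rest of your Python pipeline rather than the shell, one way is to call sort through subprocess. This is just a sketch; infile.csv and outfile.csv are placeholder names:

import subprocess

# Run the same UNIX sort command and redirect its stdout into the output file.
with open("outfile.csv", "w") as out:
    subprocess.run(
        ["sort", "-n", "-s", "-t,", "-k1,1", "infile.csv"],
        stdout=out,
        check=True,  # raise if sort exits with a non-zero status
    )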

Solution 2:

If the CSV file fits into your RAM (e.g. it is less than 2 GB), you can just read the whole thing into a list and sort it:

import csv

with open("infile.csv", newline="") as fin, open("outfile.csv", "w", newline="") as fout:
    data = list(csv.reader(fin))       # read every row into memory
    data.sort(key=lambda row: row[0])  # sort on the first column
    csv.writer(fout).writerows(data)

That shouldn't take nearly as long, provided the machine doesn't start swapping. Note that list.sort is a stable sort, so rows whose keys compare equal keep the time order they had in your file.
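
A quick illustration of that stability (the values here are made up):

rows = [["B", "10:00"], ["A", "10:01"], ["B", "10:02"]]
rows.sort(key=lambda row: row[0])
# [['A', '10:01'], ['B', '10:00'], ['B', '10:02']]
# The two "B" rows keep their original relative order.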

If it won't fit into RAM, you'll probably want to do something a bit cleverer. For example, you can record the byte offset of each line along with just the fields you need for sorting (the timestamp and flight ID), sort that small index, and then write the output file by seeking to each recorded offset.
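
Here is a minimal sketch of that offset-index idea, assuming the sort key is the first field and that no field contains a quoted comma (infile.csv and outfile.csv are placeholder names):

index = []   # (key, byte offset) pairs -- much smaller than the file itself
offset = 0

with open("infile.csv", "rb") as f:
    for raw in f:
        key = raw.split(b",", 1)[0]   # first field only; use int(key) for numeric order
        index.append((key, offset))
        offset += len(raw)            # the next line starts right after this one

index.sort(key=lambda pair: pair[0])  # stable, so ties keep their file order

with open("infile.csv", "rb") as f, open("outfile.csv", "wb") as out:
    for _, pos in index:
        f.seek(pos)
        out.write(f.readline())       # copy one line to its sorted position

Only the small index lives in memory; the second pass re-reads each line from disk in sorted order, at the cost of one seek per line.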
