
How To Read Specific Lines Of A Large Csv File

I am trying to read some specific rows of a large CSV file, and I don't want to load the whole file into memory. The indices of the specific rows are given in a list L = [2, 5, 15, 9

Solution 1:

A file doesn't have "lines" or "rows". What you consider a "line" is whatever lies between two newline characters. As a result, you cannot jump to the nth line without reading the lines before it, because finding it requires counting the newline characters that precede it.
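To illustrate the point, here is a minimal sketch (the helper name `nth_line` and the in-memory sample are made up for this example). Reaching line n means iterating past every earlier line, but only one line needs to be held in memory at a time.

```python
import io
from itertools import islice

def nth_line(f, n):
    """Return the n-th line (0-indexed) of an open text file, or None if it is shorter."""
    # islice still consumes lines 0..n-1; it just discards them one at a time.
    return next(islice(f, n, n + 1), None)

sample = io.StringIO("a\nb\nc\nd\n")  # stands in for a real file
third = nth_line(sample, 2)
```

The same call works on a file object returned by `open()`, since files iterate line by line.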

Answer 1: if you consider your example, but with L=[9], unrolling your loops would give:

i = 9
row = (0, {'Col 2': 'row12', 'Col 3': 'row13', 'Col 1': 'row11'})

As you can see, row is a tuple with two members, calling row[i] means row[9], hence the IndexError.
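A small sketch of what goes wrong and how to avoid it, using an invented in-memory CSV: `enumerate` yields `(index, row)` pairs, so the pair should be unpacked rather than indexed.

```python
import csv
import io

data = io.StringIO("Col 1,Col 2,Col 3\nrow11,row12,row13\n")  # hypothetical sample
reader = csv.DictReader(data)

for pair in enumerate(reader):
    # pair is a 2-tuple (index, row_dict), so pair[9] is out of range
    try:
        pair[9]
        failed = False
    except IndexError:
        failed = True
    index, row = pair  # unpack instead of indexing
    first_value = row["Col 1"]
```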

Answer 2: This is very slow because you are reading the file up to the line number every time. In your example, you read the first 2 lines, then the first 5, then the first 15, then the first 98, etc. So you've read the first 5 lines 3 times. You could create a generator that only returns the lines you want (beware, line numbers would be 0-indexed):

def read_my_lines(csv_reader, lines_list):
    for line_number, row in enumerate(csv_reader):
        if line_number in lines_list:
            yield line_number, row

So when you want to process the lines, you would do:

import csv
import os

L = [2, 5, 15, 98, ...]
# open() does not expand "~", so expand it explicitly
with open(os.path.expanduser('~/file.csv')) as f:
    r = csv.DictReader(f)
    for line_number, line in read_my_lines(r, L):
        do_something_with_line(line)
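To make the 0-indexing concrete, here is a sketch on a small in-memory CSV (the sample data is invented). Note that `csv.DictReader` consumes the header line, so index 0 refers to the first data row, not the header.

```python
import csv
import io

def read_my_lines(csv_reader, lines_list):
    for line_number, row in enumerate(csv_reader):
        if line_number in lines_list:
            yield line_number, row

# Hypothetical sample: a header plus five data rows.
sample = io.StringIO("id,name\n0,a\n1,b\n2,c\n3,d\n4,e\n")
picked = list(read_my_lines(csv.DictReader(sample), [1, 3]))
# picked holds the rows at indices 1 and 3: the second and fourth data rows
```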

* Edit *

This could further be improved to stop reading the file when you've read all the lines you wanted:

def read_my_lines(csv_reader, lines_list):
    # make sure every line number shows up only once:
    lines_set = set(lines_list)
    for line_number, row in enumerate(csv_reader):
        if line_number in lines_set:
            yield line_number, row
            lines_set.remove(line_number)
            # Stop when the set is empty
            if not lines_set:
                # return ends the generator (raise StopIteration fails on Python 3.7+)
                return
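One way to check that the early exit works is to count how many rows are actually pulled from the reader. The sketch below ends the generator with a plain `return`; the `counting` wrapper and the use of `range` in place of a CSV reader are hypothetical, for demonstration only.

```python
def read_my_lines(rows, lines_list):
    lines_set = set(lines_list)
    for line_number, row in enumerate(rows):
        if line_number in lines_set:
            yield line_number, row
            lines_set.remove(line_number)
            if not lines_set:
                return  # all requested lines delivered

def counting(rows, counter):
    """Yield rows unchanged while recording how many were consumed."""
    for row in rows:
        counter[0] += 1
        yield row

consumed = [0]
rows = counting(iter(range(1000)), consumed)  # stands in for a csv reader
wanted = list(read_my_lines(rows, [2, 5]))
```

After this runs, `consumed[0]` shows how far into the input the generator had to go before stopping.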

Solution 2:

Assuming L is a list containing the line numbers you want, you could do:

import csv
import os

with open(os.path.expanduser("~/file.csv")) as f:  # open() does not expand "~"
    r = csv.DictReader(f)
    for i, line in enumerate(r):
        if i in L:    # or (i+2) in L: from your second example
            print(line)

That way:

  • you read the file only once
  • you do not load the whole file in memory
  • you only get the lines you are interested in

The only caveat is that you read the whole file even if L = [3].
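If reading past the last wanted row is a concern, one possible workaround (not part of the answer above) is to cap the iteration with `itertools.islice`. This is a sketch on invented in-memory data; it assumes `L` is non-empty and holds 0-based row indices.

```python
import csv
import io
from itertools import islice

L = [3]
wanted = set(L)  # set membership is cheap to test for every row

sample = io.StringIO("n\n" + "".join(f"{i}\n" for i in range(100)))  # hypothetical data
r = csv.DictReader(sample)

selected = []
# islice stops pulling rows from the reader once index max(L) has been produced
for i, line in enumerate(islice(r, max(L) + 1)):
    if i in wanted:
        selected.append(line)

rows_left = sum(1 for _ in r)  # rows the loop above never had to read
```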

Solution 3:

for row in enumerate(r):

will pull tuples. You are then trying to select your ith element from a 2 element tuple.

For example:

>>> for i in enumerate({"a":1, "b":2}): print(i)
(0, 'a')
(1, 'b')

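Additionally, since dictionaries are hash tables, your initial order is not necessarily preserved on older Python versions (insertion order is only guaranteed from Python 3.7 onward). For instance, on an older interpreter:

>>> list({"a":1, "b":2, "c":3, "d":5})
['a', 'c', 'b', 'd']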

Solution 4:

Just to sum up the great ideas, I ended up using something like this. L can be sorted relatively quickly, and in my case it was actually already sorted. So instead of running a membership check against L for every row, it pays off to sort L and then compare each row index only against its first entry. Here is my piece of code:

import csv
import os

count = 0
with open(os.path.expanduser('~/file.csv')) as f:  # open() does not expand "~"
    r = csv.DictReader(f)
    for row in r:
        count += 1
        if L == []:
            break
        elif count == L[0]:
            print(row)
            L.pop(0)

Note that this stops as soon as we've gone through L once.
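A variant of the same idea, sketched under the assumption that the indices are 1-based as in the snippet above (`count` is 1 for the first data row). It walks a sorted copy of the indices through an iterator, so the caller's `L` is left untouched and the repeated `L.pop(0)`, which shifts the whole list each time, is avoided. The function name `pick_sorted` and the sample data are made up.

```python
import csv
import io

def pick_sorted(reader, indices):
    """Yield rows whose 1-based position is in indices; stop after the last one."""
    targets = iter(sorted(set(indices)))
    target = next(targets, None)
    if target is None:
        return
    for count, row in enumerate(reader, start=1):
        if count == target:
            yield row
            target = next(targets, None)
            if target is None:
                return  # nothing left to look for

sample = io.StringIO("v\n" + "".join(f"r{i}\n" for i in range(1, 11)))  # rows r1..r10
L = [5, 2]
rows = [row["v"] for row in pick_sorted(csv.DictReader(sample), L)]
```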
