Python: avoid large list comprehensions

#python #comprehensions #generators #generatorexpressions

Python list comprehensions are usually faster than equivalent for loops that build a list with append(). However, there are situations when they can seriously hurt your program's performance or even exhaust memory and crash it with a MemoryError. In these cases, you might want to consider using generator expressions instead.

Syntactically, these two are very similar. The only difference between them is that you declare list comprehensions with [] and generator expressions with (), just like this:

list_compr = [x**2 for x in range(10)]
gen_expr = (x**2 for x in range(10))

The key point is that a list comprehension is evaluated where it occurs. As soon as we define a list comprehension in the interactive shell, we'll get the result list:

>>> [x**2 for x in range(10)]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

A generator expression, in contrast, will return a generator object:

>>> (x**2 for x in range(10))
<generator object <genexpr> at 0x0000023DD840F7C8>

To get values out of it, we need to call the built-in next() function on it, iterate over it in a loop, or pass it to a constructor such as list(), set(), or tuple().
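Here's a small sketch of those three ways of consuming a generator expression. The variable names are just for illustration:

```python
gen_expr = (x**2 for x in range(5))

# next() pulls exactly one value at a time.
first = next(gen_expr)
second = next(gen_expr)

# A for loop picks up where next() stopped and runs until exhaustion.
rest = []
for value in gen_expr:
    rest.append(value)

# list() materializes a fresh generator expression in one go.
all_values = list(x**2 for x in range(5))

print(first, second, rest, all_values)
```

Note that the for loop only sees the values that next() has not already consumed.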

In the examples above, a list comprehension is actually preferable. When memory is not an issue, list comprehensions usually outperform generator expressions.

The problem arises when you need to handle really large amounts of data, because list comprehensions store all their output in memory at once (as we've just seen in our code).

A generator, in contrast, is designed to produce (yield) results one at a time instead of loading the entire data structure into memory. That makes it possible to work with huge datasets without the risk of memory usage blowing up.
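To get a rough feel for the difference, you can compare the size of the two objects with sys.getsizeof(). This is only a sketch and assumes CPython; the value of n is arbitrary. Keep in mind that getsizeof() reports the size of the container itself and not of the elements it refers to, so the real footprint of the list is even larger:

```python
import sys

n = 1_000_000  # arbitrary size, chosen only for illustration

list_compr = [x**2 for x in range(n)]   # all n results exist right now
gen_expr = (x**2 for x in range(n))     # nothing has been computed yet

list_size = sys.getsizeof(list_compr)   # grows with n
gen_size = sys.getsizeof(gen_expr)      # small and independent of n

print(f'list: {list_size} bytes, generator: {gen_size} bytes')
```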

It's also worth mentioning that generator expressions are especially useful with functions like sum(), min(), and max(), which reduce an iterable to a single value and never need the whole sequence in memory at once.
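For instance, here is a minimal sketch with a made-up list of words. When a generator expression is the only argument of a function call, the extra pair of parentheses can be dropped:

```python
words = ['generator', 'list', 'comprehension', 'loop']  # sample data

# No intermediate list of lengths is ever built.
total_chars = sum(len(w) for w in words)
shortest = min(len(w) for w in words)
longest = max(len(w) for w in words)

print(total_chars, shortest, longest)
```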

They also have another benefit. Generator expressions can easily be chained (composed) together, thus creating a data pipeline that can process massive amounts of data item by item.
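A tiny sketch of such a pipeline might look like this. Each stage is a generator expression that reads from the previous one, and nothing is computed until sum() starts asking for values:

```python
numbers = range(1, 11)

# Stage 1: keep only the even numbers (lazy).
evens = (n for n in numbers if n % 2 == 0)
# Stage 2: square whatever stage 1 yields (lazy).
squares = (n**2 for n in evens)

# The consumer drives the whole pipeline, one item at a time.
total = sum(squares)
print(total)
```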

However, they're not well suited for cases where you need to use the values more than once, because once a generator is exhausted, you can't access the values it produced.
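This is easy to check for yourself. In the sketch below, the second pass over the same generator finds nothing left to consume:

```python
gen_expr = (x**2 for x in range(5))

first_pass = sum(gen_expr)    # consumes every value
second_pass = sum(gen_expr)   # the generator is already exhausted

# If the values are needed more than once, materialize them instead.
squares = list(x**2 for x in range(5))
reused = (sum(squares), max(squares))

print(first_pass, second_pass, reused)
```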

Let's run a small benchmark and also see how generator expressions can be chained together. We're going to:
1) measure the length of each line in a file,
2) then take the square root of each line length,
3) sum up the square roots.
Of course, this example is purely instructional.

I'm going to use 'The Adventures of Sherlock Holmes' in a .txt format, which you can download here if you'd like to practice yourself. You can choose some other text file.

Let's use list comprehensions first:

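import time

filename = 'Sherlock Holmes.txt'
execution = []
for i in range(100):
    start = time.perf_counter()  # monotonic, high-resolution timer
    # The with block closes the file as soon as the list is built
    with open(filename) as f:
        lengths = [len(line) for line in f]  # all line lengths at once
    roots = [x**0.5 for x in lengths]        # all square roots at once
    print(sum(roots))
    end = time.perf_counter()
    execution.append(end - start)
print('Avg execution time with list comprehensions: '
      f'{sum(execution)/len(execution):.5f}')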

With generator expressions, this code will look pretty much the same:

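import time

filename = 'Sherlock Holmes.txt'
execution = []
for i in range(100):
    start = time.perf_counter()  # monotonic, high-resolution timer
    with open(filename) as f:
        lengths = (len(line) for line in f)  # lazy: nothing is read yet
        roots = (x**0.5 for x in lengths)    # lazy: chained onto lengths
        # sum() drives the pipeline, so it must run while the file is open
        print(sum(roots))
    end = time.perf_counter()
    execution.append(end - start)
print('Avg execution time with generator expressions: '
      f'{sum(execution)/len(execution):.5f}')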

The second version runs differently. Instead of building a complete list at each step, Python reads one line from the file, measures its length, takes the square root, and adds it to the running sum. Then the interpreter proceeds to the next line, and so on. This is exactly what chaining (composing) generator expressions into a data pipeline means.
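If you'd like to see that order of execution with your own eyes, here is a small sketch. The log() helper and the two sample strings are made up for this illustration; log() simply records which stage handled which value and passes the value through:

```python
def log(trace, stage, value):
    """Record (stage, value) in trace and return value unchanged."""
    trace.append((stage, value))
    return value

lines = ['abcd', 'abcdefghi']  # stand-ins for two lines of a file

# Generator expressions: each line travels through both stages in turn.
lazy_trace = []
lazy_lengths = (log(lazy_trace, 'length', len(line)) for line in lines)
lazy_roots = (log(lazy_trace, 'root', x**0.5) for x in lazy_lengths)
lazy_total = sum(lazy_roots)

# List comprehensions: the first stage finishes before the second starts.
eager_trace = []
eager_lengths = [log(eager_trace, 'length', len(line)) for line in lines]
eager_roots = [log(eager_trace, 'root', x**0.5) for x in eager_lengths]
eager_total = sum(eager_roots)

print(lazy_trace)
print(eager_trace)
```

With generators the trace alternates between 'length' and 'root', while with lists all the 'length' entries come first.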

So, let's run both versions:

Avg execution time with list comprehensions: 0.00486
Avg execution time with generator expressions: 0.00530

As you can see, on this dataset, list comprehensions run slightly faster (the times are in seconds). But if I run the same code on my machine on a file that contains over 10 million lines (I just copy-pasted the text of my Sherlock Holmes file many times over), the outcome is different:

Avg execution time with list comprehensions: 4.26855
Avg execution time with generator expressions: 3.79113

As you can see, in this case, generator expressions have beaten comprehensions :)
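The timings above only measure speed, while the main argument for generators is memory. One way to look at that side is the standard tracemalloc module. The sketch below is not the benchmark from this post: it replaces the file with synthetic "line lengths" (i % 80), and the peak_memory() helper and the value of n are my own assumptions for illustration:

```python
import tracemalloc

def peak_memory(func):
    """Return the peak number of bytes allocated while func() runs."""
    tracemalloc.start()
    func()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

n = 200_000  # arbitrary number of fake lines

def with_lists():
    lengths = [i % 80 for i in range(n)]   # synthetic line lengths
    roots = [x**0.5 for x in lengths]
    return sum(roots)

def with_generators():
    lengths = (i % 80 for i in range(n))   # same data, produced lazily
    roots = (x**0.5 for x in lengths)
    return sum(roots)

list_peak = peak_memory(with_lists)
gen_peak = peak_memory(with_generators)
print(f'Peak with lists: {list_peak} bytes')
print(f'Peak with generators: {gen_peak} bytes')
```

The exact numbers depend on the Python version and the machine, but the peak for the list version grows with n, while the generator version stays roughly constant.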

CONCLUSION

To sum up, generator expressions are more efficient, especially in memory use, when working with large datasets, and they can keep your program from running out of memory. What's more, they're easily chained together, creating data pipelines that produce result values one by one.

At the same time, on smaller data (how much smaller depends on the machine), list comprehensions usually outperform generator expressions, and they are the better choice if you need to access the produced data more than once.

If you feel like learning something else, check out my previous post, where I explain when it's a good idea to use deques instead of lists.

Connect with me on LinkedIn.