Jerry Ng

Posted on Sep 13, 2021 • Edited on Feb 13, 2022 • Originally published at jerrynsh.com

Using Generators in Python: The Why, The What, and The When

#python #programming #tutorial #webdev

Today, “what are Generators in Python” and “what are Generators used for in Python” are some of the most popular Python interview questions.

Generators are often considered one of the more intermediate concepts in Python. If you are new to Python, you may not have come across them before. Here’s a hint: they have something to do with using yield statements inside a function.

In this post, I am going to highlight some of the use cases, reasons, and advantages of using Generators in Python. In short, you should consider using Generators when dealing with large datasets with memory constraints.

Let’s dive a little bit deeper, shall we?

TL;DR

Consider using Generator when dealing with a huge dataset
Consider using Generator in scenarios where we do not need to iterate over the data more than once
Generators give us lazy evaluation
They are a great way to generate sequences in a memory-efficient manner

Why Should I Care About Using Generators

Memory constraints

To understand why you should use Generators, we first have to understand that computers have a finite amount of memory (RAM). Whenever we store or manipulate variables, lists, and so on, all of that data lives in memory.

You might ask, why do computer programs store them in memory? Because it’s the fastest way for us to write and retrieve data.

Scenarios

Have you ever had to work with a list so large that you ran into a MemoryError? Perhaps you have tried reading rows from a very large Excel (or .csv) file.

In my experience, performing these tasks is either painfully slow or outright impossible.

What Is a Generator Function

To put it simply, a Generator function is a special kind of function that returns many items. The point here is that the items are returned one by one rather than all at once.

The main difference between a regular function and a Generator function lies in the use of return and yield statements in Python.
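
As a minimal sketch of that difference (the function names squares_list and squares_gen are made up for illustration), the first function builds and returns a whole list, while the second yields one value at a time:

```python
def squares_list(n):
    """Regular function: builds the entire list, then returns it."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_gen(n):
    """Generator function: yields one square at a time."""
    for i in range(n):
        yield i * i


print(squares_list(5))        # [0, 1, 4, 9, 16]
print(list(squares_gen(5)))   # [0, 1, 4, 9, 16]
```

Both produce the same values. The difference is that calling squares_gen(5) on its own returns a Generator object without running the loop yet.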

Generators give you lazy evaluation

You may have come across this statement. But, what does it really mean?

If you are familiar with Iterators, calling a Generator function gives you exactly that: a Generator object, which is an Iterator that produces its values one at a time.

Behind the scenes, Generators don’t compute the value of each item when they are instantiated. Rather, they compute it when we ask for it. This is what people mean when they say Generators give you lazy evaluation.

As a result, Generators allow us to process and deal with one value at a time without having to load everything in memory first.
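
To make the laziness visible, here is a small sketch with a hypothetical countdown Generator (not part of the examples in this post) that prints a message every time it actually computes a value:

```python
def countdown(start):
    """Hypothetical generator that announces each value it computes."""
    while start > 0:
        print(f"computing {start}")
        yield start
        start -= 1


gen = countdown(3)    # nothing is printed yet: the body has not started running
first = next(gen)     # prints "computing 3" and pauses at the yield
second = next(gen)    # resumes after the yield and prints "computing 2"
print(first, second)  # 3 2
```

Each call to next runs the function body only up to the next yield, which is why no work happens until a value is requested.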

When and Where Should I Use Generators

Generators are great when you encounter problems that require you to read from a large dataset. Reading from a large dataset means our computer or server would have to allocate memory for it.

The main caveat to remember is that a Generator can only be iterated once. In other words, as long as we do not need to go back to previous values from our dataset, a Generator is a good fit.

Reading sizable CSV

A common use case for Generators is working with large files such as Excel or CSV documents. Without a Generator function, here’s how we could read a CSV file:

# Example of using a regular function
import csv

def read_csv_from_regular_fn():
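    # newline='' is what the csv module documentation recommends for file objects
    with open('large_dataset.csv', 'r', newline='') as f:
        reader = csv.reader(f)
        # Reads every row into a list before returning
        return list(reader)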

result_1 = read_csv_from_regular_fn()

# Output:
# [['a','b','c', ... ], ['x','y','z', ... ] ... ]

Upon running the example above, we may experience some slowness or even MemoryError depending on our computers.

Looking at the code example above, read_csv_from_regular_fn opens our CSV file and loads every row into memory at once to build the result.

This is not a good solution when working with files larger than our available memory. Alternatively, we could do this:

# Example of using a Generator function
import csv

def read_csv_from_generator_fn():
    with open('large_dataset.csv', 'r', newline='') as f:
        reader = csv.reader(f)
        for row in reader:
            yield row

# To get the same output as result_1,
# We generate a list using our newly created Generator function:
result_2 = [row for row in read_csv_from_generator_fn()]

# Output same as result_1:
# [['a','b','c', ... ], ['x','y','z', ... ] ... ]

In this scenario, we use read_csv_from_generator_fn as our Generator function. It opens our large CSV file, loops through every row, and yields one row at a time rather than all of them at once.

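Here, as long as we consume the rows one at a time (for example, in a for loop) rather than collecting them all into a list as result_2 does, we would not run into MemoryError or slowness due to memory constraints when reading data from our large_dataset.csv.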
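
As a rough sketch of consuming such a Generator row by row, the snippet below counts rows without ever building a full list. The read_rows helper and the temporary sample file are assumptions made so that the example runs on its own; with real data you would pass the path to your own file:

```python
import csv
import os
import tempfile


def read_rows(path):
    """Yield rows from a CSV file one at a time."""
    with open(path, newline='') as f:
        for row in csv.reader(f):
            yield row


# Create a small sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.csv', newline='', delete=False) as tmp:
    csv.writer(tmp).writerows([['a', 'b'], ['1', '2'], ['3', '4']])
    sample_path = tmp.name

# Only one row is held in memory at any point in this loop.
row_count = 0
for row in read_rows(sample_path):
    row_count += 1

os.remove(sample_path)
print(row_count)  # 3
```

The same loop shape works for filtering or aggregating rows, as long as each row can be handled independently of the ones before it.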

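To compare the size in bytes of the objects the two functions return, we could do the following. Note that sys.getsizeof measures only the returned object itself (the Generator object or the outer list), not the rows it refers to: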

import sys

print(sys.getsizeof(read_csv_from_generator_fn())) # 112 bytes
print(sys.getsizeof(read_csv_from_regular_fn())) # 1624056 bytes

Iterating through a large list (array)

Another example where Generators are often used is where we intend to process values from a large list:

# Example 1
nums_list_comprehension = [i * i for i in range(100_000_000)]

sum(nums_list_comprehension) # 333333328333333350000000

Depending on your computer, you may encounter MemoryError or at least a couple of seconds of slowness when evaluating the expression above.

Like list comprehensions, the Generator expression allows us to quickly create a Generator object without having to use the yield statement.

To cope with our memory constraint, we could turn the code example above into a Generator expression. Creating the Generator below completes almost immediately, because no values are computed until sum starts consuming them:

# Example 2
nums_generator = (i * i for i in range(100_000_000))
# <generator object <genexpr> at 0x106ecc580>

sum(nums_generator) # 333333328333333350000000

In Example 1, i * i is evaluated for the entire range of 100_000_000 and stored in memory beforehand. It returns a full list.

In Example 2, i * i is only evaluated during iteration, one value at a time. It returns a Generator object.

Remember, Generators don’t compute the value of each item when being instantiated.

The differences in memory usage are below:

import sys

print(sys.getsizeof(nums_generator)) # 112 bytes
print(sys.getsizeof(nums_list_comprehension)) # 835128600 bytes
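
Since sys.getsizeof reports only the size of the object itself and not the integers a list refers to, a complementary check is to measure peak allocations with the standard library’s tracemalloc module. The sketch below is one way to do that, under some assumptions: peak_bytes is a hypothetical helper, and n is kept deliberately small so that it runs quickly:

```python
import tracemalloc


def peak_bytes(fn):
    """Return the peak number of bytes allocated while fn() runs."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


n = 100_000  # much smaller than in the examples above
list_peak = peak_bytes(lambda: sum([i * i for i in range(n)]))
gen_peak = peak_bytes(lambda: sum(i * i for i in range(n)))

print(list_peak > gen_peak)  # True
```

The exact byte counts depend on the Python version and platform, so only the comparison is shown here.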

When NOT To Use Generators

We need the previous values

A Generator can only be iterated once.
The example below shows this with the nums_generator expression: calling sum on it a second time returns zero because the Generator is already exhausted.

# Continuing from Example 2, with a freshly created Generator
nums_generator = (i * i for i in range(100_000_000))
sum(nums_generator) # 333333328333333350000000
sum(nums_generator) # 0, because it can only be iterated once.
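
If more than one pass over the values is needed, two simple options are to create a fresh Generator for each pass, or to materialize the values into a list once when they fit in memory. The sketch below illustrates both; make_squares is a hypothetical helper, not something defined earlier in this post:

```python
def make_squares(n):
    """Hypothetical helper: returns a fresh generator on every call."""
    return (i * i for i in range(n))


# Option 1: build a new generator for each pass.
total = sum(make_squares(10))     # 285
largest = max(make_squares(10))   # 81

# Option 2: materialize once, if the data fits in memory.
squares = list(make_squares(10))
print(sum(squares), max(squares))  # 285 81
```

Option 1 repeats the computation but keeps memory usage low, while Option 2 trades memory for the ability to reuse the values.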

Dealing with relatively small files

When dealing with relatively small files or lists, we may not want to use Generator as it might actually slow us down.

We can run cProfile on our previous examples to compare the performance of a list comprehension and a Generator expression when summing the values up.

cProfile of summing using List Comprehension vs. Generator Expression:

import cProfile

# List Comprehension
# ------------------
cProfile.run('sum([i * i for i in range(100_000_000)])')

#    5 function calls in 13.956 seconds
#    Ordered by: standard name
#    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
#         1    8.442    8.442    8.442    8.442 <string>:1(<listcomp>)
#         1    0.841    0.841   13.956   13.956 <string>:1(<module>)
#         1    0.000    0.000   13.956   13.956 {built-in method builtins.exec}
#         1    4.672    4.672    4.672    4.672 {built-in method builtins.sum}
#         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

# Generator Expression
# --------------------
cProfile.run('sum((i * i for i in range(100_000_000)))')

#    100000005 function calls in 22.996 seconds
#    Ordered by: standard name
#    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
# 100000001   11.745    0.000   11.745    0.000 <string>:1(<genexpr>)
#         1    0.000    0.000   22.996   22.996 <string>:1(<module>)
#         1    0.000    0.000   22.996   22.996 {built-in method builtins.exec}
#         1   11.251   11.251   22.996   22.996 {built-in method builtins.sum}
#         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

From our cProfile results above, we can tell that the list comprehension is considerably faster (about 14 seconds versus 23 seconds in this run), provided we don’t run into memory constraints.

Evidently, if memory is not an issue, we should stick with using regular functions or list comprehensions.
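
One caveat is that a profiler adds overhead to every function call it records, and the Generator version above registers about 100 million of them, so the gap shown by cProfile may be inflated. As a lighter-weight cross-check, here is a sketch using the standard library’s timeit on a small input; the range size and repeat count are arbitrary choices for illustration:

```python
import timeit

small = range(1_000)

# Time 1,000 runs of each approach on the same small input.
list_time = timeit.timeit(lambda: sum([i * i for i in small]), number=1_000)
gen_time = timeit.timeit(lambda: sum(i * i for i in small), number=1_000)

print(f"list comprehension:   {list_time:.3f} s")
print(f"generator expression: {gen_time:.3f} s")
```

The timings vary by machine and Python version, so it is worth running this on your own setup before drawing conclusions.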

Conclusion

In summary, Generators are an amazing tool in Python for scenarios where we do not need to iterate over the data more than once.

As Generators give us lazy evaluation, they are a great way to generate sequences in a memory-efficient manner. We should definitely consider using Generator when dealing with huge datasets to optimize our program.

Thank you for reading!

This article was originally published at jerrynsh.com