Python guide to using generators more

#python #generator #list

When writing a function that's supposed to return a sequence of results, returning some data structure like a list or dictionary is often the most obvious choice, yet not necessarily the most effective one. Sometimes, it can be beneficial to design your functions as generators.

A QUICK REMINDER

If you know the basics, you can skip to the next section. Put simply, a generator is a function that hands back its results one at a time. You create one by using the keyword yield in the function body instead of return.

When execution reaches yield, the function passes back one value, remembers its state, and sleeps until it is resumed to produce the next value.

In fact, generators don't return the results themselves. If you just call them like a normal function, all you'll get is a generator object:

>>> def generator_func(seq):
        for num in seq:
            yield num   
>>> generator_func([1, 2, 3])
<generator object generator_func at 0x0000022BE0AA4848>

To get values out of it, you need to iterate over it, either in a for loop or by passing it to built-in functions like next(), list(), set(), or tuple().

>>> gen = generator_func([1, 2, 3])
>>> next(gen)
1
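To make the pause-and-resume behavior visible, here's a minimal sketch. The names countdown, events, and ticker are made up for this illustration; the events list simply records what the function body did and when.

```python
events = []  # records the order in which things happen

def countdown(start):
    """Yield start, start - 1, ..., 1, logging each step."""
    current = start
    while current > 0:
        events.append(f"about to yield {current}")
        yield current              # execution pauses here
        events.append(f"resumed after {current}")
        current -= 1

ticker = countdown(2)              # nothing in the body has run yet
first = next(ticker)               # runs up to the first yield
second = next(ticker)              # resumes, then pauses at the second yield

print(first, second)
print(events)
```

Notice that "resumed after 2" only gets logged on the second next() call: the function really does stop at yield and pick up from the same spot later.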

Another important feature is that generators exhaust themselves and raise a StopIteration exception when they run out of values to return.

>>> for remaining in gen:
        print(remaining)    
2
3

We only got 2 and 3, because 1 had already been yielded. We don't see any exception here, because for loops catch StopIteration under the hood. So do built-in functions like list(), tuple(), and sum(). A bare next() call doesn't, though:

>>> next(gen)
Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    next(gen)
StopIteration
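If you're curious what "under the hood" means, a for loop behaves roughly like the sketch below. The helper name consume is hypothetical, and this is a simplification of what the interpreter does, not its actual implementation. The last line also shows that next() accepts an optional default, which it returns instead of raising.

```python
def generator_func(seq):
    for num in seq:
        yield num

def consume(iterable):
    """Roughly what `for item in iterable` does, collecting items in a list."""
    collected = []
    iterator = iter(iterable)      # a generator is its own iterator
    while True:
        try:
            item = next(iterator)
        except StopIteration:      # the signal that the values ran out
            break
        collected.append(item)
    return collected

fresh = generator_func([1, 2, 3])
print(consume(fresh))              # [1, 2, 3]
print(next(fresh, "empty"))        # exhausted, so the default comes back: empty
```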

This particular generator is exhausted. To use the function again, you need a new generator object, which means you'll have to call the function again. Now, let's move on to why you'd use generators at all.

WHY DESIGN A FUNCTION AS A GENERATOR?

There are a few reasons to write your function as a generator rather than a normal function that returns:

Achieving better design clarity and readability (and sometimes, but not always, making the code shorter).
Saving memory. A generator feeds the results to the calling code one by one instead of storing the entire data structure in memory (like lists do). In some cases, this can be a deciding factor (read more about it in my previous post).
Separating data processing from using the results. Instead of making one function both produce the output and work with it, you split the functionality. A generator produces the data, and the calling code or some other function uses it. By the way, this is called decoupling interfaces, and it can boost your code's reusability.
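To put a rough number on the memory point, here's a small sketch comparing a list-building function with a generator version. The function names squares_list and squares_gen are invented for this example, and the exact byte counts depend on your Python version and platform, so none are shown.

```python
import sys

def squares_list(n):
    """Build and return the complete list of squares."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result

def squares_gen(n):
    """Yield the same squares one at a time."""
    for i in range(n):
        yield i * i

n = 100_000
as_list = squares_list(n)
as_gen = squares_gen(n)

# getsizeof counts only the container itself, not the integers it refers to,
# so the list's real footprint is even larger than reported here
print(sys.getsizeof(as_list))
print(sys.getsizeof(as_gen))

# Both still produce the same values
print(sum(as_list) == sum(as_gen))   # True
```

The generator object stays small no matter how large n gets, because it never holds more than its current state.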

To clarify the last point, imagine you wrote a function that produces a list and returns it to a calling function, which then carries on torturing the poor list :)

If you later need to change that code, or even toss away the long-suffering list because you now need a dictionary, you'll have to adjust both the function and everything that calls it to the new reality.

However, if you design the processing function as a generator, all you return is a generator object, which you may iterate over or turn into a data structure using built-ins like list(), set(), or tuple(). And when you need to change something, you'll only be changing the calling code, while the generator can be left in peace.
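Here's a tiny sketch of that idea. The function read_scores and the 'name:score' input format are made up for illustration; the point is that one generator can feed whichever data structure the caller happens to need.

```python
def read_scores(rows):
    """Yield (name, score) pairs parsed from 'name:score' strings."""
    for row in rows:
        name, score = row.split(":")
        yield name.strip(), int(score)

rows = ["ann: 3", "bob: 5", "ann: 3"]

# The caller picks the container; read_scores never changes
as_list = list(read_scores(rows))   # keeps order and duplicates
as_set = set(read_scores(rows))     # drops the duplicate pair
as_dict = dict(read_scores(rows))   # maps name -> score

print(as_list)   # [('ann', 3), ('bob', 5), ('ann', 3)]
print(as_dict)   # {'ann': 3, 'bob': 5}
```

Each call to read_scores(rows) creates a fresh generator, which is why the same input can be consumed three times.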

PRACTICAL EXAMPLE

Let's see how it can be done in practice. Say we need to analyze the vocabulary of some text (you can use mine). Tasks like this can get pretty complicated, but we'll keep our code simple to focus on the issue at hand:

from collections import defaultdict
from pprint import pprint
from typing import TextIO

def analyze_text(file: TextIO) -> dict:
    """
    Search text data for words according to a pattern,
    then add found words to a defaultdict, and count their frequency.
    """
    frequency = defaultdict(int)
    pattern = ' \n\'",.;!?()@#$%^&*`~'
    for line in file:
        for word in line.split(' '):
            word = word.strip(pattern).lower()
            if word.isalpha():
                frequency[word] += 1
    return dict(frequency)

if __name__ == '__main__':
    with open('Sherlock Holmes.txt', 'r') as file:
        frequency = analyze_text(file)
    pprint(frequency)

This piece of code is not very flexible. Suppose you want not only frequency analysis but some further processing as well. In that case, you'll have to rewrite both analyze_text and the calling code. In real life, a few changes like that can lead to some messy code. Not to mention that the function has to finish building its entire result before the caller sees anything; here that result is only a dictionary of distinct words, but a version that returned a list of every word in a much bigger file could run out of memory.

With a couple of minor changes, we can make our code more adjustable:

from collections import defaultdict
from pprint import pprint
from typing import Iterator, TextIO

def analyze_text(file: TextIO,
                 pattern: str = ' \n\'",.;!?()@#$%^&*`~') -> Iterator[str]:
    """
    Search text data for words according to a pattern, then
    yield found words.
    """
    for line in file:
        for word in line.split(' '):
            word = word.strip(pattern).lower()
            if word.isalpha():
                yield word

if __name__ == '__main__':
    frequency = defaultdict(int)
    with open('Sherlock Holmes.txt', 'r') as file:
        for word in analyze_text(file):
            frequency[word] += 1  # do the frequency counting here
    pprint(dict(frequency))

Now, the calling code and analyze_text do two completely different jobs. analyze_text breaks down any passed-in text file and yields its words one by one. It doesn't care what you're going to do with these words after that; that's the job of the calling code.
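To show what that separation buys you, here's a sketch that reuses the same analyze_text for three different jobs without touching it. The sample text is made up, and io.StringIO stands in for an open file so the snippet runs on its own; Counter and islice are my picks for the calling code, not something the generator depends on.

```python
import io
from collections import Counter
from itertools import islice

def analyze_text(file, pattern=' \n\'",.;!?()@#$%^&*`~'):
    """Yield lowercase alphabetic words from an iterable of lines."""
    for line in file:
        for word in line.split(' '):
            word = word.strip(pattern).lower()
            if word.isalpha():
                yield word

sample = "The dog saw the cat.\nThe cat ran!\n"

# Job 1: frequency counting
frequency = Counter(analyze_text(io.StringIO(sample)))

# Job 2: the set of distinct words
unique_words = set(analyze_text(io.StringIO(sample)))

# Job 3: just the first three words
first_three = list(islice(analyze_text(io.StringIO(sample)), 3))

print(frequency["the"], frequency["cat"])   # 3 2
print(first_three)                          # ['the', 'dog', 'saw']
```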

What's more, the second version is well suited to processing huge amounts of data: the generator itself only ever holds the current line and the words split from it, so its working memory is bounded by the longest line of input rather than by the size of the file.
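You can check the laziness for yourself with a sketch like the one below. The endless_lines helper and the stats counter are invented for the demonstration: the helper pretends to be a file that never ends and counts how many lines were actually pulled from it.

```python
from itertools import islice

stats = {"lines_read": 0}   # how many lines the "file" has handed out

def endless_lines():
    """Stand-in for a huge file: yields lines forever, counting each one."""
    while True:
        stats["lines_read"] += 1
        yield f"line number {stats['lines_read']} of many\n"

def analyze_text(file, pattern=' \n\'",.;!?()@#$%^&*`~'):
    """Yield lowercase alphabetic words from an iterable of lines."""
    for line in file:
        for word in line.split(' '):
            word = word.strip(pattern).lower()
            if word.isalpha():
                yield word

# Ask for five words only; the digits are skipped by isalpha()
first_five = list(islice(analyze_text(endless_lines()), 5))

print(first_five)            # ['line', 'number', 'of', 'many', 'line']
print(stats["lines_read"])   # 2
```

Only two lines were ever read from a source that has no end, which a function returning a complete list or dictionary could never manage.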

Interested in learning more advanced stuff about generators? Check out my previous post.

Hope you enjoyed my post. If so, please, don't forget to like it :)