AJ Kerrigan

Posted on Jan 2, 2020 • Edited on Feb 14, 2020

The Iterator that Wasn't: A Love Letter to Boto3 Paginators

#python #boto3 #aws

This post originally appeared on an internal company blog, and is adapted here with permission.

Boto3 is the Python SDK for AWS. It does a great job making AWS APIs feel Python-native. This post is a love letter to one particular feature - paginators.

Well, it's that or an excuse to have fun with the error message "'PageIterator' object is not an iterator".

APIs and Pagination in a Nutshell

When you need to fetch a list of objects from a service, most APIs will return that list in chunks if there is a large number of results. That's usually a good thing for everyone:

The API can get you a single page of results more quickly than it can send everything
You can start working sooner
If you wander off in the middle of processing, there's less wasted effort

Everybody wins!

Boto3 and Pagination - DIY Mode

When AWS APIs return a truncated response, they include a token that you can use to fetch the next page of results.

If you need to build a complete result set from paginated responses, one option is to handle everything yourself. For example, here's one way you could fetch spot instance pricing history for the first day of October 2019:



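import boto3
from datetime import datetime

ec2 = boto3.client('ec2')

# Cover the whole first day of October 2019 (start inclusive, end exclusive)
params = {
    'StartTime': datetime(2019, 10, 1),
    'EndTime': datetime(2019, 10, 2)
}

results = []

# This API signals the last page with an empty NextToken
while params.get('NextToken') != '':
    response = ec2.describe_spot_price_history(**params)
    results.extend(response['SpotPriceHistory'])
    # Treat a missing token the same as an empty one so the loop still ends
    params['NextToken'] = response.get('NextToken', '')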

It's not the cleanest code, but bear with me for a moment. Notice the dictionary keys you need to fiddle with manually:

NextToken comes back in a paginated response, and you feed it to a follow-up request to get another page of results
SpotPriceHistory is where you look in each response for the result data you care about

It can be easy to mix things up when handling those tokens, and to make things more interesting, the exact pagination rules vary by service. Ian McKay's excellent summary in iann0036/aws-pagination-rules shows just how varied those rules are.
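To make the token juggling concrete, here is a rough sketch of a do-it-yourself helper. The name paginate_manually, its parameters, and the fake_call stand-in are all made up for illustration. It assumes the request and response use the same token key, and that a missing or empty token means the last page; neither holds for every service.

```python
def paginate_manually(call, result_key, token_key="NextToken", **params):
    """Yield items from every page of a token-paginated call.

    Hypothetical helper. Assumes the request and the response share
    `token_key`, and that a missing or empty token marks the last page.
    """
    params = dict(params)  # copy so the caller's arguments are untouched
    while True:
        response = call(**params)
        yield from response.get(result_key, [])
        token = response.get(token_key)
        if not token:
            break
        params[token_key] = token


# Stand-in for a real client method, so the sketch runs without AWS access
_fake_pages = {
    None: {"Items": [1, 2], "NextToken": "a"},
    "a": {"Items": [3], "NextToken": ""},
}


def fake_call(**kwargs):
    return _fake_pages[kwargs.get("NextToken")]


all_items = list(paginate_manually(fake_call, "Items"))
```

With a real client, the call would look something like paginate_manually(ec2.describe_spot_price_history, 'SpotPriceHistory', StartTime=..., EndTime=...). Every service that names its tokens differently would need its own tweaks, which is the chore paginators take off your hands.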

Boto3 includes a helpful paginator abstraction that makes this whole process much smoother. To get a collection of EBS volumes for example, you might do something like this:



client = boto3.client('ec2')
paginator = client.get_paginator('describe_volumes')
vols = (vol for page in paginator.paginate()
        for vol in page['Volumes'])

The paginator abstraction helps paper over differences between services, and hides a bunch of details under a rug so you don't need to care about them.

My Confusion

At some point shortly after learning about boto3 paginators, I was experimenting with some code and wanted to look at just the first page of results. Seeing that paginate() returns a PageIterator, I figured I'd be able to do something like this:



page_iter = paginator.paginate()
first_page = next(page_iter)

But no! I got back this error message:



TypeError: 'PageIterator' object is not an iterator

Wait... what?

So I had an object that was calling itself an iterator, and worked fine in a for loop. But it wouldn't work with next() or a subscript (such as page_iter[0]). As a workaround, I tried this:



page_iter = iter(paginator.paginate())
first_page = next(page_iter)

And... that worked. I didn't completely understand why it worked or was necessary in the first place. I just figured for some weird reason I had to do explicitly what the for loop was doing implicitly. This wasn't something I needed to do all that often, so I didn't give it too much thought.
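If you do want this regularly, the workaround fits in a tiny helper. This is only a sketch: first_page is a hypothetical name, and FakePaginator and FakePages are stand-ins I made up so the example runs without AWS credentials. They mimic only the one behavior that matters here, a paginate() result that defines __iter__() but not __next__().

```python
def first_page(paginator, **kwargs):
    """Return the first page from a paginator, or None if there are no pages."""
    return next(iter(paginator.paginate(**kwargs)), None)


class FakePages:
    """Stand-in for the object paginate() returns (not the real class)."""

    def __init__(self, pages):
        self._pages = pages
        self.fetched = 0  # counts how many pages were actually produced

    def __iter__(self):
        for page in self._pages:
            self.fetched += 1
            yield page


class FakePaginator:
    """Stand-in for a boto3 paginator."""

    def __init__(self, pages):
        self.page_iterator = FakePages(pages)

    def paginate(self, **kwargs):
        return self.page_iterator


fake = FakePaginator([{"Volumes": ["vol-1"]}, {"Volumes": ["vol-2"]}])
page = first_page(fake)
```

Because the pages come from a generator, asking for one page should only produce one page. In the stand-in, fake.page_iterator.fetched is 1 after the call.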

The Cookbook Clarifies

Fast forwarding a bit, I read David Beazley's excellent Python Cookbook. Recipe 4.6 in the 3rd edition is called "Defining Generator Functions with Extra State". It succinctly describes what boto3's PageIterator is doing and why. The recipe opens with this:

Problem: You would like to define a generator function, but it involves extra state that you would like to expose to the user somehow. (Beazley, David; Jones, Brian K.. Python Cookbook: Recipes for Mastering Python 3 (p. 120). O'Reilly Media.)

As it turns out, that's exactly the problem PageIterator solves. It needs to:

Iterate through some data. In this case, pages from an API response.
Expose extra state and functionality to the user. For example, keep track of a resume token that the caller can use to resume interrupted pagination. Or offer a search method which can find entries matching a JMESPath expression across response pages.
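Those two jobs can be sketched with a toy class in the spirit of the recipe. To be clear, this is not botocore's implementation: TokenPages is a made-up name, the resume token is just a page index, and search() does a plain key lookup where the real thing takes a JMESPath expression.

```python
class TokenPages:
    """Toy page iterator: iteration plus extra state (illustrative only)."""

    def __init__(self, pages):
        self._pages = pages
        self.resume_token = None  # extra state exposed to the caller

    def __iter__(self):
        # A generator method: each call starts a fresh pass over the pages
        for index, page in enumerate(self._pages):
            # Record where a later run could pick up again
            self.resume_token = index + 1
            yield page

    def search(self, key):
        """Yield every item stored under `key` across all pages."""
        for page in self:
            yield from page.get(key, [])


pages = TokenPages([{"Volumes": ["v1", "v2"]}, {"Volumes": ["v3"]}])

it = iter(pages)
first = next(it)                    # consume only the first page
token_after_first = pages.resume_token

volumes = list(pages.search("Volumes"))
```

The generator handles the iteration, while the instance around it carries the state and helper methods. A bare generator function would have nowhere to hang resume_token or search().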

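There's some handy stuff going on under the covers of the PageIterator class! But in the normal case where you slap it into a for loop (or call iter() explicitly), Python invokes the class's __iter__() "magic" method behind the scenes. That method is a generator function, so what comes back is a generator: a real iterator that next() is happy to work with.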

So perhaps bizarrely, PageIterator is an accurate name (it does let you iterate through pages) and a misleading one (it does not directly support Python's iterator protocol). It behaves more like an iterable:

Iterables can be used in a for loop and in many other places where a sequence is needed (zip(), map(), …). When an iterable object is passed as an argument to the built-in function iter(), it returns an iterator for the object. This iterator is good for one pass over the set of values. When using iterables, it is usually not necessary to call iter() or deal with iterator objects yourself. The for statement does that automatically for you, creating a temporary unnamed variable to hold the iterator for the duration of the loop. See also iterator, sequence, and generator.
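The distinction is easy to check with a minimal class. Pages below is a made-up example that defines __iter__() and nothing else, which is the same shape that tripped me up.

```python
from collections.abc import Iterable, Iterator


class Pages:
    """Minimal iterable: defines __iter__() but not __next__()."""

    def __init__(self, pages):
        self._pages = pages

    def __iter__(self):
        yield from self._pages


pages = Pages(["page-1", "page-2"])

is_iterable = isinstance(pages, Iterable)  # it has __iter__()
is_iterator = isinstance(pages, Iterator)  # it would also need __next__()

try:
    next(pages)
except TypeError as exc:
    error_message = str(exc)

# iter() calls __iter__(), which hands back a generator (a real iterator)
first = next(iter(pages))
```

Here is_iterable is True and is_iterator is False, and the caught message has the same "object is not an iterator" wording as the error that started this whole post.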

Trey Hunner does a great job of talking through these concepts clearly. There are helpful sections in both The Iterator Protocol: How "For Loops" Work in Python and How to make an iterator in Python that are relevant for understanding PageIterator's behavior.

For anyone who made it this far and is interested in the actual code, PageIterator is defined in botocore's paginate module (botocore/paginate.py).

Wrap-Up

I originally wrote this post on an internal company blog a couple years ago. I've made some minor updates, but the content is largely the same. As I wrote at the time, it's possible that I only think I understand this stuff! Questions, comments and criticism are all welcome in the comments. Thanks for reading!

Most importantly, if you spot any outright nonsense in here please call me out (publicly or privately) so I can get it fixed before leading any innocent bystanders astray. Thank you.
