wrighter

Posted on Nov 9, 2021 • Originally published at wrighters.io on May 30, 2021

How to iterate over pandas DataFrame rows (and should you?)

#pandas #python

One of the most searched for (and discussed) questions about pandas is how to iterate over rows in a DataFrame. Often this question comes up right away for new users who have loaded some data into a DataFrame and now want to do something useful with it. The natural way for most programmers to think of what to do next is to build a loop. They may not understand the “correct” way to work with DataFrames yet, but even experienced pandas and NumPy developers will consider iterating over the rows of a DataFrame to solve a problem. Instead of trying to find the one right answer about iteration, it makes better sense to understand the issues involved and know when to choose the best solution.

As of this writing, the top voted question tagged with ‘pandas’ on Stack Overflow is about how to iterate over DataFrame rows. It also turns out that question has the most copied answer with a code block on the entire site. The Stack Overflow developers say thousands of people view the answer weekly and copy it to solve their problem. Obviously people want to iterate over DataFrame rows!

It is also true that there can be serious consequences with iterating over DataFrame rows using the top solution. Other answers to the question (especially the second highest rated answer) do a fairly good job of giving other options, but the entire list of 26 (and counting!) answers is extremely confusing. Instead of asking how to iterate over DataFrame rows, it makes more sense to understand what the options are that are available, what their advantages and disadvantages are, and then choose the one that makes sense for you. In some cases, the top voted answer for iteration might be the best choice!

But I have heard that iteration is wrong, is that true?

First, choosing to iterate over the rows of a DataFrame is not automatically the wrong way to solve a problem. However, in most cases what beginners are trying to do with iteration is better done with another approach. However, no one should ever feel bad about writing a first solution that uses iteration instead of other (perhaps better) ways. That’s often the best way to learn, you can think of a first solution as the first draft of your essay, you can improve it with some editing.

Now what do we want to do with the `DataFrame`?

Let’s start with basic questions. If we look at the original question on Stack Overflow, the question and answer just print the content of the DataFrame. First off, let’s all agree that this is not a good way to look at the content of a DataFrame. The standard rendering of a DataFrame , whether it is rendered with print or viewed with a Jupyter notebook using display or as an output in a cell will be far better than what would be printed using custom formatting.

If the DataFrame is large, only some columns and rows may be visible by default. Use head and tail to get a sense of the data. If you want to only look at subsets of a DataFrame, instead of using a loop to only display those rows, use the powerful indexing capabilities of pandas. With a little practice, you can select any combinations of rows or columns to show. Start there first.

Now instead of a trivial printing example, let’s look at ways to actually use data for a row in a DataFrame that includes some logic.

Example

Let’s build an example DataFrame to use. I’ll do this by making some fake data (using Faker). Note that the columns are different data types (we have some strings, an integer, and dates).

from datetime import datetime, timedelta

import pandas as pd
import numpy as np
from faker import Faker

fake = Faker()

today = datetime.now()
next_month = today + timedelta(days=30)
df = pd.DataFrame([[fake.first_name(), fake.last_name(),
                    fake.date_this_decade(), fake.date_between_dates(today, next_month),
                    fake.city(), fake.state(), fake.zipcode(), fake.random_int(-100,1000)]
                  for r in range(100)],
                  columns=['first_name', 'last_name', 'start_date',
                           'end_date', 'city', 'state', 'zipcode', 'balance'])

df['start_date'] = pd.to_datetime(df['start_date']) # convert to datetimes
df['end_date'] = pd.to_datetime(df['end_date'])

df.dtypes

first_name object
last_name object
start_date datetime64[ns]
end_date datetime64[ns]
city object
state object
zipcode object
balance int64
dtype: object

df.head()

  first_name last_name start_date end_date city state \
0 Katherine Moody 2020-02-04 2021-06-28 Longberg Maryland   
1 Sarah Merritt 2021-03-02 2021-05-30 South Maryborough Tennessee   
2 Karen Hensley 2020-02-29 2021-06-23 Brentside Missouri   
3 David Ferguson 2020-02-02 2021-06-14 Judithport Virginia   
4 Phillip Davis 2020-07-17 2021-06-04 Louisberg Minnesota   

  zipcode balance  
0 20496 493  
1 18495 680  
2 63702 427  
3 66787 587  
4 98616 211

A first attempt

Let’s say that our DataFrame contains customer data and we have a scoring function for customers that uses multiple customer attributes to give them a score between ‘A’ and ‘F’. Any customer with a negative balance is scored an ‘F’, above 500 is an ‘A’, and after that, logic depends on if a customer is a ‘legacy’ customer and what state they live in.

Note that I made doctests for this function, see my post on Jupyter unit testing for more details on how to unit test in Jupyter.

from dataclasses import dataclass

@dataclass
class Customer:
    first_name: str
    last_name: str
    start_date: datetime
    end_date: datetime
    city: str
    state: str
    zipcode: str
    balance: int

def score_customer(customer:Customer) -> str:
    """Give a customer a credit score.
    >>> score_customer(Customer("Joe", "Smith", datetime(2020, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, -5))
    'F'
    >>> score_customer(Customer("Joe", "Smith", datetime(2020, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 50))
    'C'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 50))
    'D'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 150))
    'C'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 250))
    'B'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Chicago", "Illinois", 66666, 350))
    'B'
    >>> score_customer(Customer("Joe", "Smith", datetime(2021, 1, 1), datetime(2023,1,1), "Santa Fe", "California", 88888, 350))
    'A'
    >>> score_customer(Customer("Joe", "Smith", datetime(2020, 1, 1), datetime(2023,1,1), "Santa Fe", "California", 88888, 50))
    'C'
    """
    if customer.balance < 0:
        return 'F'
    if customer.balance > 500:
        return 'A'
    # legacy vs. non-legacy
    if customer.start_date > datetime(2020, 1, 1):
        if customer.balance < 100:
            return 'D'
        elif customer.balance < 200:
            return 'C'
        elif customer.balance < 300:
            return 'B'
        else:
            if customer.state in ['Illinois', 'Indiana']:
                return 'B'
            else:
                return 'A'
    else:
        if customer.balance < 100:
            return 'C'
        else:
            return 'A'

import doctest
doctest.testmod()

TestResults(failed=0, attempted=8)

Scoring our customers

OK, now that we have a concrete example, how do we obtain the score for all of our customers? Let’s just go straight to the top answer from the Stack Overflow question, DataFrame.iterrows. This is a generator that returns the index for a row along with the row as a Series. If you aren’t familiar with what a generator is, you can think of it as a function you can iterate over. As a result, calling next on it will yield the first element.

next(df.iterrows())

(0,
 first_name Katherine
 last_name Moody
 start_date 2020-02-04 00:00:00
 end_date 2021-06-28 00:00:00
 city Longberg
 state Maryland
 zipcode 20496
 balance 493
 Name: 0, dtype: object)

This looks promising! This is a tuple containing the index of the first row and the row data itself. Maybe we can just pass it right into our function. Let’s try that out and see what happens. Even though the row is a Series, the columns are the same as the attributes of our Customer class, so we might be able to just pass this into our scoring function.

score_customer(next(df.iterrows())[1])

'A'

Wow, that seemed to work. Can we just score the entire table?

df['score'] = [score_customer(c[1]) for c in df.iterrows()]

Is this our best choice?

Wow, that seems too easy. You can see why this is the top voted answer, since it seems to do exactly what we want. Why would there be any controversy about this answer?

As is usually the case with pandas (and really with any software engineering question), picking an ideal solution depends on the inputs. Let’s summarize what the issues could be with various design choices. If the issues raised don’t fit your specific use case, iteration using iterrows may be a perfectly acceptable solution! I won’t judge you. I use it plenty of times, and will summarize at the end how to make decisions about the possible solutions.

The arguments for and against using iterrows can be grouped into the following categories.

Efficiency (Speed and Memory)
Mixed types in a row causing issues
Readability and maintainability

Speed and Memory

In general, if you want things to be fast in pandas (or Numpy, or any framework that offers vectorized calculations), you will not want to iterate through elements but instead choose a vectorized solution. However, even if the solution can be vectorized, it might be a lot of work for the programmer to do so, especially a beginner. Other answers to the question on Stack Overflow present a host of other solutions. They mostly all fall into one of the following categories, in the following order of preference for speed:

Vectorization
Cython routines
List comprehensions (vanilla for loop)
DataFrame.apply()
DataFrame.itertuples() and iteritems()
DataFrame.iterrows()

Vectorization

The main problem with always telling people to vectorize everything is that at times a vectorized solution may be a real chore to write, debug, and maintain. The examples given to prove that vectorization is preferred often show trivial operations, like simple multiplication. But since the example I started with in this article is not just a single calculation, I decided to write one possible vectorized solution to this problem.

def vectorized_score(df):
    return np.select([df['balance'] < 0,
                      df['balance'] > 500, # technically not needed, would fall through
                      ((df['start_date'] > datetime(2020,1,1)) &
                       (df['balance'] < 100)),
                      ((df['start_date'] > datetime(2020,1,1)) &
                       (df['balance'] >= 100) &
                       (df['balance'] < 200)),
                      ((df['start_date'] > datetime(2020,1,1)) &
                       (df['balance'] >= 200) &
                       (df['balance'] < 300)),
                      ((df['start_date'] > datetime(2020,1,1)) &
                       (df['balance'] >= 300) &
                       df['state'].isin(['Illinois', 'Indiana'])),
                      ((df['start_date'] >= datetime(2020,1,1)) &
                       (df['balance'] < 100)),
                     ], # conditions
                     ['F',
                      'A',
                      'D',
                      'C',
                      'B',
                      'B',
                      'C'], # choices
                     'A') # default score

assert (df['score'] == vectorized_score(df)).all()

There’s more than one way to do this, of course. I chose to use np.select (you can read more about it and other various ways to update DataFrames in my article on using where and mask.) I sort of like using np.select when you have multiple conditions like this, although it’s not extremely readable. We could have also done this using more code with vectorized updates for each step and made it much more readable. It would probably be similar in terms of speed.

I personally find this very unreadable, but maybe with some good comments it could be clearly explained to future maintainers (or my future self). But the reason we are doing vectorized code is to make this faster. How does performance look for our sample DataFrame?

%timeit vectorized_score(df)

2.75 ms ± 489 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Let’s also time our original solution.

%timeit [score_customer(c[1]) for c in df.iterrows()] 

13.5 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

OK, so we’re almost 5x faster, just with our tiny dataset. This speedup wouldn’t be enough to matter for small sizes, but with big datasets a simple rewrite to get that much of a speedup makes sense. And I’m sure that a faster vectorized version could be written with a little thought and profiling applied to the situation. But hold on until the end to see what the performance looks like for larger datasets.

Cython

Cython is a project that makes it easy to write C extensions for Python using (mostly) Python syntax. I confess that I’m far from a Cython expert, but have found that even just a little bit of effort in Cython can make a Python code hotspot much faster. In this case, we have shown that we can make a vectorized solution, so using Cython in a non-vectorized solution would probably not be worth pursuing as a first choice. However, I did write a simple Cython version here and it was the fastest of the non-vectorized solutions at smaller sized inputs, even with just a tiny bit of effort. Especially for cases where there is a lot of calculation done per row that can’t be vectorized, using Cython might be a great choice, but will require an investment in time.

List comprehensions

Now the next option is a little different. I admit that I don’t think I’ve used this technique often. The idea here is to use a list comprehension, invoking your function with each element in your DataFrame. Note that I did use a list comprehension already in our first solution, but it was along with iterrows. This time instead of using iterrows, the data is pulled out of each column in the DataFrame directly and then iterated over. No Series is created in this case. If your function has multiple arguments, you can use zip to make tuples of the arguments, passing in the columns in your DataFrame to match the argument order. Now to do this, I’ll need a modified scoring function, since I don’t have already constructed Customer objects in my DataFrame, and creating them just to invoke the function would add another layer. I only use three attributes of the customer, so here’s a simple rewrite.

def score_customer_attributes(balance:int, start_date:datetime, state:str) -> str:
    if balance < 0:
        return 'F'
    if balance > 500:
        return 'A'
    # legacy vs. non-legacy
    if start_date > datetime(2020, 1, 1):
        if balance < 100:
            return 'D'
        elif balance < 200:
            return 'C'
        elif balance < 300:
            return 'B'
        else:
            if state in ['Illinois', 'Indiana']:
                return 'B'
            else:
                return 'A'
    else:
        if balance < 100:
            return 'C'
        else:
            return 'A'

And here’s what the first loop of the list comprehension will look like when calling the function.

next(zip(df['balance'], df['start_date'], df['state']))

(493, Timestamp('2020-02-04 00:00:00'), 'Maryland')

We will now build a list of all the scores for the entire DataFrame.

df['score3'] = [score_customer_attributes(*a) for a in zip(df['balance'], df['start_date'], df['state'])]
assert (df['score'] == df['score3']).all()

Now how fast is this?

%timeit [score_customer_attributes(*a) for a in zip(df['balance'], df['start_date'], df['state'])]

171 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Wow, that’s much faster, over 70x faster than the original for this data. By just taking the raw data and invoking a simple Python function, the scores are all calculated quickly in Python space. No row conversions to Series need to take place.

Note that we could also invoke our original function, we’d just have to make a Customer object to pass in. This is a bit uglier, but still quite fast.

%timeit [score_customer(Customer(first_name='', last_name='', end_date=None, city=None, zipcode=None, balance=a[0], start_date=a[1], state=a[2])) for a in zip(df['balance'], df['start_date'], df['state'])]

254 µs ± 2.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

DataFrame.apply

We can also use DataFrame.apply. Note that to apply this to rows, you need to pass in the correct axis since it defaults to applying to each column. The axis argument here is specifying which index you want to have in the object passed to your function. We want each object to be a customer row, with the columns as the index.

assert (df.apply(score_customer, axis=1) == df['score']).all()

%timeit df.apply(score_customer, axis=1)

3.57 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The performance here is better than our original, over 3x faster. This is also very readable, and allows us to use our easy to read and maintain original function. It’s still slower than the list comprehension though because it is constructing a Series object for each row.

DataFrame.iteritems and DataFrame.itertuples

Now we will look at the regular iteration methods in more detail. There are three iter functions available for DataFrames: iteritems, itertuples, and iterrows. DataFrames also support iteration directly, but these functions don’t all iterate over the same things. Since understanding what all these methods do by just seeing their names can be really confusing, let’s review them all here.

iter(df) (calls the DataFrame. __iter__ method). Iterate over the info axis, which for DataFrames is the column names, not the values.

next(iter(df)) # 'first_name'

'first_name'

iteritems. Iterate over the columns, returning a tuple of column name and the column as a Series.

next(df.iteritems())
next(df.items()) # these two are equivalent

('first_name',
 0 Katherine
 1 Sarah
 2 Karen
 3 David
 4 Phillip
          ...     
 95 Robert
 96 Christopher
 97 Kristen
 98 Nicholas
 99 Caroline
 Name: first_name, Length: 100, dtype: object)

items. This is the same as above. iteritems actually just invokes items.

next(df.iterrows())

(0,
 first_name Katherine
 last_name Moody
 start_date 2020-02-04 00:00:00
 end_date 2021-06-28 00:00:00
 city Longberg
 state Maryland
 zipcode 20496
 balance 493
 score A
 score3 A
 Name: 0, dtype: object)

iterrows. We already have seen this, it iterates through the rows, but returns them as a tuple of index and the row, as a Series.
itertuples. Iterates over the rows, returning a namedtuple for each row. You can optionally change the name of the tuple and disable the index being returned.

next(df.itertuples())

Pandas(Index=0, first_name='Katherine', last_name='Moody', start_date=Timestamp('2020-02-04 00:00:00'), end_date=Timestamp('2021-06-28 00:00:00'), city='Longberg', state='Maryland', zipcode='20496', balance=493, score='A', score3='A')

Using itertuples

Since we already looked at iterrows, we only need to look at itertuples. As you can see, the returned value, a namedtuple, can be used in our original function.

assert ([score_customer(c[1]) for c in df.iterrows()] == df['score']).all()

%timeit [score_customer(t) for t in df.itertuples()] 

858 µs ± 5.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The performance here is pretty good, over 12x faster. The construction of a namedtuple for each row is much faster than construction of a Series.

Mixed types in a row

Now is a good time to bring up another difference between iterrows and itertuples. A namedtuple can properly represent any type in a single row. In our case, we have strings, date types, and integers. A pandas Series, however, has to have only one datatype for the entire Series. Because our datatypes were diverse enough, they were all represented as object types, and ended up retaining their type, with no functionality issues for us. But this is not always the case!

If your columns have different numerical types, for example, they will end up being the type that can represent all of them. This can cause your data returned by itertuples and iterrows to be slightly different between these two methods, so watch out.

dfmixed = pd.DataFrame({'integer_column': [1,2,3], 'float_column': [1.1, 2.2, 3.3]})
dfmixed.dtypes

integer_column int64
float_column float64
dtype: object

next(dfmixed.itertuples())

Pandas(Index=0, integer_column=1, float_column=1.1)

next(dfmixed.iterrows())

(0,
 integer_column 1.0
 float_column 1.1
 Name: 0, dtype: float64)

Column names

One other word of warning. If your DataFrame has columns that cannot be represented as Python variable names, you will not be able to access them using dot syntax. So if you have a column named 2b or My Column then you’ll have to access them using positional names (i.e. the first column will be called _1). For iterrows, the row will be a Series, so you’ll have to access the columns using ["2b"] or ["My Column"].

Other choices

There are other options for iteration, of course. For example, you could increment an integer offset and use the iloc indexer on the DataFrame to select any row. Of course, this is really no different from any other iteration, while also being non-idiomatic so others reading your code will probably find it hard to read and understand. I built a naive version of this in the performance comparison code for the summary below, if you want to see it (the performance was horrible).

Choosing well

Choosing the right solution depends on essentially two factors:

How big is your data set?
What can you write (and maintain) easily?

In the image below, you can see the running time for the solutions we’ve considered (the code to generate this is here). As you can see, only the vectorized solution holds up well with larger data. If your data set is huge, vectorized solutions may be your only reasonable choice.

Comparative runtimes for various methods on our DataFrame.

However, depending on how many times you need to execute your code, how long it takes you to write it correctly, and how well you can maintain it going forward, you may choose any of the other solutions and be fine. In fact, they all grow linearly with increasing data for these solutions.

Maybe one way to think about this is not just big-O notation, but “big-U” notation. In other words, how long will it take YOU to write a correct solution? If it’s less than the running time of your code, an iterative solution may be totally fine. However, if you’re writing production code, take the time to learn how to vectorize.

One other point; sometimes writing the iterative solution on a smaller set is easy, and you may want to do that first, then write the vectorized version. Verify your results with the iterative solution to make sure you did it correctly, then use the vectorized version on the larger full data set.

I hope you’ve found this dive into DataFrame iteration interesting. I know I learned a few useful things along the way.

The post How to iterate over DataFrame rows (and should you?) appeared first on wrighters.io.

DEV Community

How to iterate over pandas DataFrame rows (and should you?)

But I have heard that iteration is wrong, is that true?

Now what do we want to do with the `DataFrame`?

Example

A first attempt

Scoring our customers

Is this our best choice?

Speed and Memory

Vectorization

Cython

List comprehensions

DataFrame.apply

DataFrame.iteritems and DataFrame.itertuples

Using itertuples

Mixed types in a row

Column names

Other choices

Choosing well

Top comments (0)

But I have heard that iteration is wrong, is that true?

Now what do we want to do with the DataFrame?

Example

A first attempt

Scoring our customers

Is this our best choice?

Speed and Memory

Vectorization

Cython

List comprehensions

DataFrame.apply

DataFrame.iteritems and DataFrame.itertuples

Using itertuples

Mixed types in a row

Column names

Other choices

Choosing well

Now what do we want to do with the `DataFrame`?