DEV Community

wrighter
wrighter

Posted on • Originally published at wrighters.io on

Boolean Indexing in Pandas

This is the third post in the series on indexing and selecting data in pandas. If you haven’t read the others yet, see the first post that covers the basics of selecting based on index or relative numerical indexing, and the second post, that talks about slicing. In this post, I’m going to talk about boolean indexing which is the way that I usually select subsets of data when I work with pandas.

What is boolean indexing?

For those familiar with NumPy, this may already be second nature, but for beginners it is not so obvious. Boolean indexing works for a given array by passing a boolean vector into the indexing operator ([]), returning all values that are True.

One thing to note, this array needs to be the same length as the array dimension being indexed.

Let’s look at an example.

>>> import pandas as pd
>>> import numpy as np
>>>
>>> a = np.arange(5)
>>> a
array([0, 1, 2, 3, 4])
Enter fullscreen mode Exit fullscreen mode

Now we can select the first, second and last elements of our array using a list of array indices.

>>> a[[0, 1, 4]]
array([0, 1, 4])
Enter fullscreen mode Exit fullscreen mode

Boolean indexing can do the same, by creating a boolean array of the same size as the entire array, with elements 0, 1 and 4 set to True, all others False.

>>> mask = np.array([True, True, False, False, True])
>>> a[mask]
array([0, 1, 4])
Enter fullscreen mode Exit fullscreen mode

Boolean operators

So now we know how to index our array with a single boolean array. But building that array by hand is a pain, so what you will usually end up doing is applying operations to the original array that return a boolean array themselves.

For example, to select all elements less than 3:

>>> a[a < 3]
array([0, 1, 2])
Enter fullscreen mode Exit fullscreen mode

or all even elements:

>>> a[a % 2 == 0]
array([0, 2, 4])
Enter fullscreen mode Exit fullscreen mode

And we can combine these using expressions, to AND them (&) or OR them (|). With these operators, we can select the same elements from our first example.

>>> a[(a < 2) | (a >= 4)]
array([0, 1, 4])
Enter fullscreen mode Exit fullscreen mode

Another very helpful operators is the inverse or not operator, (~). Remember to watch your parentheses.

>>> a[~((a < 2) | (a >= 4))]
array([2, 3])
Enter fullscreen mode Exit fullscreen mode

On to pandas

In pandas, boolean indexing works pretty much like in NumPy, especially in a Series. You pass in a vector the same length as the Series. Note that this vector doesn’t have to have an index, but if you use a Series as the argument, it does have an index so you need to be aware of how your index aligns. A common method of using boolean indexing is to apply functions to the original Series so your index will always match.

Series

>>> s = pd.Series(np.arange(5), index=list("abcde"))
>>> s
a 0
b 1
c 2
d 3
e 4
dtype: int64
>>> s[[True, True, False, False, True]] # this vector is just a list of boolean values
a 0
b 1
e 4
dtype: int64
>>> s[np.array([True, True, False, False, True])] # this vector is a NumPy array of boolean values
a 0
b 1
e 4
dtype: int64
Enter fullscreen mode Exit fullscreen mode

But, since our index is not the default (i.e. not a RangeIndex), if we use another Series of the same length, it will not work. It needs a matching index, and the default index created below doesn’t match our character index. So we have to specify the index to match.

>>> try:
...     s[pd.Series([True, True, False, False, True])]
... except Exception as ie:
...     print(ie)

Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
>>> s[pd.Series([True, True, False, False, True], index=list("abcde"))]
a 0
b 1
e 4
dtype: int64
Enter fullscreen mode Exit fullscreen mode

But instead of making a new Series, we’ll just base all of our expressions on our source data Series or DataFrame, then they’ll share an index.

>>> # just like before with NumPy
>>> s[(s < 2) | (s > 3)]
a 0
b 1
e 4
dtype: int64
Enter fullscreen mode Exit fullscreen mode

Make note that you need to surround each expression with parentheses because the Python parser will apply the boolean operators incorrectly. For the example above, it would apply it as s < (2 | s ) < 3. You’ll realize you’re forgetting parentheses when you get complaints about the boolean operators being applied to a series. See?

>>> s[s < 2 | s > 3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mcw/.pyenv/versions/pandas/lib/python3.8/site-packages/pandas/core/generic.py", line 1329, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Enter fullscreen mode Exit fullscreen mode

DataFrame

We can also do boolean indexing on DataFrames. A popular way to create the boolean vector is to use one or more of the columns of the DataFrame.

>>> df = pd.DataFrame({'x': np.arange(5), 'y': np.arange(5, 10)})
>>> df[df['x'] < 3]
   x y
0 0 5
1 1 6
2 2 7
Enter fullscreen mode Exit fullscreen mode

You can also supply multiple conditions, just like before with Series. (Remember those parentheses!)

>>> df[(df['x'] < 3) & (df['y'] > 5)]
   x y
1 1 6
2 2 7
Enter fullscreen mode Exit fullscreen mode

.loc, .iloc, and []

If you remember from previous posts, pandas has three primary ways to index the containers. The indexing operator ([]) is sort of a hybrid of using the index labels or location based offsets. .loc is meant for using the index labels, .iloc is for integer based indexing. The good news is that all of them accept boolean arrays, and return subsets of the underlying container.

>>> mask = (df['x'] < 3) & (df['y'] > 5)
>>> mask
0 False
1 True
2 True
3 False
4 False
dtype: bool
>>> df[mask]
   x y
1 1 6
2 2 7
>>> df.loc[mask]
   x y
1 1 6
2 2 7

Enter fullscreen mode Exit fullscreen mode

Note that .iloc is a little different than the others, if you pass in this mask, you’ll get an exception.

>>> try:
...     df.iloc[mask]
... except NotImplementedError as nie:
...     print(nie)
...
iLocation based boolean indexing on an integer type is not available
Enter fullscreen mode Exit fullscreen mode

This is by design, .iloc is only intended to take positional arguments. However, our mask is a Series with an index, so it is rejected. You can still pass in a boolean vector, but just pass in the vector itself without the index.

>>> df.iloc[mask.to_numpy()]
   x y
1 1 6
2 2 7
>>> # or
>>> df.iloc[mask.values]
   x y
1 1 6
2 2 7
Enter fullscreen mode Exit fullscreen mode

Examples!

I think one of the most helpful things when thinking about boolean indexing is to see some examples. You are only limited by what you can express by grouping together any combination of expressions on your data. You do this by carefully grouping your boolean expressions and using parentheses wisely. It can also help to break the problem into pieces as you work on it.

In this series I’ve been grabbing data from the Chicago Data Portal. This time, I thought the list of lobbyists might be interesting. Due to the need for lobbyists to re-register every year, there’s some repeating data. Let’s take a look.

>>> # you should be able to grab this dataset as an unauthenticated user, but you can be rate limited
>>> lbys = pd.read_json("https://data.cityofchicago.org/resource/tq3e-t5yq.json")
>>> lbys.dtypes
year int64
lobbyist_id int64
salutation object
first_name object
last_name object
address_1 object
city object
state object
zip object
country object
email object
phone object
fax object
employer_id int64
employer_name object
created_date object
middle_initial object
address_2 object
suffix object
dtype: object
>>> lbys['created_date'] = pd.to_datetime(lbys['created_date'])
>>> # I'll drop the personally identifiable data, just to be nice
>>> lbys = lbys.drop(['email', 'phone', 'fax', 'last_name'], axis=1)
>>>
>>> lbys.head(3)
   year lobbyist_id salutation first_name address_1 ... employer_name created_date middle_initial address_2 suffix
0 2020 18883 MR. PERICLES 6969 W WABANSIA AVE ... THE PERICLES ORGANIZATION 2020-08-07 NaN NaN NaN
1 2019 18883 NaN PERICLES 6969 W WABANSIA AVE ... THE PERICLES ORGANIZATION 2020-01-21 NaN NaN NaN
2 2018 18883 NaN PERICLES 6969 W WABANSIA AVE ... THE PERICLES ORGANIZATION 2018-12-12 NaN NaN NaN

[3 rows x 15 columns]
Enter fullscreen mode Exit fullscreen mode

In terms of examples, there’s not really too much complexity to deal with, but here’s a few to give you an idea what boolean indexing looks like.

>>> lbys[lbys['year'] == 2020] # all lobbyists registered in 2020
     year lobbyist_id salutation first_name ... created_date middle_initial address_2 suffix
0 2020 18883 MR. PERICLES ... 2020-08-07 NaN NaN NaN
4 2020 17521 MR. STEVE ... 2020-01-15 NaN NaN NaN
.. ... ... ... ... ... ... ... ... ...
998 2020 24740 MRS. CHRISONIA ... 2020-01-16 D. SUITE 1700 NaN
999 2020 15081 MS. LIZ ... 2020-05-29 NaN STE. 900 NaN

[125 rows x 15 columns]
>>> lbys[(lbys['year'] == 2020) & (lbys['city'] == 'CHICAGO')] # lobbyists registered in 2020 from Chicago
     year lobbyist_id salutation first_name ... created_date middle_initial address_2 suffix
0 2020 18883 MR. PERICLES ... 2020-08-07 NaN NaN NaN
4 2020 17521 MR. STEVE ... 2020-01-15 NaN NaN NaN
.. ... ... ... ... ... ... ... ... ...
998 2020 24740 MRS. CHRISONIA ... 2020-01-16 D. SUITE 1700 NaN
999 2020 15081 MS. LIZ ... 2020-05-29 NaN STE. 900 NaN

[76 rows x 15 columns]
>>> # let's get the most popular employer for 2020
>>> pop_emp_id = lbys[lbys['year'] == 2020].groupby('employer_id').count().sort_values(by='lobbyist_id', ascending=False).index[0]
>>> # who works for them?
>>> lbys[(lbys['employer_id'] == pop_emp_id) & (lbys['year'] == 2020)]
     year lobbyist_id salutation first_name address_1 ... employer_name created_date middle_initial address_2 suffix
619 2020 24484 NaN BRIAN 1330 W FULTON ST ... STERLING BAY, LLC AND ITS AFFILIATES 2020-01-15 NaN STE 800 NaN
654 2020 24106 NaN HOWARD 1330 W. FULTON ST ... STERLING BAY, LLC AND ITS AFFILIATES 2020-01-17 NaN SUITE 800 NaN
982 2020 23828 MS. SHELLY 1330 W. FULTON ST ... STERLING BAY, LLC AND ITS AFFILIATES 2020-01-15 NaN SUITE 800 NaN

[3 rows x 15 columns]
Enter fullscreen mode Exit fullscreen mode

If we only want to deal with 2020 data, we can just make a new smaller DataFrame with that data.

lbys = lbys[lbys['year'] == 2020]
Enter fullscreen mode Exit fullscreen mode

Boolean indexing with isin

A helpful method that is often paired with boolean indexing is Series.isin. It returns a boolean vector with all rows that match one of the elements in the arguments.

>>> lbys['state'].tail()
995 MO
996 MO
997 IL
998 IL
999 IL
Name: state, dtype: object
>>> lbys['state'].isin(['IL']).tail()
995 False
996 False
997 True
998 True
999 True
Name: state, dtype: bool
>>> lbys[lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN'])] # lobbyists from bordering states
     year lobbyist_id salutation first_name ... created_date middle_initial address_2 suffix
164 2020 24879 NaN DAN ... 2020-06-15 NaN NaN NaN
786 2012 4644 MR. BRIAN ... 2012-05-25 NaN NaN NaN
788 2019 15543 MR. LORENZO ... 2019-01-10 NaN NaN NaN
789 2018 15543 MR. LORENZO ... 2018-01-17 NaN NaN NaN
790 2017 15543 MR. LORENZO ... 2017-01-19 NaN NaN NaN
791 2020 15543 MR. LORENZO ... 2020-01-10 NaN NaN NaN
792 2016 15543 MR. LORENZO ... 2016-10-13 NaN NaN NaN
994 2016 10222 MS. MARILYN ... 2016-01-13 NaN MO1-800-14-40 NaN
995 2015 10222 MS. MARILYN ... 2015-01-14 NaN MO1-800-14-40 NaN
996 2014 10222 MS. MARILYN ... 2014-05-06 NaN MO1-800-14-40 NaN

[10 rows x 15 columns]
Enter fullscreen mode Exit fullscreen mode

I’ll wrap it up with a slightly more complicated expression.

>>> lbys[
... ~(lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN'])) & # lobbyists NOT from bordering states
... (lbys['state'] != 'IL') & # and NOT from IL
... (lbys['created_date'] >= '2020-07-01') # created in the last half of the year
... ]
     year lobbyist_id salutation first_name ... created_date middle_initial address_2 suffix
26 2021 24967 NaN JON ... 2020-12-27 NaN SUITE 104 NaN
27 2020 24967 NaN JON ... 2020-10-21 NaN SUITE 104 NaN
34 2021 24901 NaN JOSEPH ... 2020-12-20 NaN NaN NaN
160 2021 23782 MR. ALAN ... 2021-01-01 P SUITE 404-352 NaN
227 2020 24909 MS. LAKEITHA ... 2020-07-14 NaN NaN NaN
332 2021 22341 NaN CAROLINE ... 2020-12-23 NaN NaN NaN
511 2021 5361 NaN JAY ... 2020-12-27 NaN SUITE 450 NaN
609 2021 4561 NaN BRET ... 2020-12-18 NaN NaN NaN
660 2021 13801 MR. JOHN ... 2020-12-27 NaN MC 482-C30-C76 NaN
700 2020 24925 NaN TAMI ... 2020-08-06 NaN NaN NaN
862 2020 24969 MR. ALEX ... 2020-08-31 NaN NaN NaN
951 2021 6164 MS. GABRIELLE ... 2020-12-21 NaN 73RD FLOOR NaN

[12 rows x 15 columns]
Enter fullscreen mode Exit fullscreen mode

I also find it helpful to sometimes create a variable for storing the mask. So for the above example, instead of having to parse the entire expression when reading the code, it can be helpful to have expressive variable names for the parts of the indexing expression.

>>> non_bordering = ~(lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN']))
>>> non_illinois = (lbys['state'] != 'IL')
>>>
>>> # more readable maybe?
>>> lbys[non_bordering & non_illinois & (lbys['created_date'] >= '2020-07-01')]
Enter fullscreen mode Exit fullscreen mode

Often when building a complex expression, it can be helpful to build it in pieces, so assigning parts of the mask to variables can make a much more complicated expression easier to read, at the cost of extra variables to deal with. In general, I use variables in the mask if I have to reuse them multiple times, but if only used once, I do the entire expression in place.

In summary, boolean indexing is really quite simple, but powerful. I’ll be looking at a few other ways to select data in pandas in upcoming posts.

Discussion (0)