If you’ve spent any time in pandas at all, you’ve seen
SettingWithCopyWarning. If not, you will soon!
Just like any warning, it’s wise not to ignore it, since you get it for a reason: it’s a sign that you’re probably doing something wrong. In my case, I usually get this warning when I’m knee-deep in some analysis and don’t want to spend too much time figuring out how to fix it.
I’m going to cover a few typical examples of when this warning shows up, why it shows up, and how to quickly fix the underlying issue.
First, let’s make an example
DataFrame. I’m using a handy Python package called Faker to create some test data. You may need to install it first, with
%pip install Faker  # notebook
pip install Faker   # command line
As a quick aside, Faker is a great way to build test data for unit tests, test databases, or examples. It generates real-looking data that is not personally identifiable, since it’s all fake, but it’s based on rules that generate data combinations you’ll likely encounter in real life.
>>> import datetime
>>> import pandas as pd
>>> import numpy as np
>>> from faker import Faker
>>> fake = Faker()
>>> df = pd.DataFrame([
...     [fake.first_name(), fake.last_name(),
...      fake.date_of_birth(), fake.date_this_year(),
...      fake.city(), fake.state_abbr(), fake.postalcode()]
...     for _ in range(20)],
...     columns=['first_name', 'last_name', 'dob', 'lastupdate', 'city', 'state', 'zip'])
>>> df.head(3)
  first_name last_name         dob  lastupdate          city state    zip
0       Evan   Daniels  1943-05-27  2021-01-11    North Erin    AZ  27597
1  Christine   Herrera  2019-04-11  2021-01-29     Ellenview    AL  28989
2   Michelle    Warren  2015-05-29  2021-01-11  Mcknighttown    VA  55551
First, let’s just review the ways we can set data in a
DataFrame, using the
loc and
iloc indexers. These are for label-based and integer-offset-based indexing, respectively. (See this article for more detail on the two methods.)
The first argument in the indexer is for the row, the second is for the column (or columns), and if we assign to this expression, we will update the underlying
DataFrame.
Note that the index here is just a
RangeIndex, so the labels are numbers. Because of that, even though I’m passing in int values to
loc, this is looking up by label, not relative index.
>>> df.head(1)['zip']
0    27597
Name: zip, dtype: object
>>> df.loc[0, 'zip'] = '60601'
>>> df.head(1)['zip']
0    60601
Name: zip, dtype: object
>>> df.loc[0, ['city', 'state']] = ['Chicago', 'IL']
>>> df.head(1)
  first_name last_name         dob  lastupdate     city state    zip
0       Evan   Daniels  1943-05-27  2021-01-11  Chicago    IL  60601
>>> # Here's an example of an iloc update.
>>> df.iloc[0, 0] = 'Josh'
>>> df.head(1)
  first_name last_name         dob  lastupdate     city state    zip
0       Josh   Daniels  1943-05-27  2021-01-11  Chicago    IL  60601
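With a default RangeIndex, labels and positions happen to coincide, which can hide the difference between the two indexers. Here is a minimal sketch that makes the difference visible. The `people` DataFrame and its index labels (10, 20, 30) are made up for illustration and are not part of the example data above.

```python
import pandas as pd

# Hypothetical data: the index labels deliberately differ from row positions.
people = pd.DataFrame(
    {
        "first_name": ["Evan", "Christine", "Michelle"],
        "zip": ["27597", "28989", "55551"],
    },
    index=[10, 20, 30],
)

# loc looks up by label: label 10 is the first row.
people.loc[10, "zip"] = "60601"

# iloc looks up by position: row position 2 is the row labeled 30,
# and column position 0 is first_name.
people.iloc[2, 0] = "Josh"

print(people)
```

Here `people.loc[0, "zip"]` would raise a `KeyError`, since there is no row labeled 0, while `people.iloc[0, 1]` would happily return the zip of the first row.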
Now, you can also do updates with the array indexing operator, but this can look very confusing because remember that on a
DataFrame, you are selecting columns first. I’d recommend not doing this for this reason alone, but as you’ll soon see, there are other issues that can arise.
>>> df["first_name"] = 'Joshy'
>>> df.head(1)
  first_name last_name         dob  lastupdate     city state    zip
0      Joshy   Daniels  1943-05-27  2021-01-11  Chicago    IL  60601
OK, now that we have updated our
DataFrame successfully, it’s time to see an example of where things can go wrong. For me, it’s very typical to select a subset of the original data to work with. For example, let’s say that we decide to only work with data where the person was born before 2000.
>>> dob_limit = datetime.date(2000, 1, 1)
>>> sub = df[df['dob'] < dob_limit]
>>> sub.shape
(16, 7)
>>> idx = sub.head(1).index  # save the location for update attempts below
>>> sub.head(1)
  first_name last_name         dob  lastupdate     city state    zip
0      Joshy   Daniels  1943-05-27  2021-01-11  Chicago    IL  60601
Let’s try to update the
lastupdate value for that row in
sub.
>>> sub.loc[idx, 'lastupdate'] = datetime.date.today()
/Users/mcw/.pyenv/versions/3.8.6/envs/pandas/lib/python3.8/site-packages/pandas/core/indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)
<ipython-input-14-5f1769c87aaf>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sub.loc[idx, 'lastupdate'] = datetime.date.today()
Boom! There it is, we are told we are trying to set values on a copy of a slice from a
DataFrame. What ended up happening here? Well,
sub was updated, but
df wasn’t, even though we had the warning.
>>> sub.loc[idx, 'lastupdate']
datetime.date(2021, 2, 4)
>>> df.loc[idx, 'lastupdate']
datetime.date(2021, 1, 11)
Pandas is warning you that you might not have done what you expected. When you created
sub, you ended up with a copy of the data in
df. When you updated the value, you’re warned that you only updated the copy, not the original.
There are two primary ways to address this, and which one you choose depends on what you are trying to accomplish in your code. The warning is telling you that you chose a path that could cause confusion or error down the road, and is pointing you toward using the best practices for updating data.
If your intention is to update your original data, you just need to update it directly. So instead of doing your update on
sub, do it on
df.
>>> df.loc[idx, 'lastupdate'] = datetime.date.today()
>>> df.loc[idx, 'lastupdate']
datetime.date(2021, 2, 4)
Now note that when you do this, since your view is a copy, it isn’t updated. If you want both
sub and
df to match, you need to either update both or recreate
sub after the update. Because of this, it’s important for you to pause and think any time you update a
DataFrame. Have you created views of this data that now need to be refreshed?
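To make the "stale subset" problem concrete, here is a small sketch. The `df_demo` and `sub_demo` names, the `birth_year` column, and the string dates are stand-ins invented for illustration; the pattern is the same as with the Faker data above.

```python
import pandas as pd

# Hypothetical stand-in for the original data.
df_demo = pd.DataFrame(
    {
        "birth_year": [1943, 2019, 1989],
        "lastupdate": ["2021-01-11", "2021-01-29", "2021-01-11"],
    }
)

mask = df_demo["birth_year"] < 2000
sub_demo = df_demo[mask]  # boolean selection gives us separate data

# Update the original directly.
df_demo.loc[mask, "lastupdate"] = "2021-02-04"

# The subset taken earlier still holds the old values.
stale = sub_demo["lastupdate"].tolist()

# Recreating the subset after the update brings it back in sync.
sub_demo = df_demo[mask]
fresh = sub_demo["lastupdate"].tolist()

print(stale, fresh)
```

Recreating the subset is usually the simplest option when the filter is cheap to run again.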
If your goal is to update the copy of the data only, to eliminate the warning, tell pandas you want that view to always be a copy.
>>> sub2 = df[df['dob'] < dob_limit].copy()
>>> sub2.loc[idx, 'lastupdate'] = datetime.date.today()
>>> sub2.loc[idx, 'lastupdate']
datetime.date(2021, 2, 4)
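As a quick sanity check that an explicit copy really is independent of its source, here is a minimal sketch with made-up data (`df_demo`, `sub_copy`, and the column values are hypothetical).

```python
import pandas as pd

# Hypothetical stand-in for the original data.
df_demo = pd.DataFrame(
    {
        "birth_year": [1943, 2019, 1989],
        "lastupdate": ["2021-01-11", "2021-01-29", "2021-01-11"],
    }
)

# An explicit copy: we are telling pandas this data stands on its own.
sub_copy = df_demo[df_demo["birth_year"] < 2000].copy()

# Update only the copy, by label.
sub_copy.loc[0, "lastupdate"] = "2021-02-04"

print(sub_copy.loc[0, "lastupdate"])  # the copy has the new value
print(df_demo.loc[0, "lastupdate"])   # the original keeps the old one
```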
A common situation is that an initial full-sized
DataFrame is narrowed down to a much smaller one by filtering the data. Maybe new columns are added as part of some calculations, and then as a final result, the original
DataFrame should be updated. One way to do that is to use the index to help you out.
>>> sub3 = df[df['dob'] < dob_limit].copy()  # we'll be updating this DataFrame
>>> sub3['manualupdate'] = datetime.date.today() - datetime.timedelta(days=10)  # you can modify this DataFrame
>>> sub3 = sub3.head(3)  # or even make it smaller
>>> sub3['manualupdate']
0    2021-01-25
1    2021-01-25
3    2021-01-25
Name: manualupdate, dtype: object
Now, we’ll use the fact that
sub3 shares an index with the original
df to update the data. We can update all matching rows of the
lastupdate column, for example.
>>> df.loc[sub3.index, 'lastupdate'] = sub3['manualupdate']
>>> df.loc[sub3.index]
  first_name  last_name         dob  lastupdate          city state    zip
0      Joshy    Daniels  1943-05-27  2021-01-25       Chicago    IL  60601
3     Vernon  Hernandez  1989-04-10  2021-01-25    South Mark    NE  05048
4       Mary      Munoz  1933-03-16  2021-01-25  Ewingborough    OK  31127
Now, you can see that those rows were updated from our smaller subset of data.
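Here is a compact, self-contained version of the same write-back pattern, which may make it easier to see which rows are touched. The data and the `df_demo` and `work` names are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the original data.
df_demo = pd.DataFrame(
    {
        "birth_year": [1943, 2019, 1989, 1933],
        "lastupdate": ["2021-01-11", "2021-01-29", "2021-01-11", "2021-01-11"],
    }
)

# Work on an explicit copy, add a column, then shrink it further.
work = df_demo[df_demo["birth_year"] < 2000].copy()
work["manualupdate"] = "2021-01-25"
work = work.head(2)  # keeps index labels 0 and 2

# The copy kept the original index labels, so they tell us exactly
# which rows of the original to write back to.
df_demo.loc[work.index, "lastupdate"] = work["manualupdate"]

print(df_demo)
```

Only the rows whose labels are in `work.index` change. The row labeled 3 passed the filter but was dropped by `head(2)`, so it keeps its old value.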
You also may encounter this warning when working with subsets of columns in a
DataFrame.
>>> df_d = df[['zip']]
>>> df_d.loc[idx, 'zip'] = "00313"  # SettingWithCopyWarning
A great way to suppress the warning here is to do a full slice with
loc in your initial selection. You can also use
copy.
>>> df_d = df.loc[:, ['zip']]
>>> df_d.loc[idx, 'zip'] = "00313"
Now you can read about this warning in many other places, and if you’ve come here through a search engine maybe you’ve already found them either confusing or not directly applicable to your situation. I took a slightly different approach above to show the situation where I usually see this warning. However, a more common reason new pandas users encounter this warning is when trying to update their
DataFrame using the array index operator (
[]).
>>> df[df['dob'] < dob_limit]['lastupdate'] = datetime.date.today()
file.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df['dob'] < dob_limit]['lastupdate'] = datetime.date.today()
The fix here is pretty straightforward: use
loc. Let’s give that a try.
>>> df.loc[df['dob'] < dob_limit, 'lastupdate'] = datetime.date.today() - datetime.timedelta(days=1)
>>> df.loc[df['dob'] < dob_limit].head(1)
  first_name last_name         dob  lastupdate     city state    zip
0      Joshy   Daniels  1943-05-27  2021-02-03  Chicago    IL  60601
That works. The warning here was telling us that our first update is (potentially) operating on a copy of our original data. I don’t think this is quite as obvious as our opening case because pandas has some complicated reasons for choosing to sometimes return a copy and sometimes return a view into the original data, and this may not seem obvious when the update is on one line. When it can detect that this is happening, it raises this warning.
This is called chained assignment. The assignment above with the warning is really doing this:
df.__getitem__(df.__getitem__('dob') < dob_limit).__setitem__('lastupdate', datetime.date.today())
When you use the array index operator, the
__getitem__ and
__setitem__ methods are invoked for getting and setting respectively. That first function call to
__getitem__ returns a copy of the data, and the code then attempts to set data on that copy, triggering the warning.
If we use
loc, though, it will be doing this, without returning a temporary view.
df.loc.__setitem__((df.__getitem__('dob') < dob_limit, 'lastupdate'), datetime.date.today())
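To see the two call patterns side by side, here is a minimal sketch on made-up data (`df_demo` and its values are hypothetical). The chained form is wrapped in `warnings.catch_warnings()` only to keep the output quiet, since the exact warning pandas emits depends on the version you have installed.

```python
import warnings

import pandas as pd

# Hypothetical stand-in for the original data.
df_demo = pd.DataFrame(
    {"birth_year": [1943, 2019], "lastupdate": ["old", "old"]}
)
mask = df_demo["birth_year"] < 2000

# Chained assignment: __getitem__ with a boolean mask hands back a
# temporary copy, and __setitem__ then modifies only that temporary.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df_demo[mask]["lastupdate"] = "chained"
after_chained = df_demo["lastupdate"].tolist()

# A single loc call: one __setitem__ on the original DataFrame.
df_demo.loc[mask, "lastupdate"] = "loc"
after_loc = df_demo["lastupdate"].tolist()

print(after_chained)  # the original was not touched
print(after_loc)      # the matching row was updated
```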
So whenever you see this warning, just look at your code and check two things. Did you try to update the data using
[]? If so, switch to
loc (or
iloc). If you’re doing that and it’s still complaining, it’s because your
DataFrame was created from another
DataFrame. Either make a full copy if you plan to update it, or update your original
DataFrame directly.
The post Views, Copies, and that annoying SettingWithCopyWarning appeared first on wrighters.io.