DEV Community

wrighter
wrighter

Posted on • Originally published at wrighters.io on

Selection in pandas using query

It’s time for a different way to do selection in pandas. The query method can provide a few benefits to your code, and offers potential speed improvements. Read on to find out more.

This is the sixth post in a series on indexing and selecting in pandas, and so far we have covered concepts about indexing and selecting data in both pandas Series and DataFrames using concepts like slicing, boolean indexing, and familiar python concepts like callables. This post will cover the details behind the query method and requires a little more background and explanation. In the end, you’ll see how it’s useful and how it can help speed up your calculations.

Before we dive in, if you are jumping in the middle and want to get caught up, here is the rest of the series:

Using query

In pandas, DataFrames have a query method that supports selection using an expression as a string. One way to think about this (and a natural way to think about it due to its name and basic functionality) is that it can sort of be like SQL. But I’m not sure that’s a great way to think about it, for a couple of reasons. First, there is a limit in language support (i.e. don’t count on writing queries with complex joins or group by expressions). Second, the purpose of this expression is not just to give you a simple query language for selecting data, but to speed up expressions.

That being said, using query does offer a number of advantages, with a few complications, which we will cover.

Let’s start with a few examples of what an expression in query might look like.

>>> import pandas as pd
>>> import numpy as np
>>> # create a sample dataframe of floating points around 0
>>> df = pd.DataFrame(np.random.random((5,5)) - .5, columns=list("abcde"))
>>> df
          a         b        c         d         e
0 -0.152062  0.320043 0.445840 -0.434280  0.404650
1  0.112074 -0.449177 0.190247 -0.471130  0.304824
2 -0.090032 -0.414783 0.388579  0.363154 -0.291303
3  0.223888 -0.202376 0.053828 -0.070127  0.005833
4  0.417009 -0.251136 0.268446  0.020803  0.265924
Enter fullscreen mode Exit fullscreen mode

query is still boolean indexing

If you look back on the post on boolean indexing, you see that to select rows in a DataFrame using any of the indexing methods, you can pass in a boolean vector the same size as the DataFrame index. So, if we want to select all rows in our DataFrame where the value in column b is negative, we pass in a boolean array created by applying the less than (<) operation to the column.

>>> df[df["b"] < 0]
          a         b        c         d         e
1  0.112074 -0.449177 0.190247 -0.471130  0.304824
2 -0.090032 -0.414783 0.388579  0.363154 -0.291303
3  0.223888 -0.202376 0.053828 -0.070127  0.005833
4  0.417009 -0.251136 0.268446  0.020803  0.265924
Enter fullscreen mode Exit fullscreen mode

The query method is similar. It takes an expression, which is evaluated in the context of the DataFrame, and that expression returns a boolean value used for selection. So we can write the same selection like this:

>>> df.query("b < 0")
          a         b        c         d         e
1  0.112074 -0.449177 0.190247 -0.471130  0.304824
2 -0.090032 -0.414783 0.388579  0.363154 -0.291303
3  0.223888 -0.202376 0.053828 -0.070127  0.005833
4  0.417009 -0.251136 0.268446  0.020803  0.265924
Enter fullscreen mode Exit fullscreen mode

More complex expressions can be expressed in a simpler syntax. For example, to select rows where column b is less than column c and column c is less than column e, we need to use a fair amount of parentheses and repeat our df variable in order to have it parse properly.

>>> df[(df["b"] < df["c"]) & (df["c"] < df["e"])]
         a         b        c        d        e
1 0.112074 -0.449177 0.190247 -0.47113 0.304824
Enter fullscreen mode Exit fullscreen mode

Using query, this can be written in a much more elegant way. For combining and modifying expressions, you can use & and | and ~, or and and or and not.

>>> df.query("(b < c) & (c < e)") # or
>>> df.query("(b < c) and (c < e)")
         a         b        c        d        e
1 0.112074 -0.449177 0.190247 -0.47113 0.304824
Enter fullscreen mode Exit fullscreen mode

Or even better

>>> df.query("b < c < e")
         a         b        c        d        e
1 0.112074 -0.449177 0.190247 -0.47113 0.304824
Enter fullscreen mode Exit fullscreen mode

You can also use your index in the query, by using index, or you can rename your index and use that.

>>> df.query("index > 2")
>>> df.index.name = 'idx'
>>> df.query("idx > 2")
            a      b        c         d        e
idx
3 0.223888 -0.202376 0.053828 -0.070127 0.005833
4 0.417009 -0.251136 0.268446  0.020803 0.265924
Enter fullscreen mode Exit fullscreen mode

Using variables in your query expression

You can also use variables that are in scope inside your expression by prefixing them with an @.

>>> limit = 0.1
>>> df.query("b > @limit")
            a      b       c        d       e
idx
0 -0.152062 0.320043 0.44584 -0.43428 0.40465
Enter fullscreen mode Exit fullscreen mode

If your columns are not valid Python variable names, you’ll have to surround them with backticks. (Note you’ll need pandas 0.25 or higher for this functionality, and 1.0+ for expanded support).

>>> df['mean value'] = df.mean(axis=1)
>>> df.query('`mean value` > d')
          a         b        c         d        e  mean value
idx
0 -0.152062  0.320043 0.445840 -0.434280 0.404650  0.116838
1  0.112074 -0.449177 0.190247 -0.471130 0.304824 -0.062632
3  0.223888 -0.202376 0.053828 -0.070127 0.005833  0.002209
4  0.417009 -0.251136 0.268446  0.020803 0.265924  0.144209
Enter fullscreen mode Exit fullscreen mode

The in operator is supported in query, so you can use that as well, either using a variable, or existing columns, or as literals. The === and !=operators work like in and not in for lists.

>>> df['name'] = ['x', 'y', 'z', 'p', 'q']
>>> search = ['x', 'y']
>>> df.query("name in ['x', 'y']")
# or
>>> df.query('name in @search')
# or
>>> df.query('name == @search')
# or
>>> df.query("name == ['x', 'y']")
            a      b      c       d      e mean value name
idx
0 -0.152062  0.320043 0.445840 -0.43428 0.404650  0.116838 x
1  0.112074 -0.449177 0.190247 -0.47113 0.304824 -0.062632 y
Enter fullscreen mode Exit fullscreen mode

What about updates?

This expression is handy, but what about updating your DataFrame? I know we’ve mostly talked about selecting data, but it turns out that you can also update or add columns by using expressions, but with the eval method. Note that eval returns a copy of the modified data, so you need to assign back to your variable or use the inplace argument to retain the changes.

>>> df.eval('x = a * b')['x']
idx
0   -0.048666
1   -0.050341
2    0.037344
3   -0.045310
4   -0.104726
Name: x, dtype: float64
Enter fullscreen mode Exit fullscreen mode

Note you can pass in multiple expressions to be evaluated in one go.

df.eval('''
        x = a * b
        y = c * d
        ''')[['x', 'y']]
            x         y
idx
0   -0.048666 -0.193619
1   -0.050341 -0.089631
2    0.037344  0.141114
3   -0.045310 -0.003775
4   -0.104726  0.005584
Enter fullscreen mode Exit fullscreen mode

What about updates with multiple values?

It turns out there is a also a higher level pandas eval function available for doing more complicated updates involving multiple DataFrames or Series. I won’t get it into detail here, but you can use it for the same sorts of expressions as above.

>>> s = pd.Series(np.random.random(5))
>>> pd.eval('g = df.a + s', target=df)['g']
idx
0    0.661418
1    0.814406
2    0.663946
3    0.435111
4    1.012334
Name: g, dtype: float64
Enter fullscreen mode Exit fullscreen mode

One other nice use case

One more nice advantage of using query (and eval) is that you may have a situation where you have multiple DataFrames in your code that share column names where you’d like to use the same expression on them. If you didn’t use query, you’d have to pass these into a function so you refer to these in local python expressions. So as these expressions get more complex, query may make sense for code clarity.

df2 = pd.DataFrame(np.random.random((5,5)) - .5, columns=list("abcde"))
# without query, need to build expression in python
def find_d_gt_e(d):
    return d[d['d'] > d['e']]
find_d_gt_e(df)
find_d_gt_e(df2)
expr = "d > e" # same expression, re-usable anywhere these columns exist
df.query(expr)
df2.query(expr);
Enter fullscreen mode Exit fullscreen mode

But what’s the point?

So there are a few advantages to the mini-langage used by query and eval, resulting in cleaner looking code. But there are still some disadvantages to using a differing query language. First, when passing in a string, there’s no syntax checking being done by your editor. You may find this annoying or it may make you slightly less efficient. Also, you will need to learn a new set of rules for this mini language. The numexpr docs have a concise summary, but pandas adds some more complexity you’ll need to consider.

It’s mainly about speed

What is not entirely obvious here is that under the hood you can install a nice library called numexpr (docs, src) that exists to make calculations with large NumPy (and pandas) objects potentially much faster. When you use query or eval, this expression is passed into numexpr and optimized using its bag of tricks. Expected performance improvement can be between .95x and up to 20x, with average performance around 3-4x for typical use cases. You can read details in the docs, but essentially numexpr takes vectorized operations and makes them work in chunks that optimize for cache and CPU branch prediction. If your arrays are really large, your cache will not be hit as often. If you break your large arrays into very small pieces, your CPU won’t be as efficient.

Make sure you install it in your environment first if it’s not there.

%pip install numexpr # in iPython or Jupyter notebooks
pip install numexpr # in your shell
Enter fullscreen mode Exit fullscreen mode

Here’s a simple NumPy example from the numexpr docs that shows what the improvement can look like.

>>> import numexpr as ne
>>> a = np.random.rand(int(1e6))
>>> b = np.random.rand(int(1e6))
>>> r1 = %timeit -o 2 * a + 3 * b # assumes you're in iPython or Jupyter notebook
>>> r2 = %timeit -o ne.evaluate("2 * a + 3 * b")
5.57 ms ± 313 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.65 ms ± 31.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> print(f"{r1.average / r2.average:.2f}x faster")
3.38x faster
Enter fullscreen mode Exit fullscreen mode

You can see much more detail in the numexpr docs if you’re interested, but for a pandas specific example, let’s see what the difference is when using numexpr and not for some DataFrames in the size range where we can start to expect an improvement (over 200k rows).

>>> sd = pd.DataFrame(np.random.random(int(3e7)).reshape(int(1e7), 3))
>>> sd2 = pd.DataFrame(np.random.random(int(3e7)).reshape(int(1e7), 3))
>>> r1 = %timeit -o 3 * sd + 4 * sd2
174 ms ± 23.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> r2 = %timeit -o pd.eval('3 * sd + 4 * sd2')
104 ms ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> print(f"{r1.average / r2.average: .2f}x faster")
 1.66x faster
Enter fullscreen mode Exit fullscreen mode

Depending on what you are doing, some operations will be sped up even more.

>>> r1 = %timeit -o 3 * sd + sd2 ** 4
286 ms ± 56.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> r2 = %timeit -o pd.eval('3 * sd + sd2 ** 4')
101 ms ± 6.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> print(f"{r1.average / r2.average: .2f}x faster")
 2.84x faster
Enter fullscreen mode Exit fullscreen mode

Wrapping up

When you’re working with initial exploration of data, or smaller data sets, using query and eval will probably not be your go-to options for data selection. But you should consider this method for speeding up code with large data sets or in speed critical areas, it can make a big difference. It’s also useful for cleaning up code with complicated selection criteria.

I also think this is a good place to wrap up the series on selecting and indexing in pandas. Now that we’ve covered the base, there are many more topics worth looking at.

The post Selection in pandas using query appeared first on wrighters.io.

Top comments (0)