Mark Saward

Posted on • Originally published at manse.cloud

SQL query builders

My latest attitude towards the database has been to respect it more, by taking advantage of more of the powerful features it has available. That means taking advantage of features specific to my go-to database, PostgreSQL, and locking myself in. I have a pending blog post about my thoughts on that, but for now I wanted to make a quick note on something that seems so simple and obvious, yet I'd never considered before.

Quite often in our application we'll want to customise our WHERE clauses for filtering results. A common example of this is implementing filters for a table of results. To do this, we might loop over a list of filters that come to us from the client (e.g., query parameters in a GET request), and include only the filters that have a value set. If we have one filter requested, then we should add a 'WHERE' clause, and if we have more than one, we should join all the options with 'AND'. It's a small thing to do, but in building up our SQL by joining strings together, we need to be a little careful to construct it correctly.

Alternatively, when we're using an ORM or query builder (for Go, I like to use squirrel), adding the correct 'WHERE' and 'AND' terms is handled automatically. A small convenience that we could do ourselves, but then we'll find ourselves building our own in-house micro-query-builder.
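For a sense of what that convenience looks like, here's a minimal sketch using squirrel (the table and column names match the example further down, and the hard-coded values stand in for client-supplied filters):

package main

import (
    "fmt"

    sq "github.com/Masterminds/squirrel"
)

func main() {
    // Pretend these came from query parameters; an empty string means the
    // client didn't ask to filter on that column.
    name := "Homer"
    city := ""

    // Start from the base query; squirrel adds the WHERE keyword and joins
    // multiple conditions with AND for us.
    q := sq.Select("*").From("person").PlaceholderFormat(sq.Dollar)

    if name != "" {
        q = q.Where(sq.Eq{"name": name})
    }
    if city != "" {
        q = q.Where(sq.Eq{"city": city})
    }

    query, args, err := q.ToSql()
    if err != nil {
        panic(err)
    }

    fmt.Println(query) // SELECT * FROM person WHERE name = $1
    fmt.Println(args)  // [Homer]
}

Squirrel takes care of the WHERE/AND bookkeeping, but, as discussed below, the trade-off is that the final SQL is never written out in one place.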

The nice thing about not using a query builder or ORM is that you can work directly with the SQL, a great advantage when you're familiar with it and know just what you want to write. You don't run into situations where you think, "I know what the SQL query needs to be, but I don't know how to get this ORM/query builder to write that SQL". The problem is that you end up composing your SQL queries using your own equivalent of a query builder anyway, and then the details of your queries are split all over the place. You can't easily look in one place and see the query as a whole to understand it. Instead, it's scattered in piecemeal strings throughout your function or wider code.

I've been playing around with Rust some more, and looking at sqlx as an option that lets me stay close to the SQL, with some additional compile time checks. sqlx sends queries to the database at compile time to check that they are valid, which sounds great -- compile time checks for validity save time.

This, of course, won't work when you're constructing your clauses from parts and therefore don't have the whole query written anywhere at compile time that can be checked against the database. For example (written in Go), you might have:

package main

import (
    "fmt"
    "strings"
)

var filters = []struct {
    Name  string
    Value interface{}
}{
    {
        Name:  "name",
        Value: "Homer",
    },
    {
        Name:  "city",
        Value: "Springfield",
    },
}

func main() {
    query := "SELECT * FROM person"

    var where []string
    var args []interface{}
    index := 0

    // Make sure 'filters' are not provided by the client, or are checked
    // against a list of allowed values, so that you don't give an opening
    // for a SQL injection attack:
    for _, filter := range filters {
        index++

        where = append(where, fmt.Sprintf("%s = $%d", filter.Name, index))
        args = append(args, filter.Value)
    }

    if len(where) > 0 {
        query = query + " WHERE " + strings.Join(where, " AND ")
    }

    fmt.Println(query)

    // db.Query(query, args...)
    _ = args
}
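As an aside on that injection warning: if the filter names do come from the client, one hedge is to map them through an allowlist of known columns before they ever reach the SQL string. A rough sketch (the allowed columns here are just the ones from the example above):

package main

import "fmt"

// allowedFilters is a hypothetical allowlist: only these names are ever
// interpolated into the SQL string; anything else is rejected up front.
var allowedFilters = map[string]bool{
    "name": true,
    "city": true,
}

func main() {
    requested := []string{"city", "name; DROP TABLE person;--"}

    for _, name := range requested {
        if allowedFilters[name] {
            fmt.Printf("ok to filter on %q\n", name)
        } else {
            fmt.Printf("rejecting unknown filter %q\n", name)
        }
    }
}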

We've built our own homebrewed SQL-building solution, and it doesn't even deal with ranges -- for example, finding everyone with a date of birth between two dates (or just after a particular date). Those have a different syntax that also needs to be handled, complicating our 'quick' homebrewed query builder. Note also that the full query is scattered across parts throughout the function. In this case, the query it generates would be:

SELECT * FROM person WHERE name = $1 AND city = $2

But it's very hard to see that at a glance, and it can't be checked in advance. How do we get compile time checks for these runtime-constructed queries? I've been thinking about options where we inspect the database to understand what tables and fields exist, and inside Rust use a macro with simple tags in our query string that the compiler checks against the schema. We could use this to check that we're at least using correct table and field names.

However, I saw another solution for this in an issue for the sqlx project. And that solution is just to have the query itself handle the optional cases! Instead of including them only if they're set, you always list them in the query inside your code, so you don't need to do any kind of runtime building up of the conditions. If the filter is not provided then the query won't filter on that value. Simple, but effective, and should still let sqlx check the query at compile time! And the query is fully expressed, so it's easy to read all in one place, and therefore easy to understand.

Here's an updated version of the above using this method:

package main

import (
    "database/sql"
)

type filter struct {
    Name      sql.NullString
    City      sql.NullString
    DOBBefore sql.NullTime
    DOBAfter  sql.NullTime
}

func main() {
    var f filter
    f.Name = sql.NullString{String: "Homer", Valid: true}
    f.City = sql.NullString{String: "Springfield", Valid: true}

    query := `
SELECT * FROM person
WHERE 
    ($1 IS NULL OR name = $1) AND
    ($2 IS NULL OR city = $2) AND
    ($3 IS NULL OR dob < $3) AND
    ($4 IS NULL OR dob > $4)
`
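    // Note: db below stands in for an already-open *sql.DB connection.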

    rows, err := db.Query(query, f.Name, f.City, f.DOBBefore, f.DOBAfter)

    [...]
}

This is both shorter and much clearer about what's going on. Moreover, that query can be checked against the database at compile time, and is easily portable. It even handles date-range filters without needing to enhance our quickly built in-house query builder to consider such filters. Also, since we've hard-coded our filter options, we don't run the risk of accidentally allowing a SQL injection attack because we forgot to properly check our filter names.

It's such a simple and obvious way to handle these things, but it never occurred to me before. I'll be interested to see if this kind of solution can work in most situations, or if it will have common limitations I haven't yet considered.

Top comments (2)

Vincent Milum Jr

One point against that final query you show there. Using "OR" anywhere in a query generally destroys performance entirely, usually resulting in a full table scan. If you're dealing with a couple hundred or maybe a thousand rows, that's fine. In my line of work, a billion rows per table is quite common, and that would literally take hours to execute the query. Not entirely ideal on the DBA side, just to save some cleanliness on the developer's side.

Mark Saward • Edited

Edit: I should have used an earlier date for my tests, and maybe a second clause. A quick test with a second where clause (not included below) seems to show exactly the same plan is used both with and without the OR parts.

The largest tables I deal with are less than 1 million rows, so my pressures are definitely different to yours.

I did some tests in PostgreSQL, and I can't see any difference in the query plan that the database uses, whether I include the ORs or not. I'm far from a database expert, so please let me know if I've misunderstood something.

There are two cases here: either a date is given to filter on, or it isn't. And there are two approaches for each case -- using an OR, and not using one.

postgres=# CREATE TABLE town (
postgres(#   id SERIAL UNIQUE NOT NULL,
postgres(#   name TEXT NOT NULL,
postgres(#   founded date NOT NULL
postgres(# );
CREATE TABLE
postgres=# insert into town (
postgres(#     name, founded
postgres(# )
postgres-# select
postgres-#     md5(random()::text),
postgres-#     ('01-01-0001'::date + (random() * interval '2000 year'))::date
postgres-# from generate_series(1, 10000000) s(i)
postgres-# ;
INSERT 0 10000000
postgres=# PREPARE dateplan_or(date) AS
postgres-#   SELECT * FROM town WHERE ($1 IS NULL OR founded >= $1);
PREPARE
postgres=#
postgres=# PREPARE dateplan(date) AS
postgres-#   SELECT * FROM town WHERE founded >= $1;
PREPARE
postgres=#
postgres=# PREPARE dateplan_all(date) AS
postgres-#   SELECT * FROM town;
PREPARE

dateplan and dateplan_all are used for the cases where the date is filtered on or not, respectively, with no OR. dateplan_or includes the OR clause and is used whether we filter on the date or not.

The two plans that are actually used when filtering on date, using dateplan_or and dateplan:

postgres=# EXPLAIN ANALYZE EXECUTE dateplan_or('2000-01-01'::date);
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..147082.16 rows=5408 width=41) (actual time=21.513..296.753 rows=5029 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on town  (cost=0.00..145541.36 rows=2253 width=41) (actual time=7.801..227.530 rows=1676 loops=3)
         Filter: (founded >= '2000-01-01'::date)
         Rows Removed by Filter: 3331657
 Planning Time: 1.879 ms
 JIT:
   Functions: 6
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 1.934 ms, Inlining 0.000 ms, Optimization 13.887 ms, Emission 7.058 ms, Total 22.880 ms
 Execution Time: 298.389 ms
(12 rows)

postgres=# EXPLAIN ANALYZE EXECUTE dateplan('2000-01-01'::date);
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..147082.16 rows=5408 width=41) (actual time=6.009..252.737 rows=5029 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on town  (cost=0.00..145541.36 rows=2253 width=41) (actual time=3.297..211.336 rows=1676 loops=3)
         Filter: (founded >= '2000-01-01'::date)
         Rows Removed by Filter: 3331657
 Planning Time: 0.529 ms
 JIT:
   Functions: 6
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 1.504 ms, Inlining 0.000 ms, Optimization 0.825 ms, Emission 7.730 ms, Total 10.059 ms
 Execution Time: 253.955 ms
(12 rows)

And the two plans for dateplan_or and dateplan_all when no date is given:

postgres=# EXPLAIN ANALYZE EXECUTE dateplan_or(NULL);
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Seq Scan on town  (cost=0.00..193458.06 rows=10000006 width=41) (actual time=0.093..593.871 rows=10000000 loops=1)
 Planning Time: 0.234 ms
 Execution Time: 932.317 ms
(3 rows)

postgres=# EXPLAIN ANALYZE EXECUTE dateplan_all(NULL);
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Seq Scan on town  (cost=0.00..193458.06 rows=10000006 width=41) (actual time=0.051..599.485 rows=10000000 loops=1)
 Planning Time: 0.106 ms
 Execution Time: 936.658 ms
(3 rows)

If I create an index, it gets used by both queries when filtering by date:

postgres=# CREATE INDEX founded_idx ON town (founded);
CREATE INDEX
postgres=# EXPLAIN ANALYZE EXECUTE dateplan_or('2000-01-01'::date);
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on town  (cost=66.78..17575.03 rows=5464 width=41) (actual time=3.113..19.385 rows=5029 loops=1)
   Recheck Cond: (founded >= '2000-01-01'::date)
   Heap Blocks: exact=4891
   ->  Bitmap Index Scan on founded_idx  (cost=0.00..65.42 rows=5464 width=0) (actual time=1.499..1.500 rows=5029 loops=1)
         Index Cond: (founded >= '2000-01-01'::date)
 Planning Time: 0.727 ms
 Execution Time: 19.813 ms
(7 rows)

postgres=# EXPLAIN ANALYZE EXECUTE dateplan('2000-01-01'::date);
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on town  (cost=66.78..17575.03 rows=5464 width=41) (actual time=1.522..14.834 rows=5029 loops=1)
   Recheck Cond: (founded >= '2000-01-01'::date)
   Heap Blocks: exact=4891
   ->  Bitmap Index Scan on founded_idx  (cost=0.00..65.42 rows=5464 width=0) (actual time=0.752..0.753 rows=5029 loops=1)
         Index Cond: (founded >= '2000-01-01'::date)
 Planning Time: 0.295 ms
 Execution Time: 15.169 ms
(7 rows)