re: How do I solve this impossible SQL query?


Hi @andy, I think it can be simplified.

First things first: a few things in the query look odd at first glance.

There's a FULL JOIN between articles and collections, but I don't think the query is interested in collections that have no corresponding articles (also because I don't think such rows can exist in the DB right now). A full join (also called a full outer join) returns all rows of the first table and all rows of the second table, using NULL to fill the gaps on either side.
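
A toy illustration of the difference (table contents invented):

```sql
-- Suppose articles has one row in collection 1 and one with collection_id NULL,
-- and collections has rows with id 1 and 2.
SELECT articles.id, collections.slug
FROM articles
FULL JOIN collections ON collections.id = articles.collection_id;
-- row for the article in collection 1          -> both columns filled
-- row for the article with no collection       -> slug is NULL
-- row for collection 2, which has no articles  -> articles.id is NULL
-- An INNER JOIN would return only the first of these three rows.
```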

So the first thing we can do is change

FULL JOIN collections ON collections.id = articles.collection_id

to a simple join

JOIN collections ON collections.id = articles.collection_id

The second odd thing is that the query joins articles with reactions and page_views but then disregards those joins. How do I know you're not interested in the result of these additional joins? In the outermost SELECT:

SELECT articles.id,
       articles.title,
       articles.published_at,
       articles.positive_reactions_count AS total_reactions,
       articles.cached_tag_list,
       collections.slug AS series_title,

the query only retrieves fields from articles and collections.

At this point one might say: "but rhymes, the query is also counting, summing and averaging data from reactions and page_views." Yes, but that data is loaded through subqueries that already filter with the crucial WHERE conditions, without needing any join. Knowing this, we can remove the outermost joins as well:

INNER JOIN reactions ON reactions.reactable_id = articles.id AND reactions.reactable_type = 'Article'
INNER JOIN page_views ON page_views.article_id = articles.id

The last odd thing is the aliasing of the last two aggregate subqueries: they currently have the same name. I would change the last one to:

(SELECT AVG(page_views.time_tracked_in_seconds)
     FROM page_views
     WHERE page_views.article_id = articles.id) AS average_time_spent_on_article

(change the alias to something more meaningful; I'm just guessing at what it computes)

So, at this point our query is:

SELECT articles.id,
       articles.title,
       articles.published_at,
       articles.positive_reactions_count AS total_reactions,
       articles.cached_tag_list,
       collections.slug AS series_title,

    (SELECT COUNT(reactions.id)
     FROM reactions
     WHERE reactions.category = 'like'
         AND reactions.reactable_id = articles.id
         AND reactions.reactable_type = 'Article') AS total_likes,

    (SELECT COUNT(reactions.id)
     FROM reactions
     WHERE reactions.category = 'unicorn'
         AND reactions.reactable_id = articles.id
         AND reactions.reactable_type = 'Article') AS total_unicorns,

    (SELECT COUNT(reactions.id)
     FROM reactions
     WHERE reactions.category = 'readinglist'
         AND reactions.reactable_id = articles.id
         AND reactions.reactable_type = 'Article') AS total_bookmarks,

    (SELECT SUM(page_views.counts_for_number_of_views)
     FROM page_views
     WHERE page_views.article_id = articles.id) AS total_views,

    (SELECT AVG(page_views.time_tracked_in_seconds)
     FROM page_views
     WHERE page_views.article_id = articles.id) AS average_time_spent_on_article
FROM articles
LEFT JOIN collections ON collections.id = articles.collection_id
WHERE articles.user_id = 13
    AND articles.published = TRUE;

Much more readable, isn't it?

Is there more room for optimization? Certainly. Should we do it? Well, it depends on the performance. Subqueries are not super efficient, but they are certainly more readable.

I'm going to leave here the query plan for reference:

+--------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                   |
|--------------------------------------------------------------------------------------------------------------|
| Hash Join  (cost=8.17..60.18 rows=1 width=163)                                                               |
|   Hash Cond: (collections.id = articles.collection_id)                                                       |
|   ->  Seq Scan on collections  (cost=0.00..13.70 rows=370 width=36)                                          |
|   ->  Hash  (cost=8.15..8.15 rows=1 width=71)                                                                |
|         ->  Index Scan using index_articles_on_user_id on articles  (cost=0.14..8.15 rows=1 width=71)        |
|               Index Cond: (user_id = 13)                                                                     |
|               Filter: published                                                                              |
|   SubPlan 1                                                                                                  |
|     ->  Aggregate  (cost=11.30..11.31 rows=1 width=8)                                                        |
|           ->  Bitmap Heap Scan on reactions  (cost=4.17..11.29 rows=1 width=4)                               |
|                 Recheck Cond: ((reactable_type)::text = 'Article'::text)                                     |
|                 Filter: (((category)::text = 'like'::text) AND (reactable_id = articles.id))                 |
|                 ->  Bitmap Index Scan on index_reactions_on_reactable_type  (cost=0.00..4.17 rows=3 width=0) |
|                       Index Cond: ((reactable_type)::text = 'Article'::text)                                 |
|   SubPlan 2                                                                                                  |
|     ->  Aggregate  (cost=11.30..11.31 rows=1 width=8)                                                        |
|           ->  Bitmap Heap Scan on reactions reactions_1  (cost=4.17..11.29 rows=1 width=4)                   |
|                 Recheck Cond: ((reactable_type)::text = 'Article'::text)                                     |
|                 Filter: (((category)::text = 'unicorn'::text) AND (reactable_id = articles.id))              |
|                 ->  Bitmap Index Scan on index_reactions_on_reactable_type  (cost=0.00..4.17 rows=3 width=0) |
|                       Index Cond: ((reactable_type)::text = 'Article'::text)                                 |
|   SubPlan 3                                                                                                  |
|     ->  Aggregate  (cost=11.30..11.31 rows=1 width=8)                                                        |
|           ->  Bitmap Heap Scan on reactions reactions_2  (cost=4.17..11.29 rows=1 width=4)                   |
|                 Recheck Cond: ((reactable_type)::text = 'Article'::text)                                     |
|                 Filter: (((category)::text = 'readinglist'::text) AND (reactable_id = articles.id))          |
|                 ->  Bitmap Index Scan on index_reactions_on_reactable_type  (cost=0.00..4.17 rows=3 width=0) |
|                       Index Cond: ((reactable_type)::text = 'Article'::text)                                 |
|   SubPlan 4                                                                                                  |
|     ->  Aggregate  (cost=1.49..1.50 rows=1 width=8)                                                          |
|           ->  Seq Scan on page_views  (cost=0.00..1.48 rows=6 width=4)                                       |
|                 Filter: (article_id = articles.id)                                                           |
|   SubPlan 5                                                                                                  |
|     ->  Aggregate  (cost=1.49..1.50 rows=1 width=32)                                                         |
|           ->  Seq Scan on page_views page_views_1  (cost=0.00..1.48 rows=6 width=4)                          |
|                 Filter: (article_id = articles.id)                                                           |
+--------------------------------------------------------------------------------------------------------------+

There are a lot of "sub plans" (each subquery shows up as a "SubPlan") and at least two sequential scans we might want to investigate later on: the one on collections and the one on page_views.
But let's set the sequential scans aside for now; we can come back to them later.
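
If the sequential scans do become a problem on a real dataset, the usual first step is to check whether the foreign keys involved are indexed (the index name below is hypothetical):

```sql
-- page_views.article_id is filtered on in every subquery, so an index helps:
CREATE INDEX index_page_views_on_article_id ON page_views (article_id);

-- collections.id is a primary key and therefore already indexed; the planner
-- probably chose a sequential scan only because the table is tiny.
```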

There's a possible improvement for the query in its current state, using... GROUP BY :D.

GROUP BY

What the query is essentially asking for is a list of all published articles by the given user, with some aggregates. The word "aggregate" is a giant clue in SQL: whenever you have to aggregate data from one or more tables, you might want to investigate grouping.

So, what do we want to aggregate? As we've seen, we want the count of all likes, all unicorns and all bookmarks, the total page views, and the average time spent on an article.

How to do it?

If likes, unicorns and bookmarks had been in different tables, we could have done something like this and called it a day:

SELECT articles.id, articles.title, COUNT(likes.id), COUNT(unicorns.id)
FROM articles
LEFT JOIN likes ON articles.id = likes.article_id
LEFT JOIN unicorns ON articles.id = unicorns.article_id
WHERE articles.user_id = 13 AND articles.published = TRUE
GROUP BY articles.id, articles.title;

(I used LEFT JOIN because, I'm guessing, we want to return articles even without likes or unicorns; COUNT will yield 0 for those rows)

You can see it working if we simplify the real query to:

SELECT articles.id, articles.title, SUM(page_views.counts_for_number_of_views) AS total_views
FROM articles
LEFT JOIN page_views ON articles.id = page_views.article_id
WHERE articles.user_id = 13 AND articles.published = TRUE
GROUP BY articles.id, articles.title;

This is a quick verification that it's working from my local copy:

PracticalDeveloper_development> SELECT articles.id, articles.title, SUM(page_views.counts_for_number_of_views) AS total_views
 FROM articles
 LEFT JOIN page_views ON articles.id = page_views.article_id
 WHERE articles.user_id = 13 AND articles.published = TRUE
 GROUP BY articles.id, articles.title;

+------+--------------+---------------+
| id   | title        | total_views   |
|------+--------------+---------------|
| 26   | test title   | 7             |
| 27   | Test         | 2             |
| 28   | Test with v2 | 2             |
| 29   | title        | 3             |
+------+--------------+---------------+
SELECT 4
Time: 0.062s
PracticalDeveloper_development> select SUM(page_views.counts_for_number_of_views) from page_views where article_id = 26;
+-------+
| sum   |
|-------|
| 7     |
+-------+
SELECT 1
Time: 0.017s
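
One caveat worth knowing about this style (invisible in my tiny dataset): COUNT returns 0 when there are no matching rows, but SUM and AVG return NULL. If you want a 0 instead, wrap the aggregate in COALESCE:

```sql
SELECT articles.id, articles.title,
       COALESCE(SUM(page_views.counts_for_number_of_views), 0) AS total_views
FROM articles
LEFT JOIN page_views ON articles.id = page_views.article_id
WHERE articles.user_id = 13 AND articles.published = TRUE
GROUP BY articles.id, articles.title;
```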

Since likes, unicorns and bookmarks all live in the same reactions table, we need to resort either to subqueries, as you've done, or to unearth filtered aggregates. What are they?

As we've seen in the query above, aggregate expressions let the database compute an aggregate over a set of selected rows (i.e. counting all likes that have the requested article id). A filtered aggregate lets you compute an aggregate while filtering on some condition. Let's try:

SELECT articles.id, articles.title,
  COUNT(reactions.id) FILTER (WHERE reactions.category = 'like') AS total_likes
FROM articles
LEFT JOIN reactions ON reactions.reactable_id = articles.id AND reactions.reactable_type = 'Article'
WHERE articles.user_id = 13 AND articles.published = TRUE
GROUP BY articles.id, articles.title;

Example result:

PracticalDeveloper_development> SELECT articles.id, articles.title,
   COUNT(reactions.id) FILTER (WHERE reactions.category = 'like') AS total_likes
 FROM articles
 LEFT JOIN reactions ON reactions.reactable_id = articles.id AND reactions.reactable_type = 'Article'
 WHERE articles.user_id = 13 AND articles.published = TRUE
 GROUP BY articles.id, articles.title;
+------+--------------+---------------+
| id   | title        | total_likes   |
|------+--------------+---------------|
| 26   | test title   | 1             |
| 27   | Test         | 0             |
| 28   | Test with v2 | 0             |
| 29   | title        | 0             |
+------+--------------+---------------+
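
As a side note, FILTER is standard SQL but not supported everywhere (PostgreSQL and SQLite have it, MySQL does not, last I checked). The portable equivalent puts a CASE inside the aggregate, relying on COUNT ignoring NULLs:

```sql
SELECT articles.id, articles.title,
       COUNT(CASE WHEN reactions.category = 'like' THEN 1 END) AS total_likes
FROM articles
LEFT JOIN reactions ON reactions.reactable_id = articles.id
                   AND reactions.reactable_type = 'Article'
WHERE articles.user_id = 13 AND articles.published = TRUE
GROUP BY articles.id, articles.title;
-- the CASE yields NULL for non-likes, and COUNT skips NULLs
```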

With this in mind we can rewrite the original query like this:

SELECT articles.id,
       articles.title,
       articles.published_at,
       articles.positive_reactions_count AS total_reactions,
       articles.cached_tag_list,
       collections.slug AS series_title,
       COUNT(reactions.id) FILTER (
                                   WHERE reactions.category = 'like') AS total_likes,
       COUNT(reactions.id) FILTER (
                                   WHERE reactions.category = 'unicorn') AS total_unicorns,
       COUNT(reactions.id) FILTER (
                                   WHERE reactions.category = 'readinglist') AS total_bookmarks,
       SUM(page_views.counts_for_number_of_views) AS total_views,
       AVG(page_views.time_tracked_in_seconds) AS average_time_spent_on_article
FROM articles
LEFT JOIN collections ON collections.id = articles.collection_id
LEFT JOIN reactions ON reactions.reactable_id = articles.id
AND reactions.reactable_type = 'Article'
LEFT JOIN page_views ON page_views.article_id = articles.id
WHERE articles.user_id = 13
    AND articles.published = TRUE
GROUP BY articles.id,
         articles.title,
         articles.published_at,
         articles.positive_reactions_count,
         articles.cached_tag_list,
         collections.slug;

This is the result:

+------+--------------+----------------------------+-------------------+-------------------+----------------+---------------+------------------+-------------------+---------------+---------------------------------+
| id   | title        | published_at               | total_reactions   | cached_tag_list   | series_title   | total_likes   | total_unicorns   | total_bookmarks   | total_views   | average_time_spent_on_article   |
|------+--------------+----------------------------+-------------------+-------------------+----------------+---------------+------------------+-------------------+---------------+---------------------------------|
| 26   | test title   | 2019-04-08 11:01:37.354794 | 0                 | tags              | <null>         | 7             | 0                | 0                 | 7             | 192.8571428571428571            |
| 27   | Test         | 2019-04-10 10:50:43.777291 | 0                 | test              | <null>         | 0             | 0                | 0                 | 2             | 97.5000000000000000             |
| 28   | Test with v2 | 2019-04-10 10:51:26.636466 | 0                 | test              | <null>         | 0             | 0                | 0                 | 2             | 15.0000000000000000             |
| 29   | title        | 2019-04-10 19:57:58.779827 | 0                 |                   | <null>         | 0             | 0                | 0                 | 3             | 15.0000000000000000             |
+------+--------------+----------------------------+-------------------+-------------------+----------------+---------------+------------------+-------------------+---------------+---------------------------------+

As you can see, all of the user's articles are there. This is what happens if, instead of a LEFT JOIN, I use a normal join with collections:

+------+---------+----------------+-------------------+-------------------+----------------+---------------+------------------+-------------------+---------------+---------------------------------+
| id   | title   | published_at   | total_reactions   | cached_tag_list   | series_title   | total_likes   | total_unicorns   | total_bookmarks   | total_views   | average_time_spent_on_article   |
|------+---------+----------------+-------------------+-------------------+----------------+---------------+------------------+-------------------+---------------+---------------------------------|
+------+---------+----------------+-------------------+-------------------+----------------+---------------+------------------+-------------------+---------------+---------------------------------+
SELECT 0

Since I have no articles in any collection, I get no results.

Let's see the query plan:

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                                                                              |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| GroupAggregate  (cost=36.45..36.49 rows=1 width=163) (actual time=0.188..0.194 rows=4 loops=1)                                                                          |
|   Group Key: articles.id, collections.slug                                                                                                                              |
|   ->  Sort  (cost=36.45..36.45 rows=1 width=143) (actual time=0.174..0.175 rows=14 loops=1)                                                                             |
|         Sort Key: articles.id, collections.slug                                                                                                                         |
|         Sort Method: quicksort  Memory: 26kB                                                                                                                            |
|         ->  Nested Loop Left Join  (cost=12.34..36.44 rows=1 width=143) (actual time=0.077..0.162 rows=14 loops=1)                                                      |
|               Join Filter: (page_views.article_id = articles.id)                                                                                                        |
|               Rows Removed by Join Filter: 154                                                                                                                          |
|               ->  Nested Loop Left Join  (cost=12.34..34.58 rows=1 width=135) (actual time=0.067..0.101 rows=4 loops=1)                                                 |
|                     Join Filter: (reactions.reactable_id = articles.id)                                                                                                 |
|                     Rows Removed by Join Filter: 3                                                                                                                      |
|                     ->  Hash Right Join  (cost=8.17..23.27 rows=1 width=99) (actual time=0.046..0.050 rows=4 loops=1)                                                   |
|                           Hash Cond: (collections.id = articles.collection_id)                                                                                          |
|                           ->  Seq Scan on collections  (cost=0.00..13.70 rows=370 width=36) (actual time=0.004..0.004 rows=0 loops=1)                                   |
|                           ->  Hash  (cost=8.15..8.15 rows=1 width=71) (actual time=0.028..0.028 rows=4 loops=1)                                                         |
|                                 Buckets: 1024  Batches: 1  Memory Usage: 9kB                                                                                            |
|                                 ->  Index Scan using index_articles_on_user_id on articles  (cost=0.14..8.15 rows=1 width=71) (actual time=0.018..0.023 rows=4 loops=1) |
|                                       Index Cond: (user_id = 13)                                                                                                        |
|                                       Filter: published                                                                                                                 |
|                     ->  Bitmap Heap Scan on reactions  (cost=4.17..11.28 rows=3 width=40) (actual time=0.008..0.008 rows=1 loops=4)                                     |
|                           Recheck Cond: ((reactable_type)::text = 'Article'::text)                                                                                      |
|                           Heap Blocks: exact=4                                                                                                                          |
|                           ->  Bitmap Index Scan on index_reactions_on_reactable_type  (cost=0.00..4.17 rows=3 width=0) (actual time=0.006..0.006 rows=1 loops=4)        |
|                                 Index Cond: ((reactable_type)::text = 'Article'::text)                                                                                  |
|               ->  Seq Scan on page_views  (cost=0.00..1.38 rows=38 width=16) (actual time=0.003..0.007 rows=42 loops=4)                                                 |
| Planning Time: 0.576 ms                                                                                                                                                 |
| Execution Time: 0.332 ms                                                                                                                                                |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I would stop here; I don't have a huge dataset to test with, so you might have to run this second query and see for yourself (let me know if everything works).

 

(I just noticed there's a bug in the GROUP BY version, so I suggest using the subselects at the top of my comment in the meantime. I'll fix it later :D)

 

Found it: I missed the DISTINCT. So, the final queries:

With sub selects

PracticalDeveloper_development> SELECT articles.id,
        articles.title,
        articles.published_at,
        articles.positive_reactions_count AS total_reactions,
        articles.cached_tag_list,
        collections.slug AS series_title,

     (SELECT COUNT(reactions.id)
      from reactions
      where reactions.category = 'like'
          AND reactions.reactable_id = articles.id
          AND reactions.reactable_type = 'Article') AS total_likes,

     (SELECT COUNT(reactions.id)
      from reactions
      where reactions.category = 'unicorn'
          AND reactions.reactable_id = articles.id
          AND reactions.reactable_type = 'Article') AS total_unicorns,

     (SELECT COUNT(reactions.id)
      from reactions
      where reactions.category = 'readinglist'
          AND reactions.reactable_id = articles.id
          AND reactions.reactable_type = 'Article') AS total_bookmarks,

     (SELECT SUM(page_views.counts_for_number_of_views)
      from page_views
      where page_views.article_id = articles.id) AS total_views,

     (SELECT AVG(page_views.time_tracked_in_seconds)
      from page_views
      where page_views.article_id = articles.id) AS average_time_spent_on_article
 FROM articles
 LEFT JOIN collections ON collections.id = articles.collection_id
 WHERE articles.user_id = 13
     AND articles.published = TRUE;
+------+--------------+----------------------------+-------------------+-------------------+----------------+---------------+------------------+-------------------+---------------+---------------------------------+
| id   | title        | published_at               | total_reactions   | cached_tag_list   | series_title   | total_likes   | total_unicorns   | total_bookmarks   | total_views   | average_time_spent_on_article   |
|------+--------------+----------------------------+-------------------+-------------------+----------------+---------------+------------------+-------------------+---------------+---------------------------------|
| 26   | test title   | 2019-04-08 11:01:37.354794 | 0                 | tags              | <null>         | 1             | 0                | 0                 | 7             | 192.8571428571428571            |
| 27   | Test         | 2019-04-10 10:50:43.777291 | 0                 | test              | <null>         | 0             | 0                | 0                 | 2             | 97.5000000000000000             |
| 28   | Test with v2 | 2019-04-10 10:51:26.636466 | 0                 | test              | <null>         | 0             | 0                | 0                 | 2             | 15.0000000000000000             |
| 29   | title        | 2019-04-10 19:57:58.779827 | 0                 |                   | <null>         | 0             | 0                | 0                 | 3             | 15.0000000000000000             |
+------+--------------+----------------------------+-------------------+-------------------+----------------+---------------+------------------+-------------------+---------------+---------------------------------+
SELECT 4

With GROUP by

PracticalDeveloper_development> SELECT articles.id,
        articles.title,
        articles.published_at,
        articles.positive_reactions_count AS total_reactions,
        articles.cached_tag_list,
        collections.slug AS series_title,
        COUNT(DISTINCT reactions.id) FILTER (
                                             WHERE reactions.category = 'like') AS total_likes,
        COUNT(DISTINCT reactions.id) FILTER (
                                             WHERE reactions.category = 'unicorn') AS total_unicorns,
        COUNT(DISTINCT reactions.id) FILTER (
                                             WHERE reactions.category = 'readinglist') AS total_bookmarks,
        SUM(page_views.counts_for_number_of_views) AS total_views,
        AVG(page_views.time_tracked_in_seconds) AS average_time_spent_on_article
 FROM articles
 LEFT JOIN collections ON collections.id = articles.collection_id
 LEFT JOIN reactions ON reactions.reactable_id = articles.id
 AND reactions.reactable_type = 'Article'
 LEFT JOIN page_views ON page_views.article_id = articles.id
 WHERE articles.user_id = 13
     AND articles.published = TRUE
 GROUP BY articles.id,
          articles.title,
          articles.published_at,
          articles.positive_reactions_count,
          articles.cached_tag_list,
          collections.slug;
+------+--------------+----------------------------+-------------------+-------------------+----------------+---------------+------------------+-------------------+---------------+---------------------------------+
| id   | title        | published_at               | total_reactions   | cached_tag_list   | series_title   | total_likes   | total_unicorns   | total_bookmarks   | total_views   | average_time_spent_on_article   |
|------+--------------+----------------------------+-------------------+-------------------+----------------+---------------+------------------+-------------------+---------------+---------------------------------|
| 26   | test title   | 2019-04-08 11:01:37.354794 | 0                 | tags              | <null>         | 1             | 0                | 0                 | 7             | 192.8571428571428571            |
| 27   | Test         | 2019-04-10 10:50:43.777291 | 0                 | test              | <null>         | 0             | 0                | 0                 | 2             | 97.5000000000000000             |
| 28   | Test with v2 | 2019-04-10 10:51:26.636466 | 0                 | test              | <null>         | 0             | 0                | 0                 | 2             | 15.0000000000000000             |
| 29   | title        | 2019-04-10 19:57:58.779827 | 0                 |                   | <null>         | 0             | 0                | 0                 | 3             | 15.0000000000000000             |
+------+--------------+----------------------------+-------------------+-------------------+----------------+---------------+------------------+-------------------+---------------+---------------------------------+
SELECT 4

As you can see, same result.
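
Why was the DISTINCT needed? As far as I can tell, joining both reactions and page_views multiplies rows: an article with 1 reaction and 7 page views produces 7 joined rows, so a plain COUNT(reactions.id) sees that single reaction 7 times. Note the same hazard applies in reverse to SUM and AVG on page_views once an article has more than one reaction, and those have no per-id DISTINCT escape hatch. One way around that would be to pre-aggregate page_views in a derived table, something like this untested sketch:

```sql
SELECT articles.id,
       articles.title,
       COUNT(DISTINCT reactions.id) FILTER (WHERE reactions.category = 'like') AS total_likes,
       pv.total_views,
       pv.average_time_spent_on_article
FROM articles
LEFT JOIN reactions ON reactions.reactable_id = articles.id
                   AND reactions.reactable_type = 'Article'
LEFT JOIN (
    -- one row per article, so joining it cannot fan out the result
    SELECT article_id,
           SUM(counts_for_number_of_views) AS total_views,
           AVG(time_tracked_in_seconds) AS average_time_spent_on_article
    FROM page_views
    GROUP BY article_id
) pv ON pv.article_id = articles.id
WHERE articles.user_id = 13
  AND articles.published = TRUE
GROUP BY articles.id, articles.title, pv.total_views, pv.average_time_spent_on_article;
```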

Hope this helps Andy!

Wow Rhymes, thanks for such an excellent answer!!! The query plans really help to see how they are performing. Interesting that FILTER allows you to use aggregates much faster than a subquery. Definitely have to read a book or two about SQL now!

In case you're wondering, getting 119 rows took about 4 minutes to run the GROUP BY query. We're not looking to put this in the app by any means, so the time is totally fine. Definitely helps me think about how to structure AnalyticsService.rb queries though.

Glad it helped!

Interesting that FILTER allows you to use aggregates much faster than a subquery.

I think it's because the DB can do it with less work, applying multiple conditions to the same set of data (the joined table on the right of the join, reactions in this example), whereas with subqueries it has to execute a separate subquery for each row of the articles table. Not much to optimize there.

In case you're wondering, getting 119 rows took about 4 minutes to run the GROUP BY query.

Mmm, not great though :D How long was it taking with the subqueries? Anyhow, it's something. I have a feeling it can be further optimized by checking if and which indexes are being used. The other idea I had (untested and unexplored) was to use window functions, but I haven't tried, nor do I know whether it can be done or whether it's faster :D
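
For the record, the window-function idea would look something like this sketch (untested, and note that a window function does not collapse rows the way GROUP BY does, so a DISTINCT or an outer query is still needed):

```sql
-- Sketch only: aggregates used as window functions, partitioned per article.
-- PostgreSQL allows FILTER on aggregates used as window functions.
SELECT DISTINCT
       articles.id,
       articles.title,
       COUNT(reactions.id) FILTER (WHERE reactions.category = 'like')
           OVER (PARTITION BY articles.id) AS total_likes
FROM articles
LEFT JOIN reactions ON reactions.reactable_id = articles.id
                   AND reactions.reactable_type = 'Article'
WHERE articles.user_id = 13
  AND articles.published = TRUE;
```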

We're not looking to put this in the app by any means, so the time is totally fine

Good :D

Definitely helps me think about how to structure AnalyticsService.rb queries though

Let me know if you find a different approach for that. After all, you have the great advantage of a working version you can use as a comparison: extract a set of data, then play with grouping until you get the same result. :D
