DEV Community

Cover image for A Second Walk Through of Composing a SQL Query
Jeremy Friesen for The DEV Team

Posted on • Originally published at takeonrules.com on

A Second Walk Through of Composing a SQL Query

Showing my Work as I Query a Forem Instance

This builds on work from my Walk Through of Using Postgresql and Blazer to Generate a Cohort Report.

The query I’ll be building helps answer what percentage of active users commented on at least one welcome article. For this query, an active user is someone who’s been on the site 4 of the last 7 days.

Establishing the Entity Relationship Model

Again, when writing SQL, I like to start with a relationship diagram. In this data foray, we have four relevant tables.

class User
  has_many :articles
  has_many :comments
end

class Article
  belongs_to :user
  has_many :comments, as: :commentable
end

class Comment
  belongs_to :commentable, polymorphic: true
  belongs_to :user
end

class PageView
  belongs_to :article
  belongs_to :user, optional: true
end

Enter fullscreen mode Exit fullscreen mode

Below is a diagram for those of you who prefer an Entity Relationship Model (ERM) of the four data models.

An ERM of the tables used in these queries.

An Entity Relationship Model of Users, Articles, Comments, and Page Views

Querying All Active Members

I want all users who have page_views on at least 4 of the last seven days; we’ll consider these “active users.”

First I want to build a very narrow query; one that lets me make sure I know I’m on the right path. I’ll limit the page views to my user_id:

SELECT user_id, extract(isodow from created_at) AS day_of_week
FROM page_views
WHERE page_views.user_id = 702612
      AND page_views.created_at > CURRENT_DATE - 7
GROUP BY page_views.user_id, day_of_week
Enter fullscreen mode Exit fullscreen mode

The following query times out; It’s trying to query all users.

SELECT dow.user_id,
       count(dow.day_of_week) AS number_of_days
FROM (
  SELECT user_id,
         extract(isodow from created_at) AS day_of_week
  FROM page_views
  WHERE page_views.created_at > CURRENT_DATE - 7
    AND user_id IS NOT NULL
  GROUP BY page_views.user_id, day_of_week
) AS dow
GROUP BY dow.user_id
HAVING count(dow.day_of_week) >= 4
Enter fullscreen mode Exit fullscreen mode

Because of the enormity of the page views we need to limit to only recently updated users. The following is the query to get recent users.

SELECT id
FROM users
WHERE updated_at
      > CURRENT_DATE - 7
Enter fullscreen mode Exit fullscreen mode

The following query is the foundational “Who are the current active users of Forem.”

SELECT DISTINCT dow.user_id FROM (
  SELECT users.id AS user_id,
    extract(isodow from page_views.created_at) AS day_of_week
  FROM users
  INNER JOIN page_views
    ON page_views.user_id = users.id
    AND page_views.created_at > CURRENT_DATE - 7
    AND user_id IS NOT NULL
  -- Extend the window for users just a bit to account --
  -- for timing variance --
  WHERE users.updated_at
        >  NOW()::date - INTERVAL '8 day'
  GROUP BY users.id, day_of_week) AS dow
GROUP BY dow.user_id
HAVING count(dow.day_of_week) >= 4
Enter fullscreen mode Exit fullscreen mode

I “saved” the above query to https://dev.to/admin/blazer/queries/717-regular-and-active-recent-users-of-dev. We now have our “who’s the currently active users of DEV.to” query.

Querying Active Users Who Have Commented on a Welcome Post

The next part is to work out who all commented on a welcome post. In Walk Through of Using Postgresql and Blazer to Generate a Cohort Report, I wrote about finding the users who had commented on the welcome article.

However, I need to adjust the cohort query; I only want users who commented on the welcome post. The cohort query has users who did and did not comment on the welcome post.

As a quick reminder, the result of the following query is all user_id that commented on a welcome post; but with a limitation on the user’s updated_at

SELECT DISTINCT comments.user_id AS user_id
FROM comments
INNER JOIN users
  ON comments.user_id = users.id
    -- Extend the window for users just a bit to --
    -- account for timing variance --
    AND users.updated_at > CURENT_DATE - 8
INNER JOIN articles
  ON comments.commentable_id = articles.id
    AND comments.commentable_type = 'Article'
    AND articles.title LIKE 'Welcome Thread - v%'
    AND articles.published = true
    AND articles.user_id = 3
GROUP BY comments.user_id
Enter fullscreen mode Exit fullscreen mode

Now to meld the two queries. I’m using the Postgresql WITH statement to create two queries that I can reference later on. I find the WITH statement to help “encapsulate” queries and hopefully make them more conceptually understandable.

WITH cow AS (
  -- User IDs of recent folks who have commented on the
  -- welcome threads --
  SELECT DISTINCT comments.user_id AS user_id
  FROM comments
  INNER JOIN users ON comments.user_id = users.id
    -- Extend the window for users just a bit to account --
    -- for timing variance --
    AND users.updated_at > CURENT_DATE - 8
  INNER JOIN articles
    ON comments.commentable_id = articles.id
    AND comments.commentable_type = 'Article'
    AND articles.title LIKE 'Welcome Thread - v%'
    AND articles.published = true
    AND articles.user_id = 3
  GROUP BY comments.user_id
), dow AS (
  -- User IDs of folks who have interacted at least 4 different
  -- days of this week --
  SELECT user_id FROM (
    SELECT users.id AS user_id,
      extract(isodow from page_views.created_at) AS day_of_week
    FROM users
    INNER JOIN page_views
      ON page_views.user_id = users.id
      AND page_views.created_at > CURRENT_DATE - 7
      AND user_id IS NOT NULL
    -- Extend the window for users just a bit to account for
    -- timing variance --
    WHERE users.updated_at > CURRENT_DATE - 8
    GROUP BY users.id, day_of_week
  ) AS dows
  GROUP BY user_id
  HAVING COUNT(day_of_week) >= 4
)

SELECT COUNT(dow.user_id) AS count_of_users,
  (
    SELECT COUNT(*)
    FROM dow
    INNER JOIN cow
      ON cow.user_id = dow.user_id
  ) AS count_of_users_that_said_hello
FROM dow
Enter fullscreen mode Exit fullscreen mode

Conclusion

When I was writing the last query, I kept getting a result that said every active user on the site had commented on a welcome article. I didn’t trust that result; it seemed highly improbable. I revisited my queries and logic, found my error, reworked the query and got a more reasonable answer.

What was wrong? I had copied and pasted a query from Walk Through of Using Postgresql and Blazer to Generate a Cohort Report. But that query wasn’t the right thing to ask. It does highlight one challenge of SQL; it can be hard to test the correctness of your query.

Discussion (5)

Collapse
geraldew profile image
geraldew

I have to admit, I'm puzzled by your double layer query - shown after "The following query times out"

My impression is that you could just do the following (and as it's not my familiar dialect, I had to check that PostgreSQL supports both CURRENT_DATE and DISTINCT ).

SELECT 
    user_id,
    count( DISTINCT created_at) AS created_at_ctd
FROM 
    page_views
WHERE 
    created_at > CURRENT_DATE - 7
    AND 
    user_id IS NOT NULL
GROUP BY 
    user_id
HAVING 
    created_at_ctd >= 4
;
Enter fullscreen mode Exit fullscreen mode

As you have limited the date span to the last 7 days, there is no particular need to derive what days of the week those are, as each date will be a different one anyway.

Admittedly this begs the question of which data interpretation you'd want for a longer date span, say over two weeks - i.e. whether you would want "different days" or "different days of the week".

p.s. it's cute that you've quoted the Ruby classes but it means nothing to those of us who work in SQL. I've assumed that created_at is a date column for display simplicity, but if it's a timestamp then some additional wrapping will be necessary (probably just a CAST to DATE).

Collapse
jeremyf profile image
Jeremy Friesen Author

I tried the following and it appears to get me the answer:

SELECT 
    user_id,
    count( DISTINCT  extract(isodow from created_at)) AS created_at_ctd
FROM 
    page_views
WHERE 
    created_at > CURRENT_DATE - 7
    AND 
    user_id IS NOT NULL
GROUP BY 
    user_id
HAVING 
     count( DISTINCT  extract(isodow from created_at)) >= 4
;
Enter fullscreen mode Exit fullscreen mode

So thank you for helping me further simplify this thing.

The created_at, as a Ruby on Rails idiom is a timestamp.

I included the Ruby as it's reasonably accessible from a description (though I could make it easier).

Collapse
jeremyf profile image
Jeremy Friesen Author

I also think I had an off by one error in my query that your fix updates. So a double thank you.

Collapse
chrisgreening profile image
Chris Greening

What a fun walkthrough - it's really important to show how code is applied to real world scenarios and this tutorial did a great job of doing that. Thanks for posting!

Collapse
bobbyiliev profile image
Bobby Iliev

Very cool!