DEV Community

Jeremy Friesen for The DEV Team

Posted on

Digging into Postgresql and DEV

Earlier I was reviewing the draft Community Wellness badge pull request; and with my head deep in SQL these days, I thought I’d give a go at crafting a query to create this logic.

The following query finds the user IDs and weeks since today in which a user has written at least two comments that don’t have a negative moderator reaction.

  • user_id: The user’s database ID.
  • weeks_ago: The number of weeks (since today) in which we’re grouping comments
  • number_of_comments_with_positive_reaction: How many positive reaction comments did they have for the weeks_ago.
SELECT user_id,
    COUNT(user_id) as number_of_comments_with_positive_reaction,
    /* Get the number of weeks, since today for posts */
    (trunc((extract(epoch FROM (current_timestamp- created_at))) / 604800)) AS weeks_ago
FROM comments
/* Only select comments from the last 32 weeks that
   don't have a negative moderator reaction */
INNER JOIN
    /* Find all comments in the last 32 weeks */
    (SELECT DISTINCT reactable_id
    FROM reactions
    WHERE reactable_type = 'Comment'
    AND created_at > (now() - interval '224' day)
    /* Omit any comments that got a negative moderator reaction */
    EXCEPT
      SELECT DISTINCT reactable_id
      FROM reactions
      WHERE reactable_type = 'Comment'
      AND created_at > (now() - interval '224' day)
      AND category IN ('thumbsdown', 'vomit')) AS positve_reactions
    ON comments.id = positve_reactions.reactable_id
INNER JOIN
    /* Find the users who have at least two comments in the last week */
    (SELECT count(id) AS number_of_comments,
        user_id AS comment_counts_user_id
      FROM comments
      WHERE created_at >= (now() - interval '7' day)
      GROUP BY user_id) AS comment_counts
      ON comments.user_id = comment_counts_user_id
      AND comment_counts.number_of_comments > 1
/* Don’t select anything older than 224 days ago, or 32 weeks ago */
WHERE created_at > (now() - interval '224' day)
GROUP BY user_id, weeks_ago
Enter fullscreen mode Exit fullscreen mode

The above query creates multiple rows per user id. Which is fine, but if you want to loop through things, you’ll need to bust out some temporary variable magic.

I was wondering if I'd be able to get this down to one query. With the help of some folks at Forem, I wrote the following query aggregates that information for you; but you need to do some assembly work.

The columns are:

  • user_id: The user’s database ID.
  • serialized_weeks_ago: A comma separated list of the weeks in which we had comments.
  • weeks_ago_array: An array of integers that is the non-string representation of serialized_weeks_ago; we want to see how ActiveRecord handles this array of integers. It's a the simpler version of the serialized_weeks_ago.
  • serialized_comment_counts: A comma separated list of the number of comments.

The first number of the serialized_weeks_ago maps to the first number of the serialized_comment_counts. And you get one row per user.

SELECT user_id,
       /* A comma separated string of "weeks_ago" */
       array_to_string(array_agg(weeks_ago), ',') AS serialized_weeks_ago,
       /* Will active record convert this to an array of integers? */
       array_agg(weeks_ago) AS weeks_ago_array,
       /* A comma separated string of comment counts.  The first value in this string
       happens on the week that is the first value in serialized_weeks_ago */
       array_to_string(array_agg(number_of_comments_with_positive_reaction), ',') AS serialized_comment_counts
FROM
/* This is the same query as the first example query */
(SELECT user_id,
    COUNT(user_id) as number_of_comments_with_positive_reaction,
    /* Get the number of weeks, since today for posts */
    (trunc((extract(epoch FROM (current_timestamp- created_at))) / 604800)) AS weeks_ago
FROM comments
/* Only select comments from the last 32 weeks that
   don't have a negative moderator reaction */
INNER JOIN
    /* Find all comments in the last 32 weeks */
    (SELECT DISTINCT reactable_id
    FROM reactions
    WHERE reactable_type = 'Comment'
    AND created_at > (now() - interval '224' day)
    /* Omit any comments that got a negative moderator reaction */
    EXCEPT
      SELECT DISTINCT reactable_id
      FROM reactions
      WHERE reactable_type = 'Comment'
      AND created_at > (now() - interval '224' day)
      AND category IN ('thumbsdown', 'vomit')) AS positve_reactions
    ON comments.id = positve_reactions.reactable_id
INNER JOIN
    /* Find the users who have at least two comments in the last week */
    (SELECT count(id) AS number_of_comments,
        user_id AS comment_counts_user_id
      FROM comments
      WHERE created_at >= (now() - interval '7' day)
      GROUP BY user_id) AS comment_counts
      ON comments.user_id = comment_counts_user_id
      AND comment_counts.number_of_comments > 1
/* Don’t select anything older than 224 days ago, or 32 weeks ago */
WHERE created_at > (now() - interval '224' day)
GROUP BY user_id, weeks_ago
) AS user_comment_counts_by_week GROUP BY user_id
Enter fullscreen mode Exit fullscreen mode

I am eager to share these Postgresql approaches as they can help circumvent running lots of smaller queries. I also had the chance to pair up with two folks to make sure we wrote the correct logic and it was performant enough.

Discussion (7)

Collapse
wiseai profile image
Mahmoud Harmouch • Edited on

This is a good starting point. But I think you can improve it by avoiding correlated subqueries(user_id appears in the inner and outer tables); your query will result in the following:

SELECT 
  user_comment_counts_by_week.user_id, 
  array_to_string(
    array_agg(weeks_ago), 
    ','
  ) AS serialized_weeks_ago, 
  array_agg(weeks_ago) AS weeks_ago_array, 
  array_to_string(
    array_agg(
      number_of_comments_with_positive_reaction
    ), 
    ','
  ) AS serialized_comment_counts 
FROM 
  (
    SELECT 
      comment_counts.user_id, 
      COUNT(comment_counts.user_id) AS number_of_comments_with_positive_reaction, 
      (
        trunc(
          (
            EXTRACT(
              epoch 
              FROM 
                (current_timestamp - created_at)
            )
          ) / 604800
        )
      ) AS weeks_ago 
    FROM 
      comments 
      INNER JOIN (
        SELECT 
          DISTINCT reactions.reactable_id 
        FROM 
          reactions 
        WHERE 
          reactions.reactable_type = 'Comment' 
          AND reactions.created_at > (now() - INTERVAL '224' day) 
        EXCEPT 
        SELECT 
          DISTINCT reactions.reactable_id 
        FROM 
          reactions 
        WHERE 
          reactions.reactable_type = 'Comment' 
          AND reactions.created_at > (now() - INTERVAL '224' day) 
          AND reactions.category IN ('thumbsdown', 'vomit')
      ) AS positve_reactions ON comments.id = positve_reactions.reactable_id 
      INNER JOIN (
        SELECT 
          count(comments.id) AS number_of_comments, 
          comments.user_id AS comment_counts_user_id 
        FROM 
          comments 
        WHERE 
          comments.created_at >= (now() - INTERVAL '7' day) 
        GROUP BY 
          comments.user_id
      ) AS comment_counts ON comments.user_id = comment_counts_user_id 
      AND comment_counts.number_of_comments > 1 
    WHERE 
      created_at > (now() - INTERVAL '224' day) 
    GROUP BY 
      comment_counts.user_id, 
      weeks_ago
  ) AS user_comment_counts_by_week 
GROUP BY 
  user_comment_counts_by_week.user_id
Enter fullscreen mode Exit fullscreen mode

I didn't run it cause I don't have the data to test it, but I think it should do the trick. Before running that query, you can create indexes for faster data retrieval:

CREATE INDEX comments_idx_user_id ON "comments" ("user_id");
CREATE INDEX comments_idx_created_at ON "comments" ("created_at");
CREATE INDEX reactions_idx_reactable_type_created_at ON "reactions" ("reactable_type","created_at");
Enter fullscreen mode Exit fullscreen mode

I am looking forward to knowing the result of the above queries.

Collapse
jeremyf profile image
Jeremy Friesen Author

I transcribed the above query into the following and ran it locally:

SELECT comment_counts_user_id,
       array_to_string(array_agg(weeks_ago),',') AS serialized_weeks_ago,
       array_agg(weeks_ago) AS weeks_ago_array,
       array_to_string(array_agg(number_of_comments_with_positive_reaction),',') AS serialized_comment_counts 
FROM (SELECT
    comment_counts_user_id,
        COUNT(comment_counts_user_id) AS number_of_comments_with_positive_reaction,
        (trunc((EXTRACT(epoch FROM (current_timestamp - created_at))) / 604800)) AS weeks_ago 
    FROM comments 
        INNER JOIN (
                SELECT DISTINCT reactions.reactable_id 
                FROM reactions 
                WHERE reactions.reactable_type = 'Comment' 
                    AND reactions.created_at > (now() - INTERVAL '224' day)
        EXCEPT SELECT DISTINCT reactions.reactable_id 
        FROM reactions 
        WHERE reactions.reactable_type = 'Comment' 
                    AND reactions.created_at > (now() - INTERVAL '224' day) 
            AND reactions.category IN ('thumbsdown', 'vomit')
            )
        AS positve_reactions 
            ON comments.id = positve_reactions.reactable_id 
    INNER JOIN (
                SELECT count(comments.id) AS number_of_comments,
                       comments.user_id AS comment_counts_user_id 
                FROM comments 
                WHERE comments.created_at >= (now() - INTERVAL '7' day) 
                GROUP BY comments.user_id
                )
        AS comment_counts 
                ON comments.user_id = comment_counts_user_id 
                AND comment_counts.number_of_comments > 1 
    WHERE created_at > (now() - INTERVAL '224' day) 
    GROUP BY comment_counts_user_id, weeks_ago)
AS user_comment_counts_by_week 
GROUP BY comment_counts_user_id
Enter fullscreen mode Exit fullscreen mode

The query you provided EXPLAIN ANALYZE had an almost negligible slow-down but appears to be generally as performant.

The indexes we have are in github.com/forem/forem/blob/8b37e3.... We don't have the multi-column index but each of those columns are indexed.

Collapse
wiseai profile image
Mahmoud Harmouch

That looks neat. Now, I am wondering if it is possible to do data partitioning(using the OVER clause instead of GROUP BY). What do you think?

Thread Thread
jeremyf profile image
Jeremy Friesen Author

I haven't used the OVER clause before. Looking at the docs now.

Thread Thread
wiseai profile image
Mahmoud Harmouch

It is something beyond our understanding. I feel like aliens have contacted us after inventing this clause. It is pretty funny to mention that I totally forget how it works whenever I decide to use it. Oh my! My brain hurts.

Thread Thread
jeremyf profile image
Jeremy Friesen Author

It is pretty funny to mention that I totally forget how it works whenever I decide to use it. Oh my! My brain hurts.

Any sufficiently advanced technology is indistinguishable from magic.

Thread Thread
wiseai profile image
Mahmoud Harmouch

So true!