Jeremy Friesen

Posted on Oct 28, 2021 • Originally published at takeonrules.com on Oct 28, 2021

Practicing Postgresql and Postulating (Im)Provements

#postgres #programming

Returning Again to Databases

For years, I wasn’t working with relational databases. But in my heart, I’ve always loved them and thrived when developing with them.

What follows is my iterating through some SQL statements to both better understand the database and also the content. I’ve been using Blazer, as written about on Dev.to by Molly. As part of my job, I have access to the DEV.to environment to run these queries.

The Reigning Champion

One task in front of me is to improve on the Feed Algorithm. To understand that, I need to understand the data. In the current incarnations we rely heavily on two scores:

article.score
article.hotness_score

To understand the algorithms, I need to look at those values. Here’s a query to understand the possible range of scores:

SELECT
  min(score),
  max(score),
  avg(score),
  stddev(score),
  percentile_cont(0.5) within GROUP(ORDER BY score) mean,
  percentile_cont(0.95) within GROUP(ORDER BY score) percentile_95
FROM articles;

The results:

min: -3,019
max: 12,199
avg: 8.92
stddev: 116.82
mean: 0.0
percentile_95: 48.0

There’s a lot of variance in the distribution of the score, and anything we add to that score for sorting will be inconsistent.

SELECT
  min(hotness_score),
  max(hotness_score),
  avg(hotness_score),
  stddev(hotness_score),
  percentile_cont(0.5) within GROUP(ORDER BY hotness_score) mean,
  percentile_cont(0.95) within GROUP(ORDER BY hotness_score) percentile_95
FROM articles;

min: -1,355,719
max: 579,658
avg: 126,604.84
stddev: 216,433.25
mean: 0.0
percentile_95: 364,979

And the situation is even more swingy with the hotness score. The underlying implementation works for smaller communities, but as DEV.to grows, the swingyness becomes more and more notable.

The particular challenge is that the score or hotness score is the starting point for all feed queries. In otherwords, to get a feed tailored for the user, there’s a lot of variance to overcome.

So I decided to approach from a different angle. What I have is incomplete, but I’m considering the structure as a strong competetor to compete against the current feed algorithm.

And in This Corner

First, I wanted to build some successes by writing some quick and easy queries. You know, practice a little bit. In the examples, I have :user_id as the placeholder for my personal DEV.to account number.

The following query is published articles that exclude those published by people I’ve blocked:

SELECT articles.id, title, published_at
FROM articles
LEFT OUTER JOIN user_blocks ON user_blocks.blocked_id = articles.user_id AND user_blocks.blocked_id IS NULL AND user_blocks.blocker_id = :user_id
WHERE published = true
ORDER BY published_at DESC
LIMIT 20;

Select the tags that I follow:

SELECT tags.*
FROM tags
INNER JOIN follows ON tags.id = follows.followable_id
AND follows.followable_type = 'ActsAsTaggableOn::Tag'
WHERE follows.follower_id = :user_id AND follows.follower_type = 'User';

Select the organizations that I follow:

SELECT organizations.*
FROM organizations
INNER JOIN follows
ON organizations.id = follows.followable_id
AND follows.followable_type = 'Organization'
WHERE follows.follower_id = :user_id AND follows.follower_type = 'User';

Select the users that I follow:

SELECT users.*
FROM users
INNER JOIN follows
ON users.id = follows.followable_id
AND follows.followable_type = 'User'
WHERE follows.follower_id = :user_id AND follows.follower_type = 'User';

Then I started doing computations in query. This query creates a freshness score based solely on the publication date and excluding articles from those people I blocked:

SELECT
  (1 - (current_date - published_at::date) * 0.1) AS freshness_score,
  articles.title,
  STRING_AGG(tags.name, ', ') AS tag_names
FROM articles
LEFT OUTER JOIN user_blocks
  ON user_blocks.blocked_id = articles.user_id
    AND user_blocks.blocked_id IS NULL
    AND user_blocks.blocker_id = :user_id
LEFT OUTER JOIN taggings
  ON taggings.taggable_id = articles.id
    AND taggable_type = 'Article'
INNER JOIN tags
  ON taggings.tag_id = tags.id
WHERE published = true
GROUP BY freshness_score, articles.title
ORDER BY freshness_score DESC
LIMIT 20

You can see the query in Blazer.

Next, I started thinking about what I wanted out of this. My initial brainstorm was that I want to create a scalar value for an article:

If the article has one tag that I follow, I want to have a 0.8 multiplier.
If it has two or more tags that I follow, I want it to have a 1.0 multiplier.
If it has no articles that I follow, I want it to have a 0.1 multiplier.

Using the above, I then want to select all articles that were published within the last 7 days. Articles published today have a score of 1. All other articles have a score of 1 - ((number of days since published) * 0.1). I want to multiply the above scalar value for the article by the score to create a confidence value. And I want the query sorted from highest confidence value to lowest.

The Challenger

In wanted a query that tells me how many of an article’s tags I follow of the top published articles.

The algorithm embedded in the following SQL statement is as follows:

The highest relevance score is 1.
Anything posted prior to today has a lower cap. One day old is 0.9, two days old is 0.8, etc.
If I don’t follow any of the tags, I’m not overly interested. The more tags that I follow that are part on the article, the higher the value.
If I don’t follow the user, that’s a slight penalty.
If I don’t follow the organization, that’s a slight penalty.
Give a bit more weight to someone I follow leaving a comment.
Give a bit of weight to the # of comments (but don’t over indulge in flame wars)
Really discredit spam.

The scalars that I chose in the SQL statement are arbitrary, and we’d definitely want to tweak the values. The query’s not yet ready for primetime, but I’m curious to see it’s effect.

WITH top_articles AS (
  SELECT articles.id,
    ( (1 - (current_date - published_at::date) * 0.05) *
      (CASE articles.spaminess_rating
         WHEN 0 THEN 1
         ELSE 0.05 END) *
      (CASE ABS(articles.experience_level_rating - 5)
         WHEN 0 THEN 1
         WHEN 1 THEN 0.98
         WHEN 2 THEN 0.97
         WHEN 3 THEN 0.96
         WHEN 4 THEN 0.95
         WHEN 5 THEN 0.94
         ELSE 0.93 END) *
      (CASE COUNT(comments.id)
         WHEN 0 THEN 0.95
         WHEN 1 THEN 0.98
         WHEN 2 THEN 0.99
         ELSE 1 END) *
       (CASE COUNT(followed_tags.follower_id)
         WHEN 0 THEN 0.4
         WHEN 1 THEN 0.9
         ELSE 1 END) *
       (CASE articles.comments_count
         WHEN 0 THEN 0.9
         WHEN 1 THEN 0.94
         WHEN 2 THEN 0.95
         WHEN 3 THEN 0.98
         WHEN 4 THEN 0.999
         ELSE 1 END) *
       (CASE COUNT(followed_user.follower_id)
         WHEN 0 THEN 0.8
         WHEN 1 THEN 1
         ELSE 1 END) *
       (CASE COUNT(followed_org.follower_id)
            WHEN 0 THEN 0.95
            WHEN 1 THEN 1
            ELSE 1 END)) AS relevance_score,
articles.experience_level_rating,
articles.published_at,
articles.comments_count,
STRING_AGG(tags.name, ', ') AS tag_names,
articles.title
FROM articles
LEFT OUTER JOIN taggings
  ON taggings.taggable_id = articles.id
    AND taggable_type = 'Article'
INNER JOIN tags
  ON taggings.tag_id = tags.id
LEFT OUTER JOIN follows AS followed_tags
  ON tags.id = followed_tags.followable_id
    AND followed_tags.followable_type = 'ActsAsTaggableOn::Tag'
    AND followed_tags.follower_type = 'User'
    AND followed_tags.follower_id = :user_id
LEFT OUTER JOIN follows AS followed_user
  ON articles.user_id = followed_user.followable_id
    AND followed_user.followable_type = 'User'
    AND followed_user.follower_id = :user_id
    AND followed_user.follower_type = 'User'
LEFT OUTER JOIN follows AS followed_org
  ON articles.organization_id = followed_org.followable_id
  AND followed_org.followable_type = 'Organization'
  AND followed_org.follower_id = :user_id
  AND followed_org.follower_type = 'User'
LEFT OUTER JOIN comments
  ON comments.commentable_id = articles.id
    AND comments.commentable_type = 'Article'
    AND followed_user.followable_id = comments.user_id
    AND followed_user.followable_type = 'User'
LEFT OUTER JOIN user_blocks
  ON user_blocks.blocked_id = articles.user_id
    AND user_blocks.blocked_id IS NULL
    AND user_blocks.blocker_id = :user_id
WHERE published = true
GROUP BY articles.id, articles.title, articles.published_at, articles.comments_count, articles.experience_level_rating, articles.spaminess_rating
ORDER BY relevance_score DESC, articles.published_at DESC
LIMIT 25)
SELECT articles.* FROM articles INNER JOIN top_articles ON top_articles.id = articles.id;

The end result is a relevance_score between 0 and 1 (well I suppose there could be negative numbers but that’s if we have even older articles).

There are more things I want to consider, and could add into the above query:

I need to account for what to do with moderator’s up and down voting.
I need to account for folks who’ve given a negative weight to a tag.

Next Steps

I also went ahead and submitted a draft PR to begin having more conversations.

DEV Community

Practicing Postgresql and Postulating (Im)Provements

Returning Again to Databases

The Reigning Champion

And in This Corner

The Challenger

Next Steps

Top comments (0)

Read next

The 9 types of Programmers in 2024

Saga Pattern in Microservices - Parte 4

4 Best Practices for Backend Developers

Code of game engine written in Java: what does it hide?