DEV Community

Jeremy Friesen for The DEV Team

Posted on • Originally published at takeonrules.com on

Building and Documenting a Nuanced ActiveRecord Approach

Or How I'm Working to Improve the Feed Relevance of DEV.to

Earlier I wrote Practicing Postgresql and Postulating (Im)Provements hinting at some possible changes regarding Forem’s feed algorithm.

This past week, I've been iterating on a pull request, which at the time of writing is still a Work in Progress (WIP).

Introducing Articles::Feeds::WeightedQueryStrategy

The core concept I introduced was the Articles::Feeds::WeightedQueryStrategy class: a documented and configurable query strategy. Because of the WIP nature, I'm including a link to a GitHub Gist that captures the state of the code as of this writing.

Class-level documentation for Articles::Feeds::WeightedQueryStrategy

@api private

This is an experimental object that we're refining to be a competitor to the existing feed strategies.

It works to implement conceptual parity with two methods of Articles::Feeds::LargeForemExperimental:

  • #default_home_feed
  • #more_comments_minimal_weight_randomized

What do we mean by “conceptual parity”? Those two methods are used in the two feed controllers: StoriesController and Stories::FeedsController. And while they use some of the same internal tooling, there are some notable, subtle differences.

Where this class differs is that it aims to build the feed from the given user's perspective, whereas the other feed algorithm starts with a list of candidates that are global to the given Forem (e.g., starting the base query from articles.score, a volatile and swingy value that favors global reactions over user-desired content).

This is not quite a chronological-only feed, but it could easily be modified to favor that.

@note One possible shortcoming is that the query does not account for the Forem's administrators.

@note For those considering extending this, be very mindful of Structured Query Language (SQL) injection.

Configurable Options

As part of the development process, I extracted the configurable scoring methods.

Top-level documentation for scoring methods

This constant defines the allowable relevance scoring methods.

A scoring method should be a SQL fragment that produces a value between 0 and 1. The closer the value is to 1, the more relevant the article is for the given user. Note: the values are multiplicative. Make sure to consider if you want a 0 multiplier for your score. Aspirationally, you may want to think of the relevance_score as the range (0,1]. That is greater than 0 and less than or equal to 1.
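To make the multiplicative behavior concrete, here's a small arithmetic sketch (the factor values are made up for illustration):

```ruby
# Hypothetical per-method relevance factors for a single article;
# the final relevance score is their product.
factors = [0.5, 1.0, 0.75]
relevance_score = factors.reduce(:*)
puts relevance_score  # => 0.375

# Because the factors multiply, a single 0 zeroes the article out
# entirely -- hence the suggestion to aim for the range (0, 1].
puts [0.5, 0, 0.75].reduce(:*)  # => 0.0
```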

In addition, as part of initialization, the caller can configure each of the scoring methods cases and fallback.

Each scoring method has the following keys:

clause
The SQL clause statement; note: there exists a coupling between the clause and the SQL fragments that join the various tables. Also, under no circumstances should you allow any user value for this, as it is not something we can sanitize.
cases
An Array of Arrays, the first value is what matches the clause, the second value is the multiplicative factor.
fallback
When no case is matched use this factor.
requires_user
Does this scoring method require a given user? If it does and we don't have a user, don't use it.
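As a sketch of how these keys might fit together (the names and values here are hypothetical, not Forem's actual configuration), a scoring method's cases and fallback can be rendered into a SQL CASE expression that yields the multiplicative factor:

```ruby
# Hypothetical scoring-method configuration; not the Forem implementation.
FOLLOWING_AUTHOR_FACTOR = {
  # Never interpolate user input into :clause; it cannot be sanitized.
  clause: "COUNT(followed_user.follower_id)",
  cases: [[0, 0.8], [1, 1.0]], # [matched value, multiplicative factor] pairs
  fallback: 1.0,               # factor when no case matches
  requires_user: true          # skip this method when we have no user
}.freeze

# Render a scoring method's cases and fallback into a CASE expression.
def to_case_fragment(config)
  whens = config[:cases].map { |value, factor| "WHEN #{value} THEN #{factor}" }
  "(CASE #{config[:clause]} #{whens.join(' ')} ELSE #{config[:fallback]} END)"
end

puts to_case_fragment(FOLLOWING_AUTHOR_FACTOR)
# => "(CASE COUNT(followed_user.follower_id) WHEN 0 THEN 0.8 WHEN 1 THEN 1.0 ELSE 1.0 END)"
```

Each rendered CASE expression evaluates to one factor per article row, and the query multiplies those expressions together to produce the relevance score.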

The configurable options as of this writing are:

daily_decay_factor
Weight to give based on the age of the article.
comment_count_by_those_followed_factor
Weight to give for the number of comments on the article from other users that the given user follows.
comments_count_factor
Weight to give to the number of comments on the article.
experience_factor
Weight to give based on the difference between experience level of the article and given user.
following_author_factor
Weight to give when the given user follows the article's author.
following_org_factor
Weight to give when the given user follows the article's organization.
latest_comment_factor
Weight to give an article based on its most recent comment.
matching_tags_factor
Weight to give for the number of intersecting tags the given user follows and the article has.
reactions_factor
Weight to give for the number of reactions on the article.
spaminess_factor
Weight to give based on spaminess of the article.

I’ve structured the code to allow for the initializer of the Articles::Feeds::WeightedQueryStrategy to configure which methods to use as well as the factors. The idea being that we can easily iterate on feed refinement and even open the door for site-wide configuration of these scoring methods.
But that’s a future exercise.
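The configuration idea can be sketched in miniature. Assuming (hypothetically) an initializer that accepts a config hash and merges it over the defaults, per-caller overrides fall out of a simple Hash#merge; the class name, keys, and values here are illustrative, not Forem's actual API:

```ruby
# A minimal sketch of a configurable strategy; not Forem's implementation.
class FeedStrategySketch
  # Hypothetical defaults, keyed by scoring-method name.
  DEFAULT_CONFIG = {
    daily_decay_factor: { fallback: 1.0 },
    matching_tags_factor: { fallback: 1.0 }
  }.freeze

  attr_reader :config

  def initialize(config: {})
    # Caller-supplied options override the defaults key by key.
    @config = DEFAULT_CONFIG.merge(config)
  end
end

strategy = FeedStrategySketch.new(config: { matching_tags_factor: { fallback: 0.5 } })
puts strategy.config[:matching_tags_factor][:fallback]  # => 0.5
puts strategy.config[:daily_decay_factor][:fallback]    # => 1.0
```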

Again, the goal of these scoring methods is to rank articles against the user’s apparent preferences.
For astute readers and those following the code: I've given no consideration to the weights a user has assigned to tags.

ActiveRecord Antics

When I was iterating on the implementation, I was writing and refining lots of SQL. I wrote a handful of RSpec specs that verified I had valid queries. And as I drew closer to testing this in the User Interface (UI), I knew that I had one significant problem to address.

I wanted to return an ActiveRecord::Relation object; that’s an object from which you can chain ActiveRecord::Base.scope calls and other ActiveRecord::Query methods.

The reason being that I really wanted to re-use two Article methods: .limited_column_select and .includes(top_comments: :user).

In the case of .limited_column_select, I wanted to ensure that I wasn’t returning all of the columns from the articles table. I wanted to avoid duplicating the knowledge of what fields should be included in the result set.
A lot has been written about Don't Repeat Yourself (DRY) principles, but I believe it's more important to focus on “Don't Repeat Knowledge.”

More important was wanting to re-use .includes(top_comments: :user). Without those eager includes, when it came time to render the feed, each result would query for the top comment and its associated user. So the naive implementation would result in 2N+1 queries, where N was the number of articles in the result set.
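The query arithmetic can be sketched with a toy model (not ActiveRecord itself), assuming eager loading batches the top comments and their users into one query each:

```ruby
# A toy model of the 2N+1 problem described above: without eager
# loading, each article in the page triggers one query for its top
# comment and one for that comment's user.
def queries_without_eager_loading(article_count)
  1 + (article_count * 2)  # 1 feed query + 2 extra queries per article
end

# With .includes(top_comments: :user), ActiveRecord batches the
# associated records; here we model that as a fixed three queries
# (feed + one comments query + one users query).
def queries_with_eager_loading(_article_count)
  3
end

puts queries_without_eager_loading(30)  # => 61
puts queries_with_eager_loading(30)     # => 3
```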

My solution was to perform some ActiveRecord antics. I had previously done extensive antics in my query implementations of Sipity. Fortunately, for the Forem implementation I didn’t need to dive deep into Arel.

Below is the part of the implementation on which I want to focus; the preceding numbers are the line numbers I’ll use in the example.
Remember you can see the whole class at this Gist.

1 Article.where(
2   Article.arel_table[:id].in(
3     Arel.sql(
4       Article.sanitize_sql(unsanitized_sub_sql)
5     )
6   )
7 ).limited_column_select.includes(top_comments: :user).order(published_at: :desc)


Let’s work from the inside out.

Starting with line 4: Article.sanitize_sql(unsanitized_sub_sql). The unsanitized_sub_sql is the SQL that is built from the scoring method configuration and the necessary INNER JOIN and LEFT OUTER JOIN to build the query.
It is a non-trivial query, and writing it mostly by hand made this easier to implement.

The Article.sanitize_sql call ensures that we have sanitized the output.
I believe my implementation has avoided any SQL injection vectors.
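To illustrate what that sanitization buys us, here's a toy version (not ActiveRecord's actual sanitize_sql) that quotes each value before splicing it into a fragment, so user input can't terminate the string and inject SQL:

```ruby
# Toy placeholder substitution, illustrative only. Real code should
# rely on ActiveRecord's sanitize_sql rather than rolling its own.
def sanitize_fragment(fragment, *values)
  values.each do |value|
    quoted =
      case value
      when Numeric then value.to_s
      else "'#{value.to_s.gsub("'", "''")}'"  # double single quotes, as SQL expects
      end
    fragment = fragment.sub("?", quoted)  # fill the next placeholder
  end
  fragment
end

puts sanitize_fragment("articles.user_id = ? AND articles.title = ?", 42, "O'Brien's post")
# => "articles.user_id = 42 AND articles.title = 'O''Brien''s post'"
```

The embedded quote in "O'Brien's post" gets escaped rather than closing the string literal, which is the whole point of the exercise.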

Moving out to line 3: Arel.sql marks the resulting string as safe SQL. Without this step, the sanitized SQL result will be treated as NULL.

Line 3 returns a valid sub-query with a SQL select clause of SELECT articles.id.

Moving out to line 2: the Article.arel_table[:id].in resolves to the SQL fragment: articles.id IN (sub-query).

Line 2 is the inflection point that moves us from hand-written SQL into the ActiveRecord::Querying module space.

Moving out to line 1: this is where we now use the ActiveRecord::Querying.where method, which is very familiar to Ruby on Rails developers.

Moving on to line 7: because we now have an ActiveRecord::Relation object, we can chain model scopes and get all of the ActiveRecord goodness.

Conclusion

There’s quite a lot going on with my proposed change, but I wanted to share two relevant bits that might make you curious to take a look at the implementation details.

I’m uncertain when we’ll be deploying this change, as I need to wire in some performance instrumentation to compare against the original strategy.

All of this is in service of trying to improve the baseline feed experience and bring some further insight and clarity into how things make it into a given user’s feed.
