Introduction
The SQL provides us many features to query with efficiency: sub-queries, custom joins, aggregations, etc. We can use all of this with an ORM like ActiveRecord
, however often we ignore it. Why? I believe there are three main reasons for this:
- You're not an advanced user of SQL;
- You don't know enough of
ActiveRecord
to use its advanced features; - You simply are attained to some bad habits because of the facilities of your programming language (after all, it's much simpler to iterate over ruby objects doing simply comparisons than thinking in a complex custom join, right?).
All good with that until you don't face the side-effects of poor writing code: memory leak and low performance.
This post has the intention of deepening a little bit more in what ActiveRecord
can do to help you avoid the problems mentioned above, and benchmark the ActiveRecord
approach with some other solution using a ruby Enumerable
.
Note: If you wanna follow along with the coding of this post, you can use this script with all the data, queries, and comparisons I used.
Domain
In this post, a simplified version of a Cryptocurrency Exchange will be used as a domain. To better understand what it's being queried in the next sections, here is an explanation of the meaning of each table:
- networks: The networks with a running blockchain, each one working with a specific set of currencies;
- currencies: Cryptocurrencies;
- currency_networks: An association table linking a currency to a network, meaning that this currency can be processed in the specified network;
- pairs: A pair of currencies that can be traded, e.g., "BTC-ETH" means "buying ETH (quote currency) with BTC (base currency)";
- tickers: Last 24h information from a specific currency pair.
It's not in the scope of this post to discuss anything related to the right way of modeling this domain. The domain was built just thinking in exemplifying queries.
Querying belongs_to
associations
Consider the following domain models:
class Network < ActiveRecord::Base
has_many :currency_networks
has_many :currencies, through: :currency_networks
end
class Currency < ActiveRecord::Base
has_many :currency_networks
has_many :networks, through: :currency_networks
end
class CurrencyNetwork < ActiveRecord::Base
belongs_to :currency
belongs_to :network
end
Let's say we wanna have all currency networks where the network is active. We can solve this using a ruby Enumerable
:
CurrencyNetwork.all.select { |cn| cn.network.active? }
Although that solves our problem, we're potentially scaling to new ones:
- Our
select
block is hitting the database for eachCurrencyNetwork
we retrieved; - We're retrieving more data than we really need with
cn.network
: we don't wanna retrieve theNetwork
data, just theCurrencyNetwork
. Bare in mind that when we callcn.network
we are instantiating a newActiveRecord
object to store theNetwork
result, consuming more memory than needed.
Those aren't big problems to a situation where we don't have many records in the database, as in the case of the script I'm running to exemplify this command, but you can imagine what that cause for your application depending on how frequent you run this action and the number of rows in your database.
We can improve this by delegating great part of the heavy lifting to our database doing this:
CurrencyNetwork.joins(:network).where(networks: { active: true })
If we append to_sql
at the end of this query we can see the SQL that's producing:
SELECT "currency_networks".*
FROM "currency_networks"
INNER JOIN "networks"
ON "networks"."id" = "currency_networks"."network_id"
joins
is a powerful resource responsible for joining our tables (INNER JOIN
), what we can do because CurrencyNetwork belongs_to :network
(specified on the ON
clause). ActiveRecord
provides us a clean way to access the table attributes from our belongs_to
association with the hash syntax: { networks: { active: true } }
.
What are the benefits of your query approach:
- Performance: We only hit the database once, compared to the potentially thousands of times with the
Enumerable
solution; - Spare memory: We only receive
ActiveRecord
instances fromCurrencyNetwork
instead of theNetwork
instances that came as plus with theEnumerable
solution.
We don't need to rely upon a priori reasoning only, we can use memory_profiles and benchmark_ips to compare the memory consumption and iterations per second of each solution.
Belongs to query - Enumerable solution (memory profiler):
Total allocated: 120236 bytes (1290 objects)
Total retained: 27100 bytes (224 objects)
Belongs to query - ActiveRecord query solution (memory profiler):
Total allocated: 4584 bytes (52 objects)
Total retained: 360 bytes (7 objects)
Considering how the database was populated in the script used for this test, with the query solution we consume ~26x less memory with the total allocations and ~75x less memory with retained allocations.
IPS - Belongs to query - ActiveRecord query solution: 17716.1 i/s
IPS - Belongs to query - Enumerable solution: 656.8 i/s - 26.97x (± 0.00) slower
Comparing the iterations per second, we can check that the Enumerable
solution is 26.97x slower.
Querying has_many
associations
Consider the same domain models of the previous section. We wanna do a slightly similar query that the previous one, but now our result must be the currencies belonging to active networks.
Enumerable solution:
Currency.all.select { |currency| currency.networks.where(active: true).any? }
In this example, I'm doing things propositional worse mixing a query with a where
clause, but this is probably with what you'd come up to if not using joins
.
ActiveRecord
query solution:
Currency
.joins(currency_networks: :network)
.where(networks: { active: true })
You can notice that we use the same interface to query has_many
as the one used with belongs_to
, so there's nothing new here besides the pluralized relationship name in the joins
.
The benefits are the same we saw in the previous section, and we can benchmark both approaches here too:
Has many query - Enumerable solution (memory profiler):
Total allocated: 171024 bytes (2107 objects)
Total retained: 24566 bytes (243 objects)
Has many query - ActiveRecord query solution (memory profiler):
Total allocated: 4464 bytes (50 objects)
Total retained: 280 bytes (6 objects)
We can see an, even more, accentuated memory consumption in this case with the Enumerable
solution. The ActiveRecord
query solution consumed ~38x less memory in the total allocated, and ~87x less memory in the total retained.
IPS - Has many query - ActiveRecord query solution: 23395.8 i/s
IPS - Has many query - Enumerable solution: 569.6 i/s - 41.08x (± 0.00) slower
No surprise, the Enumerator
solution is 41.08x slower.
Querying With Aggregations and Subqueries
Aggregations are actions that imply in perform any mathematical operation in your data, be it counts, sums, averages calculations, or comparisons to retrieve maximums and minimums.
Consider the following domain models:
class Currency < ActiveRecord::Base
has_many :currency_networks
has_many :networks, through: :currency_networks
has_many :pairs_as_base, class_name: "Pair", foreign_key: :base_currency_id
has_many :pairs_as_quote, class_name: "Pair", foreign_key: :quote_currency_id
end
class Pair < ActiveRecord::Base
has_many :tickers
belongs_to :base_currency, class_name: "Currency", foreign_key: :base_currency_id
belongs_to :quote_currency, class_name: "Currency", foreign_key: :quote_currency_id
end
class Ticker < ActiveRecord::Base
belongs_to :pair
validates_presence_of :trade_count
end
Using aggregation we can answer how many trades were performed in all pairs:
Ticker.avg(:trade_count)
# => 458926.666666667
We deliver the heavy lifting of looking at each ticker record and performing the calculation to our database, which is good. At this point, you already have a notion of the performance improvement of ActiveRecord
queries over Enumerable
based solutions to do the same, so I'll spare you of the performance metrics in more simple examples like this one.
Bear in mind the result of this query I added as commentary in the snippet above: it's an integer, not an ActiveRecord
relation. So it's not possible to keep chaining queries after calling an aggregation on that way.
You can expect another outcome if you're using aggregation over grouped columns. Let's say we wanna retrieve the sum of trade_count
for each currency that is a base_currency
of a Pair
:
Currency
.joins(pairs_as_base: :tickers)
.group("currencies.name").sum(:trade_count)
# => {"Bitcoin"=>930518, "Ethereum"=>103466, "Polkadot"=>342796}
As a result, you have a hash with your aggregated answer.
Subqueries
Subqueries are queries inside queries, and can be placed basically anywhere in your query: SELECT
, WHERE
, FROM
, etc. You can perform interesting queries combining aggregations with subqueries.
Let's say we wanna list all pairs are inside tickers that have the trade_count
above the average trade counts of all tickers:
# Hits the database twice
Pair
.joins(:tickers)
.where("trade_count >= :avg", avg: Ticker.average(:trade_count))
# Hits the database only once
Pair
.joins(:tickers)
.where("trade_count >= (:avg)", avg: Ticker.select("AVG(trade_count)"))
Why the first query hits the database twice, and the second query only once? It's a detail of the ActiveRecord
syntax.
Remember what we saw just before: calling the aggregation method as a ruby method directly from ActiveRecord
object returns to us a number, and that can only be done by executing the query. So Ticker.average(:trade_count)
is returning a number, and ActiveRecord
is replacing :avg
with it.
If you wanna have lazy using aggregation inside a query, we explicitly express the aggregation as an SQL function like here Ticker.select("AVG(trade_count)")
. In the example of the single database hit, ActiveRecord
is taking this query and injecting it in place of :avg
. Be aware that here we're calling (:avg)
! That's because a subquery must be evaluated inside ()
, otherwise an error will be raised. Appending to_sql
in the single database hit query we can see the query that's performed:
SELECT "tickers".*
FROM "tickers"
WHERE (trade_count > (SELECT AVG(trade_count) FROM "tickers"))
Following to another query, we could ask now for the tickers that have their trade_count
above or equal the average trade_count
of all listed tickers:
avg_trade_count_sql = Ticker.select("AVG(trade_count)").to_sql
Ticker
.group(:pair_id)
.having("tickers.trade_count >= (#{avg_trade_count_sql})")
Here we use the subquery in the having
clause, producing the following query:
SELECT "tickers".*
FROM "tickers"
GROUP BY "tickers"."pair_id"
HAVING (tickers.trade_count >= (SELECT AVG(trade_count) FROM "tickers"))
Or we could do a query to analyze this info in another way: get the average ticker trade_count
for each pair, and only return the tickers that have trade_count
above or equal the average trade_count
of the tickers belonging to given pair:
avg_trade_count_by_pairs_sql = Ticker.select("pair_id, AVG(trade_count) as average").group(:pair_id).to_sql
Ticker
.joins("INNER JOIN (" + average_trade_count_by_pairs_sql + ") trades ON tickers.pair_id = trades.pair_id")
.where("tickers.trade_count >= trades.average")
This produces the following SQL:
SELECT "tickers".*
FROM "tickers"
INNER JOIN (
SELECT pair_id, AVG(trade_count) as average
FROM "tickers" GROUP BY "tickers"."pair_id"
) trades
ON tickers.pair_id = trades.pair_id
WHERE (tickers.trade_count >= trades.average)
This is a tricky one, so let's break it down to understand how we're doing this. First off, we have our subquery SQL in avg_trade_count_by_pairs_sql
. If executed individually, will return to you an aggregated result like the following:
Each pair_id
is listed with an aggregated result that represents the average of trade_count
of all of the tickets belonging to the pair (i.e., SELECT * FROM tickets WHERE pair_id = [THE PAIR ID]
). This SQL is merged inside the Ticker
custom join (INNER JOIN
) we named as trades
. In the end, we have tickers.trade_count
from the current ticker row being iterated, and trades.average
that we have from our subquery, and represents the average ticker_count
of the ticker pair of the current iteration (association made because of our ON tickers.pair_id = trades.pair_id
from the custom join).
We're already writing custom joins, anything beyond that it's more SQL specifics than ActiveRecord
, so we can end up here. To finish, though, let's benchmark this query with a simple solution using ruby Enumerable
:
tickers = []
Pair.all.each do |pair|
avg = Ticker.where(pair: pair).average(:trade_count)
selectable_tickers = Ticker.where(pair: pair).where("trade_count >= ?", avg)
tickers.push(*selectable_tickers.to_a)
end
Avg trade count by pairs - Enumerable solution (memory profiler)
Total allocated: 72357 bytes (872 objects)
Total retained: 573 bytes (9 objects)
Avg trade count by pairs - ActiveRecord query solution (memory profiler)
Total allocated: 6348 bytes (86 objects)
Total retained: 296 bytes (5 objects)
Our ActiveRecord
solution consumes ~11x less memory in the total allocated, and ~2x less memory in the total retained.
IPS - Avg trade count by pairs - ActiveRecord query solution: 15095.6 i/s
IPS - Avg trade count by pairs - Enumerable solution: 740.6 i/s - 20.38x (± 0.00) slower
Comparing the iterations per second, we can check that the Enumerable
solution is 20.38x slower.
Conclusion
In this post, we saw how to use advanced ActiveRecord
features to go beyond a simple where
clause. We saw how to use ActiveRecord
to query belongs_to
and has_many
associations, using joins
, aggregations, and subqueries, combining all.
We also compared some queries with the equivalent using a ruby Enumerable
based solution. For benchmarking both approaches we checked the memory consumption and iterations per second, which showed without surprise the Enumerable
based solutions consuming in most cases dozens of times more memory because of extra allocations and being also dozens of times slower.
Of course, we should keep in mind that the benchmark was done using only a very small database. We weren't worried about checking how much worse it can get with larger databases, or searching for points of stabilization (a.k.a., "it doesn't get much worse than that"). The idea of benchmarking was to show you how bad it can be when you rely upon ruby Enumerator
to do things that our database is capable of doing through ActiveRecord
, which is a common bad "code smell".
This is it! If you have any comments or suggestions, don't hold back, let me know.
Options if you like my content and would like to support me directly (never required, but much appreciated):
BTC address: bc1q5l93xue3hxrrwdjxcqyjhaxfw6vz0ycdw2sg06
Top comments (2)
Great post really helpful and well written.
I was researching it and I was surprised when I look that you @rwehresmann are the author.
Haha What a coincidence! I'm glad to see that you liked, it's great to see you here