Just found something very interesting in PostgreSQL, thanks to Claude.
I have a really nasty query on some huge tables. Here's the basic layout that we're concerned with:
class RawRoyaltyRecord < ApplicationRecord
belongs_to :royalty_input_batch_partial
belongs_to :track
has_many :raw_royalty_records_sales
end
class RoyaltyInputBatchPartial < ApplicationRecord
belongs_to :pro
has_many :raw_royalty_records
end
class RawRoyaltyRecordsSale < ApplicationRecord
belongs_to :raw_royalty_record
belongs_to :sale
end
Don't worry about "Pro" and "Sale" - we just need their IDs.
The concept is this. We get spreadsheets in every quarter or so with new raw_royalty_records. I have to match them to sales, which can be a bit tricky. Because of this trickiness, the simplest way to match them is to look at previously matched raw_royalty_records from the same Pro (through the royalty_input_batch_partials
table) that have matching track_id
and customer
values.
Note that the customer
value is stripped of extraneous leading/trailing spaces before being saved the in the database, and we have an index on lower(customer)
to allow us to search quickly for them.
This thing was dog slow. It started out fine, but as we've added millions of records it got slower.
The way this worked was that we'd go get a list of prior matched raw_royalty_records and create a hash to look up the sale_ids based on the customer and track_id:
@pro = Pro.find(params[:pro_id])
@raw_royalty_records = @pro.raw_royalty_records
.sans_sales
.where("track_id is not null")
.includes(:raw_royalty_records_sales, track: :library)
@previously_matched_raw_royalty_records =
@pro.raw_royalty_records.with_sales.where("track_id is not null")
.select(:customer, :track_id, :id).distinct.includes(:sales).each_with_object({}) do |rrr, hash|
hash[rrr.track_id] ||= {}
hash[rrr.track_id][rrr.customer.strip.downcase] = rrr.sales
end
# Later on, looping through @raw_royalty_records
@matches[track_id] ||= {}
previously_matched = @previously_matched_raw_royalty_records.dig(track_id, customer_key)
if previously_matched
@matches[track_id][customer_key] = { match_type: 'previously_matched', sales: previously_matched }
#break
end
Not the most beautiful code I've written, but you get the idea.
That worked great when there were a couple hundred thousand records, but now we have a hundred and something thousand unmatched records, and many matched records. Looking in the database, for one particular PRO that I'm testing with has around 287,000 records total, with 259,000 not connected to a sale. So, our query there was pulling in 259,000 records total along with some associated records. That requires CPU and memory. Way too much.
But I realized quickly that I can match up the records in an SQL query. It's ugly, but using a common table expression takes a lot of the complexity out and works well because it needs to compute certain expressions only a single time.
WITH eligible_records AS (
SELECT DISTINCT rrr.track_id, LOWER(rrr.customer) AS lower_customer
FROM raw_royalty_records rrr
INNER JOIN royalty_input_batch_partials ribp ON ribp.id = rrr.royalty_input_batch_partial_id
INNER JOIN raw_royalty_records_sales rrrs ON rrrs.raw_royalty_record_id = rrr.id
WHERE ribp.pro_id = 960
AND rrr.track_id IS NOT NULL
)
SELECT rr.*
FROM raw_royalty_records rr
INNER JOIN royalty_input_batch_partials ribp ON rr.royalty_input_batch_partial_id = ribp.id
LEFT OUTER JOIN raw_royalty_records_sales rrs ON rrs.raw_royalty_record_id = rr.id
WHERE ribp.pro_id = 960
AND rrs.id IS NULL
AND rr.track_id IS NOT NULL
AND EXISTS (
SELECT 1
FROM eligible_records er
WHERE er.track_id = rr.track_id
AND er.lower_customer = LOWER(rr.customer)
);
That's awesome, but what I really want to do is just create the new records in raw_royalty_records_sales
. But I also want to get a list of what was just created. Technically, I could do that by saving the highest id in raw_royalty_records_sales
, then looking at the row count. This thing shouldn't run overlapping (famous last words), but I'm not going to rely on such silliness.
This is where common table expressions shine.
WITH eligible_records AS ( -- This gets a list of existing track_id, customer, and sale_ids
SELECT DISTINCT rrr.track_id, LOWER(rrr.customer) AS lower_customer, rrrs.sale_id
FROM raw_royalty_records rrr
INNER JOIN royalty_input_batch_partials ribp ON ribp.id = rrr.royalty_input_batch_partial_id
INNER JOIN raw_royalty_records_sales rrrs ON rrrs.raw_royalty_record_id = rrr.id
WHERE ribp.pro_id = 960
AND rrr.track_id IS NOT NULL
),
inserted_records AS (
INSERT INTO raw_royalty_records_sales (raw_royalty_record_id, sale_id, created_at, updated_at)
SELECT DISTINCT rr.id, er.sale_id, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
FROM raw_royalty_records rr
INNER JOIN royalty_input_batch_partials ribp ON rr.royalty_input_batch_partial_id = ribp.id
LEFT OUTER JOIN raw_royalty_records_sales rrs ON rrs.raw_royalty_record_id = rr.id
INNER JOIN eligible_records er ON er.track_id = rr.track_id and er.lower_customer = LOWER(rr.customer)
WHERE ribp.pro_id = 960
AND rrs.id IS NULL
AND rr.track_id IS NOT NULL
RETURNING *
)
SELECT ir.raw_royalty_record_id, ir.sale_id, rrr.track_id, rrr.customer
FROM inserted_records ir INNER JOIN raw_royalty_records rrr
ON rrr.id=ir.raw_royalty_record_id
That's a very similar query, but now it's using INSERT INTO...RETURNING *
- in a CTE. So the actual query that is run selects from the items that were just inserted, joins them with raw_royalty_records
record to get the track_id
and customer
, and returns the records in a normalized fashion.
I have 3448 raw_royalty_records
records that can match up like this. Queries that took multiple seconds now take around a second, including the insert. This is a huge win.
Unfortunately, I'm not really using a lot of the cool Railsy ActiveRecord goodies to make this more readable, but that's what good documentation covers. I've sped this up by a couple of orders of magnitude and made it use 1% of the memory of the original. This is a huge win. I can handle an ugly query with those numbers.
Top comments (0)