Jeremy Stuckey for Optoro

Vacuum after you tidy up

A story about a data migration, a sudden performance issue, and how we fixed it.

Here at Optoro, we collect a lot of data from our clients. In an ideal world, all this data conforms neatly to our schema, but that's often not the case. A common technique we use in our Rails applications is to store extra data in a "catch-all" database column such as a serialized hash or a jsonb column. We prefer not to drop any client data on the floor in case we need it later.

In this example, we have an ActiveRecord model named OrderItem with a postgres jsonb attribute called extra_data. The code that populates extra_data looks something like this:

# Split the incoming attributes into those that map to real columns on
# order_items and everything else, which gets tucked into the jsonb column.
known_attributes = attributes.slice(*OrderItem.column_names.map(&:to_sym))
extra_attributes = attributes.except(*OrderItem.column_names.map(&:to_sym))

known_attributes[:extra_data] = extra_attributes

OrderItem.create(known_attributes)

Storing extra fields in this way allows us to revisit this data if the need arises. Occasionally, we realize one of these client-specific fields is a good fit for our domain model. When this happens, we want to turn it into a proper database column of its own. The procedure is fairly simple:

  1. Run a database migration to add the new column
  2. Run a backfill script to move data from the jsonb column into the new column (sketched below)
  3. Change the code to start using the new column

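Roughly, steps 1 and 2 look like this. This is a simplified sketch rather than our exact migration and script; in particular, the migration version, the jsonb key name, and the batching details are assumptions.

# Step 1: add the column, plus an index since we expect to query by it.
class AddPurchaseOrderNumberToOrderItems < ActiveRecord::Migration[5.2]
  def change
    add_column :order_items, :purchase_order_number, :string
    add_index  :order_items, :purchase_order_number
  end
end

# Step 2: backfill from the jsonb column in batches (find_each loads
# records 1,000 at a time) so we don't hold millions of rows in memory.
OrderItem.where(purchase_order_number: nil).find_each do |item|
  value = item.extra_data&.dig("purchase_order_number")
  item.update_column(:purchase_order_number, value) if value.present?
end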
We performed this procedure recently with a field named purchase_order_number, during which we migrated data for about 4.5 million rows. Not long after, another team in our company told us that querying by this new field was extremely slow. They even produced a graph to back up the claim. The green line shows queries that include purchase_order_number, and the blue line shows queries that exclude it. Querying by purchase_order_number was taking an average of 2.4 seconds. Not good!

[Graph: slow purchase_order_number query response times]
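
In ActiveRecord terms, the slow queries boil down to something like this (a reconstruction; the :order association is an assumption, but the SQL it generates matches what we ran through EXPLAIN below):

OrderItem.joins(:order).where(purchase_order_number: "PO00000001")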

The first thought we had: "surely we indexed that column". A quick inspection of the database schema showed that we had.
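
One quick way to verify, straight from postgres, is the pg_indexes view, which lists the index under the same name that shows up in the query plans below:

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'order_items';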

The next debugging step was to use postgres' EXPLAIN functionality. We whittled down the query to its simplest form and ran it through EXPLAIN ANALYZE.

EXPLAIN ANALYZE select * from order_items
                join orders on order_items.order_id = orders.id
                where order_items.purchase_order_number = 'PO00000001';

                                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=512087.46..1266205.21 rows=23649 width=69) (actual time=2630.284..2813.946 rows=1 loops=1)
   Hash Cond: (order_items.order_id = orders.id)
   ->  Bitmap Heap Scan on order_items  (cost=6553.11..726261.45 rows=341748 width=4) (actual time=0.065..0.067 rows=1 loops=1)
         Recheck Cond: ((purchase_order_number)::text = 'PO00000001'::text)
         Heap Blocks: exact=1
         ->  Bitmap Index Scan on index_order_items_on_purchase_order_number  (cost=0.00..6467.68 rows=341748 width=0) (actual time=0.057..0.057 rows=1 loops=1)
               Index Cond: ((purchase_order_number)::text = 'PO00000001'::text)
   ->  Hash  (cost=444843.38..444843.38 rows=2505917 width=69) (actual time=1557.832..1557.832 rows=2545639 loops=1)
         Buckets: 524288 (originally 524288)  Batches: 16 (originally 8)  Memory Usage: 36865kB
         ->  Bitmap Heap Scan on orders  (cost=54361.42..444843.38 rows=2505917 width=69) (actual time=228.596..694.259 rows=2545639 loops=1)
               Heap Blocks: exact=45059

 Planning time: 0.689 ms
 Execution time: 2821.106 ms
(16 rows)

I don't claim to understand the intricacies of that query plan, but it seemed suboptimal. Note the execution time of 2821.106 ms. That felt like a terribly long execution time to query an indexed column. It also lined up nicely with the response time reported earlier.

After some additional experimentation, a teammate of mine (who is much savvier with postgres) suggested running VACUUM ANALYZE. From the postgres docs:

VACUUM reclaims storage occupied by dead tuples. In normal PostgreSQL operation, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain present until a VACUUM is done. Therefore it's necessary to do VACUUM periodically, especially on frequently-updated tables.

Frequently-updated tables... we just updated about 4.5 million rows when we migrated the purchase_order_number field. Further on in the docs:

ANALYZE: Updates statistics used by the planner to determine the most efficient way to execute a query.

That could explain the suspicious-looking query plan we saw above.

With anticipation and bated breath, we ran VACUUM ANALYZE (it took a while to finish). Afterwards we re-ran our query.

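The command itself is a one-liner; scoped to the table we had just churned, it looks something like this:

VACUUM ANALYZE order_items;

Here's the same query again, post-vacuum: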
EXPLAIN ANALYZE select * from order_items
                join orders on order_items.order_id = orders.id
                where order_items.purchase_order_number = 'PO00000001';

                                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=1.01..2496.88 rows=15 width=69) (actual time=0.063..0.066 rows=1 loops=1)
   ->  Index Scan using index_order_items_on_purchase_order_number on order_items  (cost=0.57..692.77 rows=213 width=4) (actual time=0.043..0.044 rows=1 loops=1)
         Index Cond: ((purchase_order_number)::text = 'PO00000001'::text)
   ->  Index Scan using orders_pkey on orders  (cost=0.44..8.46 rows=1 width=69) (actual time=0.016..0.018 rows=1 loops=1)
         Index Cond: (id = order_items.order_id)
 Planning time: 0.501 ms
 Execution time: 0.113 ms
(8 rows)

Our execution time went from 2821.106 ms to 0.113 ms. That is ~25,000 times faster! We solved the case of the slow purchase_order_number query.

VACUUM and ANALYZE are normally handled by postgres' automated maintenance, but a bulk data migration like ours can degrade query performance immediately, before that maintenance catches up. The moral of the story is to remember to vacuum after you tidy up.
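
As an aside, if you want to check when postgres' automated maintenance last visited a table, the pg_stat_user_tables view records the most recent manual and automatic vacuum and analyze times:

SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'order_items';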

UPDATE!

A colleague (and former DBA) pointed out that ANALYZE was likely what resolved this particular issue rather than VACUUM or the combination of the two.

When we added the column and index, postgres gathered statistics on their state. At the time that was about 4.5 million null values, which the query optimizer deemed not so useful. Running ANALYZE after the data migration regathered statistics on the index, which suddenly made it much more appealing to the query optimizer.
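
You can see the statistics the planner works from in the pg_stats view; before the re-ANALYZE, the row for purchase_order_number would have reflected the mostly-null state of the column:

SELECT null_frac, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'order_items'
  AND attname = 'purchase_order_number';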
