Handling huge data on Postgres

#postgres #devops #database #architecture

OK, once you're here, you already know about postgres! So, let's dive right into the topic we're looking into, how to handle a particular table when it blows onto your face.🥴

Let me just brief about my experience with huge load of data and how we optimized it. During the final semester project that we did, we had almost up to around 20 Million wifi probe records and counting in a particular table that we chose to call "Wifis" (sequelize generated it!).

Problem 1

First issue started when we were just trying to count the number of probe requests we're getting in real time and display the total count onto a grafana dashboard(which looks relatively simple),until when we found out that querying on a 20 million record table with SELECT COUNT(*) FROM "Wifis";
was a working but slow solution to the problem since, it almost took about 5 seconds! to get back the count.

Well, after a few searches found out why this was something that was built into postgres. PostgreSQL uses multiversion concurrency control (MVCC) to ensure consistency between simultaneous transactions. This means each transaction may see different rows and different numbers of rows in a table. There is no single universal row count that the database could cache, so it must scan through all rows counting how many are visible. Performance for an exact count grows linearly with table size.

So, to overcome that, we actually setup a second new table which would just hold the counts of all the necessary things that needed to be shown on the dashboard and keeps updating after every transaction on the "Wifis" table. How do we achieve this?

Enter TRIGGERS....

So the plan was simple

create a second table called count_table

CREATE TABLE row_counts (
    relationname text PRIMARY KEY,
    relationtuples bigint
);

Initialize that table with the current status/count of the records in the "Wifis" table.

INSERT INTO row_counts (relationname, relationtuples)
VALUES (
        'wifiProbes',
        (
            SELECT count(*)
            from "Wifis"
        )
    );

create a TRIGGER function that adjusts the count of row_counts table whenever there is action on the "Wifis" table.

CREATE OR REPLACE FUNCTION adjust_count()
RETURNS TRIGGER AS
$$
   DECLARE
   BEGIN
   IF TG_OP = 'INSERT' THEN
      EXECUTE 'UPDATE row_counts set relationtuples=relationtuples +1 where relationname = ''' || TG_RELNAME || '''';
      RETURN NEW;
   ELSIF TG_OP = 'DELETE' THEN
      EXECUTE 'UPDATE row_counts set relationtuples=relationtuples -1 where relationname = ''' || TG_RELNAME || '''';
      RETURN OLD;
   END IF;
   END;
$$
LANGUAGE 'plpgsql';

Next create TRIGGER on the "Wifis" table

CREATE TRIGGER probes_count BEFORE
INSERT
    OR DELETE ON "Wifis" FOR EACH ROW EXECUTE PROCEDURE adjust_count();

So the next time you just want to get the count of the objects in your huge table, just query the relatively smaller table row_counts which is updated everytime an operation occurs on the main table, hence bringing down your COUNT queries from 5 Seconds! to a mere milliseconds.🚀

Problem 2

The other issues we found during this exercise is that we were querying that exact same table "Wifis" to display some data on the frontend, we had a necessity to ORDER them by timestamp.

First query we ran took almost 25 Seconds! to respond and by that time the frontend HTTP request already had timed out.

Further testing what was actually taking up time, we found that postgres actually took so long to sort them and respond.(you can see on that more below)

pi_production=# select * from "Wifis" where "timestamp" >= timestamp '2020-06-01T18:30:00.000Z'  and "timestamp" <= timestamp '2020-06-03T15:11:06.276Z' order BY "timestamp" limit 50
                               QUERY PLAN
----------------------------------------------------------------------------------
 Limit (cost=185076.20..185076.23 rows=50 width=11)
       (actual time=896.739..896.740 rows=50 loops=1)
        -> Sort (cost=185076.20..197576.20 rows=20789184 width=11)
                (actual time=896.737..896.738 rows=10 loops=1)
           Sort Key: x
           Sort Method: top-N heapsort Memory: 25kB
           -> Seq Scan on t_test (cost=0.00..77028.00 rows=20789184 width=11) 
                                 (actual time=1.154..282.408 rows=20789184 loops=1)
Planning time: 0.087 ms
Execution time: 24.768 s
(7 rows)

As you can see, going through/scanning through in layman terms took a whole lot of time.

Reading more about it seemed to point out the issue. Since, these rows are not in memory for postgres to process them, it actually has to get them from disk, load them onto memory and then process them.So, to actually speed it up, all we have to do is just hold them in memory. So INDEX them in postgres. i.e INDEX the column that you'd want to use more often, in my case it was timestamp since we would order the items in the table more often. In case you'd search for an item more often, INDEX the id.

pi_production=# create index timestamp_index on "Wifis"(timestamp);
CREATE INDEX
Time: 40252.817 ms

after creating the index, the same query of ORRDER BY timestamp took a mere 6.808ms all the way from ~25 Seconds.🚀

pi_production=# explain analyze select * from "Wifis" where "timestamp" >= timestamp '2020-06-01T18:30:00.000Z'  and "timestamp" <= timestamp '2020-06-03T15:11:06.276Z' order BY "timestamp" limit 50;
                                                                              QUERY PLAN                                               

---------------------------------------------------------------------------------------------------------------------------------------
--------------------------------
 Limit  (cost=0.56..169.98 rows=50 width=268) (actual time=0.240..0.316 rows=50 loops=1)
   ->  Index Scan using timestamp_index on "Wifis"  (cost=0.56..955416.29 rows=281963 width=268) (actual time=0.227..0.294 rows=50 loop
s=1)
         Index Cond: (("timestamp" >= '2020-06-01 18:30:00'::timestamp without time zone) AND ("timestamp" <= '2020-06-03 15:11:06.276'
::timestamp without time zone))
 Planning Time: 2.482 ms
 Execution Time: 0.475 ms
(5 rows)

Phew!! That was a lot of searching and implementing, remember when we have a huge table that often has operations on it and if you want it to be fast(Duh!) always INDEX them.🤗

If there's more problems with handling huge data on postgres, please be a star and drop them in the discussions, so I'm prepared to handle them next time.(Thanks to you)

Top comments (4)

Max • Jul 28 '21

I wonder if your solution to Problem 1 transferred the work from select to trigger and possibly even more so. What I mean is the total time it takes to maintain the count using a trigger may be more than doing a total recount once in a while.
Is there a way to fetch an approximate number of rows? It maintains the stats somewhere.

Pavan Kumar • Mar 13 '22

I agree, but the reason to prefer a trigger over count is that, if the count occurs all the time, it's better to take a hit in terms of triggers, rather than having a slower performance on every read.