loading...
Cover image for Postgres Is Underrated—It Handles More than You Think
Heroku

Postgres Is Underrated—It Handles More than You Think

mostlyjason profile image Jason Skowronski Updated on ・9 min read

Thinking about scaling beyond your Postgres cluster and adding another data store like Redis or Elasticsearch? Before adopting a more complex infrastructure, take a minute and think again. It’s quite possible to get more out of an existing Postgres database. It can scale for heavy loads and offers powerful features which are not obvious at first sight. For example, its possible to enable in-memory caching, text search, specialized indexing, and key-value storage.

After reading this article, you may want to list down the features you want from your data store and check if Postgres will be a good fit for them. It’s powerful enough for most applications.

Why Adding Another Data Store is Not Always a Good Idea

As Fred Brooks put it in The Mythical Man-Month: "The programmer, like the poet, works only slightly removed from pure thought-stuff. [They] build castles in the air, from air, creating by exertion of the imagination."

Adding more pieces to those castles, and getting lost in the design, is endlessly fascinating; however, in the real world, building more castles in the air can get in your way. The same holds true for the latest hype in data stores. There are several advantages to choosing boring technology:

  • If someone new joins your team, can they easily make sense of your different data stores?
  • When you or another team member come back a year later, could they quickly pick up how the system works?
  • If you need to change your system or add features, how many pieces do you have to move around?
  • Have you factored in maintenance costs, security, and upgrades?
  • Have you accounted for the unknowns and failure modes when running your new data store in production at scale?

Although it can be managed by thoughtful design, adding multiple datastores does increase complexity. Before exploring adding additional datastores, it's worth investigating what additional features your existing datastores can offer you.

Lesser-known but Powerful Features of Postgres

Many people are unaware that Postgres offers way more than just a SQL database. If you already have Postgres in your stack, why add more pieces when Postgres can do the job?

Postgres caches, too

There’s a misconception that Postgres reads and writes from disk on every query, especially when users compare it with purely in-memory data stores like Redis.

Actually, Postgres has a beautifully designed caching system with pages, usage counts, and transaction logs. Most of your queries will not need to access the disk, especially if they refer to the same data over and over again, as many queries tend to do.

The shared_buffer configuration parameter in the Postgres configuration file determines how much memory it will use for caching data. Typically it should be set to 25% to 40% of the total memory. That’s because Postgres also uses the operating system cache for its operation. With more memory, most recurring queries referring the same data set will not need to access the disk. Here is how you can set this parameter in the Postgres CLI:

ALTER SYSTEM SET shared_buffer TO = <value>

Managed database services like Heroku offer several plans where RAM (and hence cache) is a major differentiator. The free hobby version does not offer dedicated resources like RAM. Upgrade when you’re ready for production loads so you can make better use of caching.

You can also use some of the more advanced caching tools. For example, check the pg_buffercache view to see what’s occupying the shared buffer cache of your instance. Another tool to use is the pg_prewarm function which comes as part of the base installation. This function enables DBAs to load table data into either the operating system cache or the Postgres buffer cache. The process can be manual or automated. If you know the nature of your database queries, this can greatly improve application performance.

For the really brave at heart, refer to this article for an in-depth description of Postgres caching.

Text searching

Elasticsearch is excellent, but many use cases can get along just fine with Postgres for text searching. Postgres has a special data type, tsvector, and a set of functions, like to_tsvector and to_tsquery, to search quickly through text. tsvector represents a document optimized for text search by sorting terms and normalizing variants. Here is an example of the to_tsquery function:

SELECT to_tsquery('english', 'The & Boys & Girls');

  to_tsquery   
---------------
 'boy' & 'girl'

You can sort your results by relevance depending on how often and which fields your query appeared in the results. For example, you can make the title more relevant than the body. Check the Postgres documentation for details.

Functions in Postgres

Postgres provides a powerful server-side function environment in multiple programming languages.

Try to pre-process as much data as you can on the Postgres server with server-side functions. That way, you can cut down on the latency that comes from passing too much data back and forth between your application servers and your database. This approach is particularly useful for large aggregations and joins.

What’s even better is your development team can use its existing skill set for writing Postgres code. Other than the default PL/pgSQL (Postgres’ native procedural language), Postgres functions and triggers can be written in PL/Python, PL/Perl, PL/V8 (JavaScript extension for Postgres) and PL/R.

Here is an example of creating a PL/Python function for checking string lengths:

CREATE FUNCTION longer_string_length (string1 string, string2 string)
  RETURNS integer
AS $$
  a=len(string1)
  b=len(string2)
  if a > b:
    return a
  return b
$$ LANGUAGE plpythonu;

Postgres offers powerful extensions

Extensions are to Postgres what plug-ins mean in many applications. Suitable use of Postgres extensions can also mean you don’t have to work with other data stores for extra functionality. There are many extensions available and listed on the main Postgres website.

Geospatial Data

PostGIS is a specialized extension for Postgres used for geospatial data manipulation and running location queries in SQL. It’s widely popular among GIS application developers who use Postgres. A great beginner’s guide to using PostGIS can be found here.

The code snippet below shows how we are adding the PostGIS extension to the current database. From the OS, we run these commands to install the package (assuming you are using Ubuntu):

$ sudo add-apt-repository ppa:ubuntugis/ppa
$ sudo apt-get update
$ sudo apt-get install postgis

After that, log in to your Postgres instance and install the extension:

CREATE EXTENSION postgis;
CREATE EXTENSION postgis_topology;

If you want to check what extensions you have in the current database, run this command:

SELECT * FROM pg_available_extensions;

Key-Value Data Type

The Postgres hstore extension allows storing and searching simple key-value pairs. This tutorial provides an excellent overview of how to work with hstore data type.

Semi-structured Data Types

There are two native data types for storing semi-structured data in Postgres: JSON and XML. The JSON data type can host both native JSON and its binary form (JSONB). The latter can significantly improve query performance when it is searched. As you can see below, it can convert JSON strings to native JSON objects:

SELECT '{"product1": ["blue", "green"], "tags": {"price": 10, "discounted": false}}'::json;

json                       
---------------------------------------------------------------------
 {"product1": ["blue", "green"], "tags": {"price": 10, "discounted": false}}

Tips for Scaling Postgres

If you’re considering switching off Postgres due to performance reasons, first see how far you can get with the optimizations it offers. Here we'll assume you've done the basics, like creating appropriate indexes. Postgres offers many advanced features, and while the changes are small they can make a big difference, especially if it keeps you from complicating your infrastructure.

Don’t over-index

Avoid unnecessary indexes. Use multi-column indexes sparingly. Too many indexes take up extra memory that crowd out better uses of the Postgres cache, which is crucial for performance.

Using a tool like EXPLAIN ANALYZE might surprise you by how often the query planer actually chooses sequential table scans. Since much of your table’s row data is already cached, oftentimes these elaborate indexes aren’t even used.

That said, if you do find slow queries, the first and most obvious solution is to see if the table is missing an index. Indexes are vital, but you have to use them correctly.

Partial indexes save space

A partial index can save space by specifying which values get indexed. For example, you want to order by a user’s signup date, but only care about the users who have signed up:

CREATE INDEX user_signup_date ON users(signup_date) WHERE is_signed_up;

Understanding Postgres index types

Choosing the right index for your data can improve performance. Here are some common index types and when you should use each one.

  • B-tree indexes B-tree indexes are balanced trees that are used to sort data efficiently. They’re the default if you use the INDEX command. Most of the time, a B-tree index suffices. As you scale, inconsistencies can be a larger problem, so use the amcheck extension periodically.
  • BRIN indexes A Block Range INdex (BRIN) can be used when your table is naturally already sorted by a column, and you need to sort by that column. For example, for a log table that was written sequentially, setting a BRIN index on the timestamp column lets the server know that the data is already sorted.
  • Bloom filter index A bloom index is perfect for multi-column queries on big tables where you only need to test for equality. It uses a special mathematical structure called a bloom filter that’s based on probability and uses significantly less space.
 CREATE INDEX i ON t USING bloom(col1, col2, col3);
 SELECT * from t WHERE col1 = 5 AND col2 = 9 AND col3 = 'x';
  • GIN and GiST indexes \ Use a GIN or GiST index for efficient indexes based on composite values like text, arrays, and JSON.

When Do You Need Another Data Store?

There are legitimate cases for adding another datastore beyond Postgres.

Special data types

Some data stores give you data types that you just can’t get on Postgres. For example, the linked list, bitmaps, and HyperLogLog functions in Redis are not available on Postgres.

At a previous startup, we had to implement a frequency cap, which is a counter for unique users on a website based on session data (like cookies). There might be millions or tens of millions of users visiting a website. Frequency capping means you only show each user your ad once per day.

Redis has a HyperLogLog data type that is perfect for a frequency cap. It approximates set membership with a very small error rate, in exchange for O(1) time and a very small memory footprint. PFADD adds an element to a HyperLogLog set. It returns 1 if your element is not in the set already, and 0 if it is in the set.

PFADD user_ids uid1
(integer) 1
PFADD user_ids uid2
(integer) 1
PFADD user_ids uid1
(integer) 0

Heavy real-time processing

If you’re in a situation with many pub-sub events, jobs, and dozens of workers to coordinate, you may need a more specialized solution like Apache Kafka. LinkedIn engineers originally developed Kafka to handle new user events like clicks, invitations, and messages, and allow different workers to handle message passing and jobs to process the data.

Instant full-text searching

If you have a real-time application under heavy load with more than ten searches going on at a time, and you need features like autocomplete, then you may benefit more from a specialized text solution like Elasticsearch.

Conclusion

Redis, Elasticsearch, and Kafka are powerful, but sometimes adding them does more harm than good. You may be able to get the capabilities you need with Postgres by taking advantage of the lesser-known features we’ve covered here. Ensuring that you are getting the most out of Postgres can save you time and help you avoid added complexity and risks.

To save even more time and headaches, consider using a managed service like Heroku Postgres. Scaling up is a simple matter of adding additional follower replicas, high availability can be turned on with a single click, and Heroku operates it for you. If you really need to expand beyond Postgres, the other data stores that we mentioned above, such as Redis, Apache Kafka and Elasticsearch, can all be easily provisioned on Heroku. Go ahead and build your castles in the air―but anchor them to a reliable foundation, so you can dream about a better product and customer experience.

For more information on Postgres, listen to Cloud Database Workloads with Jon Daniel on Software Engineering Daily.

Discussion

pic
Editor guide
Collapse
citizen428 profile image
Michael Kohl

Postgres 12 finally added generated columns, which also come in handy for fulltext search (no more triggers for updating tsvector columns):

pgdash.io/blog/postgres-12-generat...

Collapse
thojest profile image
thojest

Hi, this is a really nice article. The funny thing is, it comes at the right time. We use Postgres and are currently considering to add Redis behind our rust backend with websockets to enable push notifications for turtle (turtle.community), but are not quite sure if this is the right way.

Does anyone have some inspiration or experience on that topic?

Collapse
zchtodd profile image
zchtodd

I'm not sure if it fits your use case perfectly, but take a look at NOTIFY, introduced in PG 9.0 -- it's basically a push notification mechanism built into PostgreSQL.

Collapse
thojest profile image
thojest

thx for the info, will have a look!

Collapse
citizen428 profile image
Michael Kohl

Not Rust, but I wrote a post about using LISTEN/NOTIFY and a small Ruby app to push DB updates over a websocket:

Collapse
thojest profile image
thojest

thx a lot :)

Collapse
markpieszak profile image
Mark Pieszak

I'm loving your articles Jason, great job again !
It's hard to find in-depth and super high-quality articles like these in this day and age :)

Collapse
mostlyjason profile image
Jason Skowronski Author

This is an amazing complement! Mind if I use this quote on my site?

Collapse
markpieszak profile image
Mark Pieszak

Of course! My pleasure :)

Anytime, keep up the great work!

Collapse
robconery profile image
Rob Conery

Hey Jason great post! Quick thing on the full text example you have: if you use plianto_tsquery it will split the words for you and add the AND. If you use phraseto_tsquery it will apply a positional argument (<->) instead of the &, which is great if you're looking for a name or place. The new websearch_to_tsquery (in PG 11) is a great general purpose query builder as well.

Collapse
davedecahedron profile image
David Howell

This is a great article!

You mention “Use multi-column indexes sparingly“, which is generally good advice, however I would qualify that by saying avoid redundant indexes. It’s important to know that multicolumn (btree) indexes have a specific ordering to the columns. If you don’t filter or join a table using one or more of the columns in that order then it can’t use the index.

For example an index on columns a,b. If you only filter on column b then this index can’t be used. It can be used if you filter by just a, or both a and b. Following on from that, if you frequently filter by both a and b then this is a good index to have. If you also have both a single column index on a, and a multicolumn index on a,b then in that specific case the single column index is the redundant one.

Collapse
sonnk profile image
Nguyen Kim Son

Great article Jason! It's really important to limit the number of technologies used in a project and not follow trends blindly! This is quite a similar analogy to the microservice vs monolith debate.

Collapse
rhamedy profile image
Rafiullah Hamedy

Great write up Jason. I wasn't aware of PL/Python until now. Definitely a big fun of the JSON data types and have never used hstore but, thank you for sharing the use case.

There is also Postgres-XL that seems to offer a lot and only 5 years old compared to 3 decade-old PostgreSQL.

Collapse
ccleary00 profile image
Corey Cleary

I wasn't aware of tsquery, thanks for pointing that out! - I'm working on a project right now where full on Elasticsearch is probably overkill, but needed something more robust than just doing LIKE

Collapse
brianjohnsonsr profile image
Brian Johnson

I think another really important PG plugin to add is TimeScaleDB -- the ability to add efficient time series data collection/query to Postgres is awesome.

Collapse
scottw profile image
Scott Watermasysk

100% especially on search. Search in Postgres is very underrated.

Collapse
amatosg profile image
Alejandro Matos 🇵🇪

This article was definitive, it was the last straw. I have been thinking about switching my Java application to postgres from mysql and now I'm in the middle of the process using pgloader and everything has worked as expected.

Yet I have an unresolved question: does it make sense to have a 2nd level cache in Hibernate when postgresql already has caching? thanks!

Collapse
mostlyjason profile image
Jason Skowronski Author

Thanks I'm glad enjoyed it! The Hibernate second level cache lives on your application server, whereas the postgresql cache lives on the database of course. This matters because it's faster to retrieve or update data already stored on your application server. It reduces network and database load by removing duplicate queries and batching writes. This is great for applications with many reads and infrequent writes, or cases where eventual consistency on writes is acceptable.

Collapse
cwreacejr profile image
Charles Reace

Don't forget Materialized Views: can be an awesome way to optimize searches on things that you might be tempted to dump into some NoSQL text-based search tool.

Collapse
thejoezack profile image
Joe Zack

Thanks for the post, I learned a lot of new things! I'd also like to mention that OLAP style queries can be rough with lots of data.

Collapse
davedecahedron profile image
David Howell

I am also thinking about this. RDBMS generally are good at most types of workloads but mixed workloads like OLTP and OLAP on the same system will interfere with each other.

How do people do embedded reporting AND transactions in modern web apps?

NoSQL solutions like ElasticSearch are mentioned but they seem more appropriate for search. Data warehouse solutions like Snowflake, BigQuery, Redshift are good for internal analytics and reporting but they just don’t have the concurrency to support direct queries from customer facing apps.

How else to do this without complex data pipelines or complex infrastructure involving Kafka?

Collapse
thejoezack profile image
Joe Zack

I do it with a complex data pipeline and infrastructure involving Kafka!

I've got my eye on Apache Druid, though I haven't spent any real time (sorry, pun totally intended) with it.

Thread Thread
davedecahedron profile image
David Howell

I’ve looked at Druid, also considering MemSQL , ClickHouse and others

Collapse
steph_baltus profile image
Stephanie Baltus

Hey, many thanks for this article, I'm working with Postgres for a while now, but still, I've learned a lot of great stuff here !

I just wanted to add that Postgres also offers HyperLogLog as a data type, just by adding an extension (see this great article for more details.)

Collapse
idoshamun profile image
Ido Shamun

Wow! An awesome review of Postgres important features.
I can't stress out more how devs sometimes just follow the hype without thinking. When someone suggests a NoSQL data store, I always ask her/him to convince me why SQL doesn't work here.

Collapse
fultonbrowne profile image
Fulton Browne

Great article, I use postgres on heroku for database stuff all the time, I love it and I recommend it all the time.

Collapse
aurelmegn profile image
Aurel

Thanks for this post,
I wrote about using postgres to setup a distributed database. Here is the link for those who are interested dev.to/sh1ftsh/setting-up-distribu...

Collapse
okbrown profile image
Orlando Brown

Where have I read this article before? Feels like either this is a clone or the other is a clone. Brb...

Collapse
okbrown profile image
Orlando Brown

Found it, it's the same guy. Fair play!
infoq.com/articles/postgres-handle...

Thanks for the other one, it was very useful.

Collapse
mostlyjason profile image
Jason Skowronski Author

Maybe InfoQ or Hackernoon? I posted it there as well :)

Collapse
hasii2011 profile image
Collapse
zzcoder profile image
zzcoder

Just want point out one error (or misunderstanding), B-Tree is not binary as stated in "B-tree indexes B-tree indexes are binary trees"

Collapse
mostlyjason profile image
Jason Skowronski Author

Thanks I meant to say "balanced trees"

Collapse
codethug profile image
Nicolas Quijano

More in-depth articles of this kind will make this old code bum happy ;)

Collapse
zoechi profile image
Günter Zöchbauer

Nice article!

Nit - there is one redundant "are" in "Extensions are to Postgres are"

Collapse
mostlyjason profile image
Collapse
fil2fip profile image
fil2fip

Nice article ! Thank you.

I think there'e a typo in the check strings length example (b- vs b=).

Collapse
mostlyjason profile image
Collapse
kelsonpw profile image
Kelson Warner

The first paragraph is repeated twice

Collapse
mostlyjason profile image
Jason Skowronski Author

Doh! Thanks I fixed it :)

Collapse
vitorscassiano_34 profile image
Vitor Cassiano

Great article.
Congratz!

Collapse
pavelbisse profile image
Pavel Bisse

Hey Jason, thanks for nice post!
There is a HyperLogLog PostgreSQL extension btw.
github.com/citusdata/postgresql-hll

Collapse
alfuken profile image
Info Comment marked as low quality/non-constructive by the community. View code of conduct
Bohdan Schepansky

Yyyyeah... You love PG... Until you need to do a simple dumb forking count on it. That's where PG sucks big time.

Collapse
davedecahedron profile image
David Howell

Is “forking” count some special operation or just a nice word in place of swearing?

Regarding (exact) count, pretty much every system has trouble doing this quickly and on most cases you really don’t need an exact count. This is especially true for medium to large data.

Table/index statistics that are kept up to date will give a good approximation.

HyperLogLog was one option mentioned which will give good-enough approximations, another approach is log-normal histograms. I don’t think this challenge is unique to PG.

Collapse
alfuken profile image
Info Comment marked as low quality/non-constructive by the community. View code of conduct
Bohdan Schepansky

Yes, it's a replacement word, and a reference to "The Good Place" TV series.

And regarding count, no system suffer as much from it as PG. Approximations are rarely an option, and while this challenge is definitely not unique to PG, the PG is doing it worst of all DBMS. Except, maybe, sqlite, and even that I'm not sure.
Such a simple functionality, but can't be implemented properly for more than a dozen of years already. PG's count can be 15x times slower than that of MySQL and 70x slower than Oracle or SQLServer. And that sheet has been there for ages. And is till there.