Jason Skowronski for Heroku

Posted on Oct 9, 2019 • Edited on Oct 16, 2019

Postgres Is Underrated—It Handles More than You Think

#architecture #webdev #devops #database

Thinking about scaling beyond your Postgres cluster and adding another data store like Redis or Elasticsearch? Before adopting a more complex infrastructure, take a minute and think again. It’s quite possible to get more out of an existing Postgres database. It can scale for heavy loads and offers powerful features which are not obvious at first sight. For example, its possible to enable in-memory caching, text search, specialized indexing, and key-value storage.

After reading this article, you may want to list down the features you want from your data store and check if Postgres will be a good fit for them. It’s powerful enough for most applications.

Why Adding Another Data Store is Not Always a Good Idea

As Fred Brooks put it in The Mythical Man-Month: "The programmer, like the poet, works only slightly removed from pure thought-stuff. [They] build castles in the air, from air, creating by exertion of the imagination."

Adding more pieces to those castles, and getting lost in the design, is endlessly fascinating; however, in the real world, building more castles in the air can get in your way. The same holds true for the latest hype in data stores. There are several advantages to choosing boring technology:

If someone new joins your team, can they easily make sense of your different data stores?
When you or another team member come back a year later, could they quickly pick up how the system works?
If you need to change your system or add features, how many pieces do you have to move around?
Have you factored in maintenance costs, security, and upgrades?
Have you accounted for the unknowns and failure modes when running your new data store in production at scale?

Although it can be managed by thoughtful design, adding multiple datastores does increase complexity. Before exploring adding additional datastores, it's worth investigating what additional features your existing datastores can offer you.

Lesser-known but Powerful Features of Postgres

Many people are unaware that Postgres offers way more than just a SQL database. If you already have Postgres in your stack, why add more pieces when Postgres can do the job?

Postgres caches, too

There’s a misconception that Postgres reads and writes from disk on every query, especially when users compare it with purely in-memory data stores like Redis.

Actually, Postgres has a beautifully designed caching system with pages, usage counts, and transaction logs. Most of your queries will not need to access the disk, especially if they refer to the same data over and over again, as many queries tend to do.

The shared_buffer configuration parameter in the Postgres configuration file determines how much memory it will use for caching data. Typically it should be set to 25% to 40% of the total memory. That’s because Postgres also uses the operating system cache for its operation. With more memory, most recurring queries referring the same data set will not need to access the disk. Here is how you can set this parameter in the Postgres CLI:

ALTER SYSTEM SET shared_buffer TO = <value>

Managed database services like Heroku offer several plans where RAM (and hence cache) is a major differentiator. The free hobby version does not offer dedicated resources like RAM. Upgrade when you’re ready for production loads so you can make better use of caching.

You can also use some of the more advanced caching tools. For example, check the pg_buffercache view to see what’s occupying the shared buffer cache of your instance. Another tool to use is the pg_prewarm function which comes as part of the base installation. This function enables DBAs to load table data into either the operating system cache or the Postgres buffer cache. The process can be manual or automated. If you know the nature of your database queries, this can greatly improve application performance.

For the really brave at heart, refer to this article for an in-depth description of Postgres caching.

Text searching

Elasticsearch is excellent, but many use cases can get along just fine with Postgres for text searching. Postgres has a special data type, tsvector, and a set of functions, like to_tsvector and to_tsquery, to search quickly through text. tsvector represents a document optimized for text search by sorting terms and normalizing variants. Here is an example of the to_tsquery function:

SELECT to_tsquery('english', 'The & Boys & Girls');

  to_tsquery   
---------------
 'boy' & 'girl'

You can sort your results by relevance depending on how often and which fields your query appeared in the results. For example, you can make the title more relevant than the body. Check the Postgres documentation for details.

Functions in Postgres

Postgres provides a powerful server-side function environment in multiple programming languages.

Try to pre-process as much data as you can on the Postgres server with server-side functions. That way, you can cut down on the latency that comes from passing too much data back and forth between your application servers and your database. This approach is particularly useful for large aggregations and joins.

What’s even better is your development team can use its existing skill set for writing Postgres code. Other than the default PL/pgSQL (Postgres’ native procedural language), Postgres functions and triggers can be written in PL/Python, PL/Perl, PL/V8 (JavaScript extension for Postgres) and PL/R.

Here is an example of creating a PL/Python function for checking string lengths:

CREATE FUNCTION longer_string_length (string1 string, string2 string)
  RETURNS integer
AS $$
  a=len(string1)
  b=len(string2)
  if a > b:
    return a
  return b
$$ LANGUAGE plpythonu;

Postgres offers powerful extensions

Extensions are to Postgres what plug-ins mean in many applications. Suitable use of Postgres extensions can also mean you don’t have to work with other data stores for extra functionality. There are many extensions available and listed on the main Postgres website.

Geospatial Data

PostGIS is a specialized extension for Postgres used for geospatial data manipulation and running location queries in SQL. It’s widely popular among GIS application developers who use Postgres. A great beginner’s guide to using PostGIS can be found here.

The code snippet below shows how we are adding the PostGIS extension to the current database. From the OS, we run these commands to install the package (assuming you are using Ubuntu):

$ sudo add-apt-repository ppa:ubuntugis/ppa
$ sudo apt-get update
$ sudo apt-get install postgis

After that, log in to your Postgres instance and install the extension:

CREATE EXTENSION postgis;
CREATE EXTENSION postgis_topology;

If you want to check what extensions you have in the current database, run this command:

SELECT * FROM pg_available_extensions;

Key-Value Data Type

The Postgres hstore extension allows storing and searching simple key-value pairs. This tutorial provides an excellent overview of how to work with hstore data type.

Semi-structured Data Types

There are two native data types for storing semi-structured data in Postgres: JSON and XML. The JSON data type can host both native JSON and its binary form (JSONB). The latter can significantly improve query performance when it is searched. As you can see below, it can convert JSON strings to native JSON objects:

SELECT '{"product1": ["blue", "green"], "tags": {"price": 10, "discounted": false}}'::json;

json                       
---------------------------------------------------------------------
 {"product1": ["blue", "green"], "tags": {"price": 10, "discounted": false}}

Tips for Scaling Postgres

If you’re considering switching off Postgres due to performance reasons, first see how far you can get with the optimizations it offers. Here we'll assume you've done the basics, like creating appropriate indexes. Postgres offers many advanced features, and while the changes are small they can make a big difference, especially if it keeps you from complicating your infrastructure.

Don’t over-index

Avoid unnecessary indexes. Use multi-column indexes sparingly. Too many indexes take up extra memory that crowd out better uses of the Postgres cache, which is crucial for performance.

Using a tool like EXPLAIN ANALYZE might surprise you by how often the query planer actually chooses sequential table scans. Since much of your table’s row data is already cached, oftentimes these elaborate indexes aren’t even used.

That said, if you do find slow queries, the first and most obvious solution is to see if the table is missing an index. Indexes are vital, but you have to use them correctly.

Partial indexes save space

A partial index can save space by specifying which values get indexed. For example, you want to order by a user’s signup date, but only care about the users who have signed up:

CREATE INDEX user_signup_date ON users(signup_date) WHERE is_signed_up;

Understanding Postgres index types

Choosing the right index for your data can improve performance. Here are some common index types and when you should use each one.

B-tree indexes B-tree indexes are balanced trees that are used to sort data efficiently. They’re the default if you use the INDEX command. Most of the time, a B-tree index suffices. As you scale, inconsistencies can be a larger problem, so use the amcheck extension periodically.
BRIN indexes A Block Range INdex (BRIN) can be used when your table is naturally already sorted by a column, and you need to sort by that column. For example, for a log table that was written sequentially, setting a BRIN index on the timestamp column lets the server know that the data is already sorted.
Bloom filter index A bloom index is perfect for multi-column queries on big tables where you only need to test for equality. It uses a special mathematical structure called a bloom filter that’s based on probability and uses significantly less space.

 CREATE INDEX i ON t USING bloom(col1, col2, col3);
 SELECT * from t WHERE col1 = 5 AND col2 = 9 AND col3 = 'x';

GIN and GiST indexes \ Use a GIN or GiST index for efficient indexes based on composite values like text, arrays, and JSON.

When Do You Need Another Data Store?

There are legitimate cases for adding another datastore beyond Postgres.

Special data types

Some data stores give you data types that you just can’t get on Postgres. For example, the linked list, bitmaps, and HyperLogLog functions in Redis are not available on Postgres.

At a previous startup, we had to implement a frequency cap, which is a counter for unique users on a website based on session data (like cookies). There might be millions or tens of millions of users visiting a website. Frequency capping means you only show each user your ad once per day.

Redis has a HyperLogLog data type that is perfect for a frequency cap. It approximates set membership with a very small error rate, in exchange for O(1) time and a very small memory footprint. PFADD adds an element to a HyperLogLog set. It returns 1 if your element is not in the set already, and 0 if it is in the set.

PFADD user_ids uid1
(integer) 1
PFADD user_ids uid2
(integer) 1
PFADD user_ids uid1
(integer) 0

Heavy real-time processing

If you’re in a situation with many pub-sub events, jobs, and dozens of workers to coordinate, you may need a more specialized solution like Apache Kafka. LinkedIn engineers originally developed Kafka to handle new user events like clicks, invitations, and messages, and allow different workers to handle message passing and jobs to process the data.

Instant full-text searching

If you have a real-time application under heavy load with more than ten searches going on at a time, and you need features like autocomplete, then you may benefit more from a specialized text solution like Elasticsearch.

Conclusion

Redis, Elasticsearch, and Kafka are powerful, but sometimes adding them does more harm than good. You may be able to get the capabilities you need with Postgres by taking advantage of the lesser-known features we’ve covered here. Ensuring that you are getting the most out of Postgres can save you time and help you avoid added complexity and risks.

To save even more time and headaches, consider using a managed service like Heroku Postgres. Scaling up is a simple matter of adding additional follower replicas, high availability can be turned on with a single click, and Heroku operates it for you. If you really need to expand beyond Postgres, the other data stores that we mentioned above, such as Redis, Apache Kafka and Elasticsearch, can all be easily provisioned on Heroku. Go ahead and build your castles in the air―but anchor them to a reliable foundation, so you can dream about a better product and customer experience.

For more information on Postgres, listen to Cloud Database Workloads with Jon Daniel on Software Engineering Daily.

Top comments (42)

thojest • Oct 9 '19

Hi, this is a really nice article. The funny thing is, it comes at the right time. We use Postgres and are currently considering to add Redis behind our rust backend with websockets to enable push notifications for turtle (turtle.community), but are not quite sure if this is the right way.

Does anyone have some inspiration or experience on that topic?

zchtodd • Oct 9 '19

I'm not sure if it fits your use case perfectly, but take a look at NOTIFY, introduced in PG 9.0 -- it's basically a push notification mechanism built into PostgreSQL.

thojest • Oct 9 '19

thx for the info, will have a look!

Utkarsh Kumar Raut • May 10 '21

Really nice project, Turtle.

Mark Pieszak • Oct 9 '19

I'm loving your articles Jason, great job again !
It's hard to find in-depth and super high-quality articles like these in this day and age :)

Jason Skowronski • Oct 10 '19

This is an amazing complement! Mind if I use this quote on my site?

Mark Pieszak • Oct 10 '19

Of course! My pleasure :)

Anytime, keep up the great work!

Brian Johnson • Oct 11 '19

I think another really important PG plugin to add is TimeScaleDB -- the ability to add efficient time series data collection/query to Postgres is awesome.

David Howell • Oct 16 '19 • Edited

This is a great article!

You mention “Use multi-column indexes sparingly“, which is generally good advice, however I would qualify that by saying avoid redundant indexes. It’s important to know that multicolumn (btree) indexes have a specific ordering to the columns. If you don’t filter or join a table using one or more of the columns in that order then it can’t use the index.

For example an index on columns a,b. If you only filter on column b then this index can’t be used. It can be used if you filter by just a, or both a and b. Following on from that, if you frequently filter by both a and b then this is a good index to have. If you also have both a single column index on a, and a multicolumn index on a,b then in that specific case the single column index is the redundant one.

Nguyen Kim Son • Oct 10 '19

Great article Jason! It's really important to limit the number of technologies used in a project and not follow trends blindly! This is quite a similar analogy to the microservice vs monolith debate.

Rafiullah Hamedy • Oct 15 '19

Great write up Jason. I wasn't aware of PL/Python until now. Definitely a big fun of the JSON data types and have never used hstore but, thank you for sharing the use case.

There is also Postgres-XL that seems to offer a lot and only 5 years old compared to 3 decade-old PostgreSQL.

Rob Conery • Oct 16 '19

Hey Jason great post! Quick thing on the full text example you have: if you use plianto_tsquery it will split the words for you and add the AND. If you use phraseto_tsquery it will apply a positional argument (<->) instead of the &, which is great if you're looking for a name or place. The new websearch_to_tsquery (in PG 11) is a great general purpose query builder as well.

Corey Cleary • Oct 9 '19

I wasn't aware of tsquery, thanks for pointing that out! - I'm working on a project right now where full on Elasticsearch is probably overkill, but needed something more robust than just doing LIKE

Scott Watermasysk • Oct 9 '19

100% especially on search. Search in Postgres is very underrated.

Alejandro • Nov 22 '19 • Edited

This article was definitive, it was the last straw. I have been thinking about switching my Java application to postgres from mysql and now I'm in the middle of the process using pgloader and everything has worked as expected.

Yet I have an unresolved question: does it make sense to have a 2nd level cache in Hibernate when postgresql already has caching? thanks!

Jason Skowronski • Nov 22 '19

Thanks I'm glad enjoyed it! The Hibernate second level cache lives on your application server, whereas the postgresql cache lives on the database of course. This matters because it's faster to retrieve or update data already stored on your application server. It reduces network and database load by removing duplicate queries and batching writes. This is great for applications with many reads and infrequent writes, or cases where eventual consistency on writes is acceptable.

View full discussion (42 comments)

DEV Community