Using JSONB in PostgreSQL: How to Effectively Store & Index JSON Data in PostgreSQL


JSON stands for JavaScript Object Notation. It is an open standard format, defined in RFC 7159, that organizes data into key/value pairs and arrays. JSON is the most common format used by web services to exchange data and to store documents and other unstructured data. In this post, we are going to show you tips and techniques on how to effectively store and index JSON data in PostgreSQL.


You can also check out our Working with JSON Data in PostgreSQL vs. MongoDB webinar in partnership with PostgresConf to learn more on the topic, and check out our SlideShare page to download the slides.

Why Store JSON in PostgreSQL?

Why should a relational database even care about unstructured data? It turns out that there are a few scenarios where it is useful.

  1. Schema flexibility

    One of the main reasons to store data using the JSON format is schema flexibility. Storing your data in JSON is useful when your schema is fluid and changing frequently. If you store each of the keys as columns, frequent schema changes will result in frequent DDL (ALTER TABLE) operations - this can be difficult when your data set is large - for example, event tracking, analytics, tags, etc. Note: If a particular key is always present in your document, it might make sense to store it as a first-class column. We discuss this approach further in the "JSONB Patterns & Antipatterns" section below.
  2. Nested objects

    If your data set has nested objects (single or multi-level), in some cases, it is easier to handle them in JSON instead of denormalizing the data into columns or multiple tables.
  3. Syncing with external data sources

    Oftentimes an external system provides data as JSON, so JSONB can serve as a temporary store before the data is ingested into other parts of the system - for example, Stripe transactions.

Timeline of JSON Support in PostgreSQL

JSON support in PostgreSQL was introduced in 9.2 and has steadily improved in every release going forward.

  • Wave 1: PostgreSQL 9.2 (2012) added support for the JSON data type

    The JSON data type in 9.2 was fairly limited (and probably overhyped at the time) - basically a glorified string with some JSON validation thrown in. It is useful for validating incoming JSON and storing it in the database. More details are provided below.
  • Wave 2: PostgreSQL 9.4 (2014) added support for JSONB data type

    JSONB stands for "JSON Binary" or "JSON better" depending on whom you ask. It is a decomposed binary format to store JSON. JSONB supports indexing the JSON data, and is very efficient at parsing and querying the JSON data. In most cases, when you work with JSON in PostgreSQL, you should be using JSONB.
  • Wave 3: PostgreSQL 12 (2019) added support for the SQL/JSON standard and JSONPath queries

    JSONPath brings a powerful JSON query engine to PostgreSQL.

PostgreSQL Version Timeline for JSON Support - ScaleGrid Blog

When Should You Use JSON vs. JSONB?

In most cases, JSONB is what you should be using. However, there are some specific cases where JSON works better:

  • JSON preserves the original formatting (i.e., whitespace) and the ordering of keys.
  • JSON preserves duplicate keys.
  • JSON is faster to ingest vs. JSONB - however, if you do any further processing, JSONB will be faster (the example below illustrates the first two points).
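
The first two differences are easy to see by casting the same string literal to both types - a minimal illustration (the results shown in the comments are what you can expect on a recent PostgreSQL version):

SELECT '{"b": 1, "a": 2, "a": 3}'::json  AS as_json,
       '{"b": 1, "a": 2, "a": 3}'::jsonb AS as_jsonb;
-- as_json keeps the text verbatim:                        {"b": 1, "a": 2, "a": 3}
-- as_jsonb deduplicates keys (last value wins) and reorders them: {"a": 3, "b": 1}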

For example, if you're just ingesting JSON logs and not querying them in any way, then JSON might be a better option for you. For the purposes of this blog, when we refer to JSON support in PostgreSQL, we mean JSONB going forward.

JSONB Patterns & Antipatterns

If PostgreSQL has such great support for JSONB, why do we need columns anymore? Why not just create a table with a JSONB blob and get rid of all the other columns, like the schema below:

CREATE TABLE test(id int, data JSONB, PRIMARY KEY (id));

At the end of the day, columns are still the most efficient technique to work with your data. JSONB storage has some drawbacks vs. traditional columns:

  • PostgreSQL does not store column statistics for JSONB columns

    PostgreSQL maintains statistics about the distributions of values in each column of the table - most common values (MCV), NULL entries, histogram of distribution. Based on this data, the PostgreSQL query planner makes smart decisions on the plan to use for the query. At this point, PostgreSQL does not store any stats for JSONB columns or keys. This can sometimes result in poor choices like using nested loop joins vs. hash joins, etc. A more detailed example of this is provided in this blog post - When To Avoid JSONB In A PostgreSQL Schema.
  • JSONB storage results in a larger storage footprint

    JSONB storage does not deduplicate the key names in the JSON. This can result in a considerably larger storage footprint compared to MongoDB's BSON on WiredTiger or to traditional column storage. I ran a simple test storing about 10 million rows of data; in terms of key storage, JSONB behaves much like the MongoDB MMAPv1 storage model, where the keys are stored as-is without any compression. One long-term fix is to move the key names to a table-level dictionary and refer to this dictionary instead of storing the key names repeatedly. Until then, the workaround is to use more compact (unix-style) names instead of more descriptive ones. For example, if you're storing millions of instances of a particular key, it is better storage-wise to name it "pb" instead of "publisherName".

The most efficient way to leverage JSONB in PostgreSQL is to combine columns and JSONB. If a key appears very frequently in your JSONB blobs, it is probably better off being stored as a column. Use JSONB as a "catch all" to handle the variable parts of your schema while leveraging traditional columns for fields that are more stable.
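
As a sketch of this hybrid approach (the table and key names here are illustrative, not part of the books dataset used in the rest of this post), the stable, frequently queried fields become first-class columns and the variable attributes land in a JSONB "catch all" column:

-- Hypothetical hybrid schema: stable fields as columns, variable attributes in JSONB
CREATE TABLE events (
    id         bigserial PRIMARY KEY,
    user_id    bigint      NOT NULL,
    event_type text        NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    properties jsonb       -- variable, per-event-type attributes
);

-- Stable fields use ordinary column predicates and indexes, while the
-- variable part remains queryable through the JSONB operators.
SELECT count(*)
FROM events
WHERE event_type = 'page_view'
  AND properties @> '{"utm_source": "newsletter"}';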

JSONB Data Structures

JSON Implementation Structure

Both JSONB and MongoDB BSON are essentially tree structures, using multi-level nodes to store the parsed JSON data. MongoDB BSON has a very similar structure.

JSONB Implementation Structure

JSONB & TOAST

Another important consideration for storage is how JSONB interacts with TOAST (The Oversized-Attribute Storage Technique). Typically, when the size of your column exceeds TOAST_TUPLE_THRESHOLD (2 KB by default), PostgreSQL will attempt to compress the data to fit within that limit. If that doesn't work, the data is moved to out-of-line storage. This is what is called "TOASTing" the data; when the data is fetched, the reverse process, "deTOASTing", needs to happen. You can also control the TOAST storage strategy:

  • Extended - Allows for out-of-line storage and compression (using pglz). This is the default option.
  • External - Allows for out-of-line storage, but not compression.

If you're experiencing delays due to TOAST compression or decompression, one option is to proactively set the column storage to EXTERNAL (out-of-line storage without compression). For all of the details, please refer to this PostgreSQL doc.
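
For example, assuming the books table used in the examples below, the storage strategy of the JSONB column can be changed with a plain ALTER TABLE - a sketch; measure before and after to confirm it helps your workload:

-- Store the JSONB column out-of-line without compression
ALTER TABLE books ALTER COLUMN data SET STORAGE EXTERNAL;

-- Verify the storage strategy ('x' = extended, 'e' = external)
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'books'::regclass AND attname = 'data';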

JSONB Operators & Functions

PostgreSQL provides a variety of operators to work on JSONB. From the docs:

Operator Description
-> Get JSON array element (indexed from zero, negative integers count from the end)
-> Get JSON object field by key
->> Get JSON array element as text
->> Get JSON object field as text
#> Get JSON object at the specified path
#>> Get JSON object at the specified path as text
@> Does the left JSON value contain the right JSON path/value entries at the top level?
<@ Are the left JSON path/value entries contained at the top level within the right JSON value?
? Does the string exist as a top-level key within the JSON value?
?| Do any of these array strings exist as top-level keys?
?& Do all of these array strings exist as top-level keys?
|| Concatenate two jsonb values into a new jsonb value
- Delete key/value pair or string element from left operand. Key/value pairs are matched based on their key value.
- Delete multiple key/value pairs or string elements from left operand. Key/value pairs are matched based on their key value.
- Delete the array element with specified index (negative integers count from the end). Throws an error if the top-level container is not an array.
#- Delete the field or element with specified path (for JSON arrays, negative integers count from the end)
@? Does JSON path return any item for the specified JSON value?
@@ Returns the result of a JSON path predicate check for the specified JSON value. Only the first item of the result is taken into account. If the result is not Boolean, then null is returned.

PostgreSQL also provides a variety of Creation Functions and Processing Functions to work with the JSONB data.
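
For instance, jsonb_build_object (a creation function) and jsonb_set (a processing function) are two commonly used ones - a standalone illustration, not tied to the books examples below:

-- Build a JSONB value from alternating key/value arguments
SELECT jsonb_build_object('publisher', 'EfgdxUdvB0', 'criticrating', 1) AS doc;

-- Update a key inside an existing JSONB value (the last argument creates the key if missing)
SELECT jsonb_set('{"publisher": "EfgdxUdvB0", "criticrating": 1}'::jsonb,
                 '{criticrating}', '5'::jsonb, true) AS updated;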

JSONB Indexes

JSONB provides a wide array of options to index your JSON data. At a high-level, we are going to dig into 3 different types of indexes - GIN, BTREE and HASH. Not all index types support all operator classes, so planning is needed to design your indexes based on the type of operators and queries that you plan on using.

GIN Indexes

GIN stands for "Generalized Inverted Index". From the docs:

"GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items. For example, the items could be documents, and the queries could be searches for documents containing specific words."

PostgreSQL GIN Structure - ScaleGrid Blog

GIN supports two operator classes:

  • jsonb_ops (default) - ?, ?|, ?&, @>, @@, @? [Index each key and value in the JSONB element]
  • jsonb_path_ops - @>, @@, @? [Index only the values in the JSONB element]

Let's create a GIN index (using the default jsonb_ops operator class) on the data column:

CREATE INDEX datagin ON books USING gin (data);

Existence Operators (?, ?|, ?& )

These operators can be used to check for the existence of top-level keys in the JSONB. For example, let's find all books that have a "braille" attribute. The JSON looks something like this:

"{"tags": {"nk594127": {"ik71786": "iv678771"}}, "braille": false, "keywords": ["abc", "kef", "keh"], "hardcover": true, "publisher": "EfgdxUdvB0", "criticrating": 1}
demo=# select * from books where data ? 'braille';
id | author | isbn | rating | data

--------------+-----------------+------------+--------+------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------
1000005 | XEI7xShT8bPu6H7 | 2kD5XJDZUF | 0 | {"tags": {"nk455671": {"ik937456": "iv506075"}}, "braille": true, "keywords": ["abc", "kef", "keh"], "hardcover": false, "publisher": "zSfZIAjGGs", "
criticrating": 4}
.....

demo=# explain analyze select * from books where data ? 'braille';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on books (cost=12.75..1005.25 rows=1000 width=158) (actual time=0.033..0.039 rows=15 loops=1)
Recheck Cond: (data ? 'braille'::text)
Heap Blocks: exact=2
-> Bitmap Index Scan on datagin (cost=0.00..12.50 rows=1000 width=0) (actual time=0.022..0.022 rows=15 loops=1)
Index Cond: (data ? 'braille'::text)
Planning Time: 0.102 ms
Execution Time: 0.067 ms
(7 rows)

As you can see from the explain output, the GIN index that we created is being used for the search. What if we wanted to find books that were in braille or in hardcover?

demo=# explain analyze select * from books where data ?| array['braille','hardcover'];
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on books (cost=16.75..1009.25 rows=1000 width=158) (actual time=0.029..0.035 rows=15 loops=1)
Recheck Cond: (data ?| '{braille,hardcover}'::text[])
Heap Blocks: exact=2
-> Bitmap Index Scan on datagin (cost=0.00..16.50 rows=1000 width=0) (actual time=0.023..0.023 rows=15 loops=1)
Index Cond: (data ?| '{braille,hardcover}'::text[])
Planning Time: 0.138 ms
Execution Time: 0.057 ms
(7 rows)

The GIN index supports the "existence" operators only on "top-level" keys. If the key is not at the top level, then the index will not be used. It will result in a sequential scan:

demo=# select * from books where data->'tags' ? 'nk455671';
id | author | isbn | rating | data

--------------+-----------------+------------+--------+------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------
1000005 | XEI7xShT8bPu6H7 | 2kD5XJDZUF | 0 | {"tags": {"nk455671": {"ik937456": "iv506075"}}, "braille": true, "keywords": ["abc", "kef", "keh"], "hardcover": false, "publisher": "zSfZIAjGGs", "
criticrating": 4}
685122 | GWfuvKfQ1PCe1IL | jnyhYYcF66 | 3 | {"tags": {"nk455671": {"ik615925": "iv253423"}}, "publisher": "b2NwVg7VY3", "criticrating": 0}
(2 rows)

demo=# explain analyze select * from books where data->'tags' ? 'nk455671';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Seq Scan on books (cost=0.00..38807.29 rows=1000 width=158) (actual time=0.018..270.641 rows=2 loops=1)
Filter: ((data -> 'tags'::text) ? 'nk455671'::text)
Rows Removed by Filter: 1000017
Planning Time: 0.078 ms
Execution Time: 270.728 ms
(5 rows)

The way to check for existence in nested docs is to use "expression indexes". Let's create an index on data->tags:

CREATE INDEX datatagsgin ON books USING gin (data->'tags');
demo=# select * from books where data->'tags' ? 'nk455671';
id | author | isbn | rating | data

--------------+-----------------+------------+--------+------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------
1000005 | XEI7xShT8bPu6H7 | 2kD5XJDZUF | 0 | {"tags": {"nk455671": {"ik937456": "iv506075"}}, "braille": true, "keywords": ["abc", "kef", "keh"], "hardcover": false, "publisher": "zSfZIAjGGs", "
criticrating": 4}
685122 | GWfuvKfQ1PCe1IL | jnyhYYcF66 | 3 | {"tags": {"nk455671": {"ik615925": "iv253423"}}, "publisher": "b2NwVg7VY3", "criticrating": 0}
(2 rows)

demo=# explain analyze select * from books where data->'tags' ? 'nk455671';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on books (cost=12.75..1007.75 rows=1000 width=158) (actual time=0.031..0.035 rows=2 loops=1)
Recheck Cond: ((data ->'tags'::text) ? 'nk455671'::text)
Heap Blocks: exact=2
-> Bitmap Index Scan on datatagsgin (cost=0.00..12.50 rows=1000 width=0) (actual time=0.021..0.021 rows=2 loops=1)
Index Cond: ((data ->'tags'::text) ? 'nk455671'::text)
Planning Time: 0.098 ms
Execution Time: 0.061 ms
(7 rows)

Note: An alternative here is to use the @> operator:

select * from books where data @> '{"tags":{"nk455671":{}}}'::jsonb;

However, this only works if the value is an object. So, if you're unsure if the value is an object or a primitive value, it could lead to incorrect results.
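
To illustrate the pitfall with a sketch against the books data (the exact counts depend on your dataset): containment with an empty object only matches keys whose value is itself an object, so a key holding a primitive value is silently missed, while the ? operator checks for the key regardless of the value's type:

-- 'tags' holds an object, so the empty-object containment matches those rows
select count(*) from books where data @> '{"tags": {}}'::jsonb;

-- 'braille' holds a boolean, so this returns 0 even though the key exists in many rows
select count(*) from books where data @> '{"braille": {}}'::jsonb;

-- the existence operator matches on the key alone
select count(*) from books where data ? 'braille';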

Path Operators (@>, <@)

The "path" operator can be used for multi-level queries of your JSONB data. Let's use it similar to the ? operator above:

select * from books where data @> '{"braille":true}'::jsonb;
demo=# explain analyze select * from books where data @> '{"braille":true}'::jsonb;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on books (cost=16.75..1009.25 rows=1000 width=158) (actual time=0.040..0.048 rows=6 loops=1)
Recheck Cond: (data @> '{"braille": true}'::jsonb)
Rows Removed by Index Recheck: 9
Heap Blocks: exact=2
-> Bitmap Index Scan on datagin (cost=0.00..16.50 rows=1000 width=0) (actual time=0.030..0.030 rows=15 loops=1)
Index Cond: (data @> '{"braille": true}'::jsonb)
Planning Time: 0.100 ms
Execution Time: 0.076 ms
(8 rows)

The path operators support querying nested objects or top-level objects:

demo=# select * from books where data @> '{"publisher":"XlekfkLOtL"}'::jsonb;
id | author | isbn | rating | data
----------+-----------------+------------+--------+-------------------------------------------------------------------------------------
346 | uD3QOvHfJdxq2ez | KiAaIRu8QE | 1 | {"tags": {"nk88": {"ik37": "iv161"}}, "publisher": "XlekfkLOtL", "criticrating": 3}
(1 row)

demo=# explain analyze select * from books where data @> '{"publisher":"XlekfkLOtL"}'::jsonb;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on books (cost=16.75..1009.25 rows=1000 width=158) (actual time=0.491..0.492 rows=1 loops=1)
Recheck Cond: (data @> '{"publisher": "XlekfkLOtL"}'::jsonb)
Heap Blocks: exact=1
-> Bitmap Index Scan on datagin (cost=0.00..16.50 rows=1000 width=0) (actual time=0.092..0.092 rows=1 loops=1)
Index Cond: (data @> '{"publisher": "XlekfkLOtL"}'::jsonb)
Planning Time: 0.090 ms
Execution Time: 0.523 ms

The queries can be multi-level as well:

demo=# select * from books where data @> '{"tags":{"nk455671":{"ik937456":"iv506075"}}}'::jsonb;
id | author | isbn | rating | data

--------------+-----------------+------------+--------+------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------
1000005 | XEI7xShT8bPu6H7 | 2kD5XJDZUF | 0 | {"tags": {"nk455671": {"ik937456": "iv506075"}}, "braille": true, "keywords": ["abc", "kef", "keh"], "hardcover": false, "publisher": "zSfZIAjGGs", "
criticrating": 4}
(1 row)

GIN Index "pathops" Operator Class

GIN also supports a "pathops" option to reduce the size of the GIN index. When you use the pathops option, the supported operators are limited to @>, @@ and @?, so you need to be careful with your queries. From the docs:

"The technical difference between a jsonb_ops and a jsonb_path_ops GIN index is that the former creates independent index items for each key and value in the data, while the latter creates index items only for each value in the data"

You can create a GIN pathops index as follows:

CREATE INDEX dataginpathops ON books USING gin (data jsonb_path_ops);

On my small dataset of 1 million books, you can see that the pathops GIN index is smaller - you should test with your dataset to understand the savings:

public | dataginpathops | index | sgpostgres | books | 67 MB |
public | datatagsgin | index | sgpostgres | books | 84 MB |

Let's rerun our query from before with the pathops index:

demo=# select * from books where data @> '{"tags":{"nk455671":{"ik937456":"iv506075"}}}'::jsonb;
id | author | isbn | rating | data

--------------+-----------------+------------+--------+------------------------------------------------------------------------------------------------------------------------------------------------------

1000005 | XEI7xShT8bPu6H7 | 2kD5XJDZUF | 0 | {"tags": {"nk455671": {"ik937456": "iv506075"}}, "braille": true, "keywords": ["abc", "kef", "keh"], "hardcover": false, "publisher": "zSfZIAjGGs", "
criticrating": 4}
(1 row)

demo=# explain select * from books where data @> '{"tags":{"nk455671":{"ik937456":"iv506075"}}}'::jsonb;

QUERY PLAN

Bitmap Heap Scan on books (cost=12.75..1005.25 rows=1000 width=158)
Recheck Cond: (data @> '{"tags": {"nk455671": {"ik937456": "iv506075"}}}'::jsonb)
-> Bitmap Index Scan on dataginpathops (cost=0.00..12.50 rows=1000 width=0)
Index Cond: (data @> '{"tags": {"nk455671": {"ik937456": "iv506075"}}}'::jsonb)
(4 rows)

However, as mentioned above, the "pathops" option does not support all of the scenarios that the default operator class supports. With a "pathops" GIN index, none of the following queries can leverage the index. To summarize, you get a smaller index, but one that supports a more limited set of use cases.

select * from books where data ? 'tags'; => Sequential scan
select * from books where data @> '{"tags" :{}}'; => Sequential scan
select * from books where data @> '{"tags" :{"k7888":{}}}'; => Sequential scan

B-Tree indexes

B-tree indexes are the most common index type in relational databases. However, if you index an entire JSONB column with a B-tree index, the only useful operators are =, <, <=, > and >=. Essentially, this can only be used for whole-object comparisons, which is a very limited use case.

A more common scenario is to use B-tree "expression indexes". For a primer, refer here - Indexes on Expressions. B-tree expression indexes can support the common comparison operators =, <, >, >= and <=. As you might recall, GIN indexes don't support these operators. Let's consider the case where we want to retrieve all books with data->'criticrating' > 4. You might build a query something like this:

demo=# select * from books where data->'criticrating' > 4;
ERROR: operator does not exist: jsonb > integer
LINE 1: select * from books where data->'criticrating' > 4;
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.

Well, that doesn't work since the '->' operator returns a JSONB type. So we need to use something like this:

demo=# select * from books where (data->'criticrating')::int4 > 4;

If you're using a version prior to PostgreSQL 11, it gets uglier - direct casts from jsonb to numeric types are not available, so you need to first extract the value as text (using the ->> operator) and then cast it to an integer:

demo=# select * from books where (data->>'criticrating')::int4 > 4;

For expression indexes, the index needs to be an exact match with the query expression. So, our index would look something like this:

demo=# CREATE INDEX criticrating ON books USING BTREE (((data->'criticrating')::int4));
CREATE INDEX

demo=# explain analyze select * from books where (data->'criticrating')::int4 = 3;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Index Scan using criticrating on books (cost=0.42..4626.93 rows=5000 width=158) (actual time=0.069..70.221 rows=199883 loops=1)
Index Cond: (((data -> 'criticrating'::text))::integer = 3)
Planning Time: 0.103 ms
Execution Time: 79.019 ms
(4 rows)


From the above, we can see that the BTREE index is being used as expected.

Hash Indexes

If you are only interested in the "=" operator, then hash indexes become interesting. For example, consider the case where we are looking for books from a particular publisher, i.e. data->>'publisher' = 'XlekfkLOtL'. The element to be indexed can be a top-level element or deeply nested.

CREATE INDEX publisherhash ON books USING HASH ((data->>'publisher'));

Hash indexes also tend to be smaller in size than B-tree or GIN indexes. Of course, this ultimately depends on your data set.
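
One way to check is to compare the on-disk sizes of the indexes directly - a quick sketch (the names assume the indexes created earlier in this post, and the sizes will of course vary with your data):

SELECT c.relname AS index_name,
       pg_size_pretty(pg_relation_size(c.oid)) AS index_size
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
WHERE c.relname IN ('publisherhash', 'datagin', 'criticrating');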

demo=# select * from books where data->>'publisher' = 'XlekfkLOtL';
id | author | isbn | rating | data
----------+-----------------+------------+--------+-------------------------------------------------------------------------------------
346 | uD3QOvHfJdxq2ez | KiAaIRu8QE | 1 | {"tags": {"nk88": {"ik37": "iv161"}}, "publisher": "XlekfkLOtL", "criticrating": 3}
(1 row)

demo=# explain analyze select * from books where data->>'publisher' = 'XlekfkLOtL';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Scan using publisherhash on books (cost=0.00..2.02 rows=1 width=158) (actual time=0.016..0.017 rows=1 loops=1)
Index Cond: ((data ->> 'publisher'::text) = 'XlekfkLOtL'::text)
Planning Time: 0.080 ms
Execution Time: 0.035 ms
(4 rows)

Special Mention: GIN Trigram Indexes

PostgreSQL supports string matching using trigram indexes. Trigram indexes work by breaking up text into trigrams - basically, words broken up into sequences of 3 letters. More information can be found in the documentation. GIN indexes support the "gin_trgm_ops" operator class, which can be used to index the text data in JSONB. You can use expression indexes to build the trigram index on a particular key.

CREATE EXTENSION pg_trgm;
CREATE INDEX publisher ON books USING GIN ((data->>'publisher') gin_trgm_ops);

demo=# select * from books where data->>'publisher' LIKE '%I0UB%';
 id |     author      |    isbn    | rating |                                      data
---------+-----------------+------------+--------+---------------------------------------------------------------------------------
  4 | KiEk3xjqvTpmZeS | EYqXO9Nwmm |      0 | {"tags": {"nk3": {"ik1": "iv1"}}, "publisher": "MI0UBqZJDt", "criticrating": 1}
(1 row)

As you can see in the query above, we can search for any arbitrary string occurring at any position. Unlike B-tree indexes, we are not restricted to left-anchored expressions.

demo=# explain analyze select * from books where data->>'publisher' LIKE '%I0UB%';
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on books  (cost=9.78..111.28 rows=100 width=158) (actual time=0.033..0.033 rows=1 loops=1)
   Recheck Cond: ((data ->> 'publisher'::text) ~~ '%I0UB%'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on publisher  (cost=0.00..9.75 rows=100 width=0) (actual time=0.025..0.025 rows=1 loops=1)
         Index Cond: ((data ->> 'publisher'::text) ~~ '%I0UB%'::text)
 Planning Time: 0.213 ms
 Execution Time: 0.058 ms
(7 rows)

Special Mention: GIN Array Indexes

JSONB has great built-in support for indexing arrays. Let's consider an example of indexing an array of strings using a GIN index, in the case where our JSONB data contains a "keywords" element and we would like to find rows with particular keywords:

{"tags": {"nk780341": {"ik397357": "iv632731"}}, "keywords": ["abc", "kef", "keh"], "publisher": "fqaJuAdjP5", "criticrating": 2}

CREATE INDEX keywords ON books USING GIN ((data->'keywords') jsonb_path_ops);

demo=# select * from books where data->'keywords' @> '["abc", "keh"]'::jsonb;
   id    |     author      |    isbn    | rating |                                                               data
--------------+-----------------+------------+--------+-----------------------------------------------------------------------------------------------------------------------------------
 1000003 | zEG406sLKQ2IU8O | viPdlu3DZm |      4 | {"tags": {"nk263020": {"ik203820": "iv817928"}}, "keywords": ["abc", "kef", "keh"], "publisher": "7NClevxuTM", "criticrating": 2}
 1000004 | GCe9NypHYKDH4rD | so6TQDYzZ3 |      4 | {"tags": {"nk780341": {"ik397357": "iv632731"}}, "keywords": ["abc", "kef", "keh"], "publisher": "fqaJuAdjP5", "criticrating": 2}
(2 rows)

demo=# explain analyze select * from books where data->'keywords' @> '["abc", "keh"]'::jsonb;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on books  (cost=54.75..1049.75 rows=1000 width=158) (actual time=0.026..0.028 rows=2 loops=1)
   Recheck Cond: ((data -> 'keywords'::text) @> '["abc", "keh"]'::jsonb)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on keywords  (cost=0.00..54.50 rows=1000 width=0) (actual time=0.014..0.014 rows=2 loops=1)
         Index Cond: ((data -> 'keywords'::text) @> '["abc", "keh"]'::jsonb)
 Planning Time: 0.131 ms
 Execution Time: 0.063 ms
(7 rows)

The order of the items in the array on the right does not matter. For example, the following query would return the same result as the previous:

demo=# explain analyze select * from books where data->'keywords' @> '["keh","abc"]'::jsonb;

All elements in the right side array of the containment operator need to be present - basically like an "AND" operator. If you want "OR" behavior, you can construct it in the WHERE clause:

demo=# explain analyze select * from books where (data->'keywords' @> '["abc"]'::jsonb OR data->'keywords' @> '["keh"]'::jsonb);

More details on the behavior of the containment operators with arrays can be found in the documentation.

SQL/JSON & JSONPath

The SQL:2016 standard added support for JSON to SQL (SQL/JSON). With the PostgreSQL 12 and 13 releases, PostgreSQL has one of the best implementations of the SQL/JSON standard. For more details, refer to the PostgreSQL 12 announcement.

One of the core features of SQL/JSON is support for the JSONPath language to query JSONB data. JSONPath lets you specify an expression (using a syntax similar to property access notation in JavaScript) to query your JSONB data. This makes querying JSONB simple and intuitive, yet very powerful. Think of JSONPath as the logical equivalent of XPath for XML.

Some commonly used JSONPath accessors:

  • .key - Returns an object member with the specified key.
  • [*] - Wildcard array element accessor that returns all array elements.
  • .* - Wildcard member accessor that returns the values of all members located at the top level of the current object.
  • .** - Recursive wildcard member accessor that processes all levels of the JSON hierarchy of the current object and returns all the member values, regardless of their nesting level.

Refer to JSONPath documentation for the full list of operators. JSONPath also supports a variety of filter expressions.
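
As a quick standalone illustration of the accessors (using literal values rather than the books table):

-- .key and [*] chained together: returns one row per price (100 and 50)
select jsonb_path_query(
         '{"prints": [{"style": "hc", "price": 100}, {"style": "pb", "price": 50}]}'::jsonb,
         '$.prints[*].price');

-- .* wildcard member accessor: returns every top-level value (1 and {"c": 2})
select jsonb_path_query('{"a": 1, "b": {"c": 2}}'::jsonb, '$.*');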

JSONPath Functions

PostgreSQL 12 provides several functions to use JSONPath to query your JSONB data. From the docs:

  • jsonb_path_exists - Checks whether JSONB path returns any item for the specified JSON value.
  • jsonb_path_match - Returns the result of a JSONB path predicate check for the specified JSONB value. Only the first item of the result is taken into account. If the result is not Boolean, then null is returned (a short sketch follows this list).
  • jsonb_path_query - Gets all JSONB items returned by JSONB path for the specified JSONB value. There are also a couple of other variants of this function that handle arrays of objects.
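
Of these, jsonb_path_exists and jsonb_path_query are demonstrated below. As a minimal sketch of jsonb_path_match (assuming the same books table, with 1000029 being the sample book used later in this post):

-- Returns true/false for a predicate path expression
select id, jsonb_path_match(data, '$.criticrating > 3') as highly_rated
from books
where id = 1000029;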

Let's start with a simple query - finding books by publisher:

demo=# select * from books where data @@ '$.publisher == "ktjKEZ1tvq"';
id | author | isbn | rating | data
--------------+-----------------+------------+--------+----------------------------------------------------------------------------------------------------------------------------------
1000001 | 4RNsovI2haTgU7l | GwSoX67gLS | 2 | {"tags": {"nk542369": {"ik55240": "iv305393"}}, "keywords": ["abc", "def", "geh"], "publisher": "ktjKEZ1tvq", "criticrating": 0}
(1 row)

demo=# explain analyze select * from books where data @@ '$.publisher == "ktjKEZ1tvq"';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on books (cost=21.75..1014.25 rows=1000 width=158) (actual time=0.123..0.124 rows=1 loops=1)
Recheck Cond: (data @@ '($."publisher" == "ktjKEZ1tvq")'::jsonpath)
Heap Blocks: exact=1
-> Bitmap Index Scan on datagin (cost=0.00..21.50 rows=1000 width=0) (actual time=0.110..0.110 rows=1 loops=1)
Index Cond: (data @@ '($."publisher" == "ktjKEZ1tvq")'::jsonpath)
Planning Time: 0.137 ms
Execution Time: 0.194 ms
(7 rows)

You can rewrite this expression as a JSONPath filter:

demo=# select * from books where jsonb_path_exists(data,'$.publisher ?(@ == "ktjKEZ1tvq")');

You can also use very complex query expressions. For example, let's select books where print style = hardcover and price = 100:

select * from books where jsonb_path_exists(data, '$.prints[*] ?(@.style=="hc" && @.price == 100)');

However, index support for the JSONPath functions is very limited at this point, which makes it risky to use them in the WHERE clause - they typically force a sequential scan, as shown below. JSONPath support for indexes should improve in subsequent releases.

demo=# explain analyze select * from books where jsonb_path_exists(data,'$.publisher ?(@ == "ktjKEZ1tvq")');
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Seq Scan on books (cost=0.00..36307.24 rows=333340 width=158) (actual time=0.019..480.268 rows=1 loops=1)
Filter: jsonb_path_exists(data, '$."publisher"?(@ == "ktjKEZ1tvq")'::jsonpath, '{}'::jsonb, false)
Rows Removed by Filter: 1000028
Planning Time: 0.095 ms
Execution Time: 480.348 ms
(5 rows)

Projecting Partial JSON

Another great use case for JSONPath is projecting partial JSONB from the row that matches. Consider the following sample JSONB:

demo=# select jsonb_pretty(data) from books where id = 1000029;
              jsonb_pretty
----------------------------------------
 {
     "tags": {
         "nk678947": {
             "ik159670": "iv32358"
         }
     },
     "prints": [
         {
             "price": 100,
             "style": "hc"
         },
         {
             "price": 50,
             "style": "pb"
         }
     ],
     "braille": false,
     "keywords": [
         "abc",
         "kef",
         "keh"
     ],
     "hardcover": true,
     "publisher": "ppc3YXL8kK",
     "criticrating": 3
 }

Select only the publisher field:

demo=# select jsonb_path_query(data, '$.publisher') from books where id = 1000029;
jsonb_path_query
-----------------------
"ppc3YXL8kK"
(1 row)

Select the prints field (which is an array of objects):

demo=# select jsonb_path_query(data, '$.prints') from books where id = 1000029;
jsonb_path_query
--------------------------------------------------------------------
[{"price": 100, "style": "hc"}, {"price": 50, "style": "pb"}]
(1 row)

Select the first element in the array prints:

demo=# select jsonb_path_query(data, '$.prints[0]') from books where id = 1000029;
jsonb_path_query
------------------------------------
{"price": 100, "style": "hc"}
(1 row)

Select the last element in the array prints:

demo=# select jsonb_path_query(data, '$.prints[last]') from books where id = 1000029;
jsonb_path_query
-----------------------------------
{"price": 50, "style": "pb"}
(1 row)

Select only the hardcover prints from the array:

demo=# select jsonb_path_query(data, '$.prints[*] ?(@.style=="hc")') from books where id = 1000029;
       jsonb_path_query
------------------------------------
 {"price": 100, "style": "hc"}
(1 row)

We can also chain the filters:

demo=# select jsonb_path_query(data, '$.prints[*] ?(@.style=="hc") ?(@.price ==100)') from books where id = 1000029;
jsonb_path_query
------------------------------------
{"price": 100, "style": "hc"}
(1 row)

In summary, PostgreSQL provides a powerful and versatile platform to store and process JSON data. There are several gotchas that you need to be aware of, but we are optimistic that they will be addressed in future releases.









