Frits Hoogland for YugabyteDB

Posted on Jun 20, 2022

Postgres, toast and YugabyteDB

#postgres #yugabyte #debugging #internals

This blogpost is an introduction into the concept of postgres' TOAST (The Oversized-Attribute Storage Technique), and the way we deal with it in YugabyteDB.

The postgres database requires a row to be stored in a single block. Therefore, the total size of a row must be bound by the size of a postgres block, which by default is 8kB. However, there are situations where 8kB is too little for an entire row, or even a single field in a row. This is what TOAST tries to overcome.

First of all a field (also known as attribute, which is how the postgres catalog as well as the documentation calls it) must be eligible for TOAST. Attribute types that are eligible for TOAST are attribute types such as varchar, bytea, text, json, jsonb. These are types that are capable of handling large amounts of data, which can be larger than a postgres block, and are variable of size.

Postgres implements TOAST by evaluating the attributes (fields) of a table and determine if an attribute can potentially store "oversized" data. If that is the case, a toast table is created alongside the original table. This is indicated by the reltoastrelid field in pg_class, which is the OID of a table called pg_toast_OID in the pg_toast schema.

So far, so good: this is all well known, and documented in the postgres documentation and in lots of blogposts.

Fast forward to YugabyteDB. As probably most of the readers are aware, we (Yugabyte) use the postgres source to build a postgres layer we call YSQL (Yugabyte SQL), which uses most of the postgres source functionality, but uses our distributed storage layer called DocDB for persistence and the foundation of the ACID implementation.

Recently, we came across a case in YugabyteDB where retrieving an attribute/column turned out to be slow/slower than expected, which was proven by changing the SQL to retrieve the same column in way that was only slightly different.

This is how the slow execution looked like with explain select:

yugabyte=# explain analyze select data from t1 where cid = 'user-0-0';
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Index Only Scan using ix on t1  (cost=0.00..15.25 rows=100 width=32) (actual time=32.281..141.281 rows=1000 loops=1)
   Index Cond: (cid = 'user-0-0'::text)
   Heap Fetches: 0
 Planning Time: 12.014 ms
 Execution Time: 141.798 ms

And this is the fast execution:

yugabyte=# /*+ indexscan(t1 t1_pkey) */ explain analyze select data from t1 where cid = 'user-0-0';
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Index Scan using t1_pkey on t1  (cost=0.00..15.25 rows=100 width=32) (actual time=24.360..25.536 rows=1000 loops=1)
   Index Cond: (cid = 'user-0-0'::text)
 Planning Time: 0.810 ms
 Execution Time: 25.658 ms

The performance difference is clear: 141.798 versus 25.658 milliseconds.

What is visible, is that an index is used in both cases:

The first case uses an index called 'ix', for which the definition is:

create index ix on t1(cid, id) INCLUDE (ns_id, module, inst_key, pending, data, meta)

In other words: the requested column, data, is added to the index in the include list. This is the reason the plan says "Index Only Scan": cid is part of the index, and the meta column is part of the include list, so the query result can be satisfied from this index.

The second case uses an index too, called 't1_pkey'. This deserves some explanation, because here something YugabyteDB dependent is going on: A table in YugabyteDB is stored in an index like structure based on the primary key. This optimization allows us to combine the table (t1) and the primary key index (t1_pkey). The plan here says it used the index for the selection on cid and looked up the value in the table, in reality what happened is it could obtain the value from the table as if it was part of the index include list.

When looking at performance problems, it's very important to carefully assess where a problem is happening by following a logical approach, and use the correct tool to make that assessment. There is no single tool that can give the correct information at all times.

For most SQL related problems, the first step that must be done is determine if the correct plan is chosen. The above explain analyze output is that first step. In both cases, the expected plan is chosen, for which the actual execution differs quite much in latency.

So that means we must look deeper into the mechanics of how the data is fetched. Sadly, there are no marked roads going from here, such as explain for seeing the planner result and explain analyze for looking at actual runtime statistics.

Both cases are the stripped down versions of the queries, and perform a single task: obtaining a row using an index, which is done a number of times. So there is a high likelihood that something that is performed repeatedly is contributing to the higher response time.

The above logic that something might be executed repeatedly increasing the response time means the most obvious thing to do is to profile the execution using perf NOT collecting backtraces, but simply collecting the active functions:

obtain the pid of the postgres backend:

yugabyte=# select pg_backend_pid();

perf record both the low and high latency execution:

$ sudo perf record -p PID -o fast.perf

The result are the following:

$ sudo perf report -i fast.perf -f --stdio | grep -v ^# | head -n 10
    25.58%  postgres         libc-2.23.so          [.] __memcpy_avx_unaligned
    23.26%  pggate_ybclient  libcrypto.so.1.1      [.] _aesni_ctr32_ghash_6x
    18.60%  pggate_ybclient  libc-2.23.so          [.] __memcpy_avx_unaligned
     6.98%  postgres         libc-2.23.so          [.] __memset_avx2
     4.65%  pggate_ybclient  [kernel.kallsyms]     [k] copy_user_enhanced_fast_string
     2.33%  pggate_ybclient  [kernel.kallsyms]     [k] sk_filter_trim_cap
     2.33%  pggate_ybclient  libev.so.4.0.0        [.] ev_run
     2.33%  pggate_ybclient  libyb_util.so         [.] yb::HdrHistogram::IncrementBy
     2.33%  postgres         libyb_common_base.so  [.] yb::AppendToKey
     2.33%  postgres         postgres              [.] AllocSetReset
$
$ sudo perf report -i slow.perf -f --stdio | grep -v ^# | head -n 10
    85.01%  postgres         postgres               [.] pglz_compress
     1.52%  pggate_ybclient  libcrypto.so.1.1       [.] _aesni_ctr32_ghash_6x
     1.52%  postgres         [kernel.kallsyms]      [k] mem_cgroup_charge_common
     1.14%  pggate_ybclient  [kernel.kallsyms]      [k] clear_page_c_e
     0.95%  pggate_ybclient  [kernel.kallsyms]      [k] copy_user_enhanced_fast_string
     0.95%  pggate_ybclient  libc-2.23.so           [.] __memcpy_avx_unaligned
     0.76%  postgres         [kernel.kallsyms]      [k] clear_page_c_e
     0.76%  postgres         [kernel.kallsyms]      [k] get_page_from_freelist
     0.57%  pggate_ybclient  [kernel.kallsyms]      [k] _raw_spin_unlock_irqrestore
     0.57%  postgres         libc-2.23.so           [.] __memcpy_avx_unaligned

In the 'fast' case, we see a number of functions (__memcpy_avx_unaligned, _aesni_ctr32_ghash_6x, __memset_avx2) which you pretty much expect: the main portion of the work is getting the data from the data attribute, so functions indicating pushing bits (memcpy), and encryption (aesni) is what I would expect.

But when looking at the 'slow' case, there is a function that is truly taking the majority of the time: pglz_compress, which it does for 85% of the time. And, very important: this function seems to be absent from the 'fast' case!

At this point, I am a little confused: if compression would be part of this problem, then DEcompression is what you would expect for a select statement, not compression! But we can dig into this a little deeper: at this point we need to understand the call stack of the plgz_compress function, to see what it is doing. This can be done using gdb, the GNU debugger.

A word of warning: attaching a debugger to a process STOPS the process from running. When running a process with a debugger attached, it will be running significantly slower. Doing this in a production environment can cause severe issues, and therefore should only be done when there is a full understanding of all the caveats of using a debugger.

$ gdb -p PID
...lots of messages...
warning: File "/home/yugabyte/yb-software/yugabyte-2.13.3.0-b12-centos-x86_64/linuxbrew/lib/libthread_db.so.1" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.

The YugabyteDB YSQL/postgres backend requires the libthread_db.so.1 library, which is disallowed by the default auto load safe path of gdb. The solution to that problem is to manually add the thread library:

(gdb) add-auto-load-safe-path /home/yugabyte/yb-software/yugabyte-2.13.3.0-b12-centos-x86_64/linuxbrew/lib/libthread_db.so.1

Now we must make the debugger stop when the pglz_compress function is called. This is very simple to do in gdb using the 'break' command:

(gdb) break pglz_compress

Breakpoint 1 at 0xb56e08: file ../../../../../src/postgres/src/common/pg_lzcompress.c, line 530.

YugabyteDB is shipped with the debug symbols inside the executable, which makes it possible for gdb to understand where the function is located in the source files. However, we do not ship the source code with the binary distribution, simply because it's not practical. The source code can be obtained from the YugabyteDB GitHub repository: we are open source.

Okay, now that we have set a breakpoint on the function pglz_compress, we need make the attached process running again using the c or continue command:

(gdb) c
Continuing.

Now execute the slow query. Because that will execute the pglz_compress function, the debugger will break execution:

Breakpoint 1, pglz_compress (
    source=0x2a1c03c "{\"attr_0\" : 0, \"attr_1\" : 1, \"attr_2\" : 2, \"attr_3\" : 3, \"attr_4\" : 4, \"attr_5\" : 5, \"attr_6\" : 6, \"attr_7\" : 7, \"attr_8\" : 8, \"attr_9\" : 9, \"attr_10\" : 10, \"attr_11\" : 11, \"attr_12\" : 12, \"attr_13\" :"..., slen=6980,
    dest=0x3742040 "tr_0\" : 1000, \"attr_1\" : 1001, \"attr_2\" : 1002, \"attr_3\" : 1003, \"attr_4\" : 1004, \"attr_5\" : 1005, \"attr_6\" : 1006, \"attr_7\" : 1007, \"attr_8\" : 1008, \"attr_9\" : 1009, \"attr_10\" : 1010, \"attr_11\" : 101"..., strategy=0x4545c4 <strategy_default_data>)
    at ../../../../../src/postgres/src/common/pg_lzcompress.c:530
530 ../../../../../src/postgres/src/common/pg_lzcompress.c: No such file or directory.

Now ask gdb for the backtrace:

(gdb) bt
#0  pglz_compress (
    source=0x2a1c03c "{\"attr_0\" : 0, \"attr_1\" : 1, \"attr_2\" : 2, \"attr_3\" : 3, \"attr_4\" : 4, \"attr_5\" : 5, \"attr_6\" : 6, \"attr_7\" : 7, \"attr_8\" : 8, \"attr_9\" : 9, \"attr_10\" : 10, \"attr_11\" : 11, \"attr_12\" : 12, \"attr_13\" :"..., slen=6980,
    dest=0x3742040 "tr_0\" : 1000, \"attr_1\" : 1001, \"attr_2\" : 1002, \"attr_3\" : 1003, \"attr_4\" : 1004, \"attr_5\" : 1005, \"attr_6\" : 1006, \"attr_7\" : 1007, \"attr_8\" : 1008, \"attr_9\" : 1009, \"attr_10\" : 1010, \"attr_11\" : 101"..., strategy=0x4545c4 <strategy_default_data>)
    at ../../../../../src/postgres/src/common/pg_lzcompress.c:530
#1  0x00000000005558ec in toast_compress_datum (value=44154936)
    at ../../../../../../../src/postgres/src/backend/access/heap/tuptoaster.c:1401
#2  0x00000000005026d5 in index_form_tuple (tupleDescriptor=0x2114190,
    values=<optimized out>, isnull=0x21e2400)
    at ../../../../../../../src/postgres/src/backend/access/common/indextuple.c:98
#3  0x00000000005bda99 in ybcFetchNextIndexTuple (ybScan=0x21e22d8,
    index=0x2113e70, is_forward_scan=<optimized out>)
    at ../../../../../../../src/postgres/src/backend/access/yb_access/yb_scan.c:383
#4  ybc_getnext_indextuple (ybScan=0x21e22d8, is_forward_scan=<optimized out>,
    recheck=<optimized out>)
#5  0x00000000005c0303 in ybcingettuple (scan=0x2349ed8,
    dir=ForwardScanDirection)
    at ../../../../../../../src/postgres/src/backend/access/yb_access/yb_lsm.c:440
#6  0x0000000000559e55 in index_getnext_tid (scan=0x2349ed8, direction=6980)
    at ../../../../../../../src/postgres/src/backend/access/index/indexam.c:594
#7  0x000000000077ffcb in IndexOnlyNext (node=<optimized out>)
    at ../../../../../../src/postgres/src/backend/executor/nodeIndexonlyscan.c:129
#8  0x0000000000760287 in ExecScan (node=0x2348328,
    accessMtd=0x77fe80 <IndexOnlyNext>, recheckMtd=0x7804c0 <IndexOnlyRecheck>)
    at ../../../../../../src/postgres/src/backend/executor/execScan.c:171
#9  0x000000000075e409 in ExecProcNodeInstr (node=0x2348328)
    at ../../../../../../src/postgres/src/backend/executor/execProcnode.c:462
#10 0x000000000075388e in standard_ExecutorRun (queryDesc=0x2183f80,
    direction=<optimized out>, count=0, execute_once=<optimized out>)
    at ../../../../../../src/postgres/src/include/executor/executor.h:249
#11 0x00007fed1ebf5c05 in pgss_ExecutorRun (queryDesc=0x2183f80,
    direction=ForwardScanDirection, count=0, execute_once=true)
    at ../../../../../src/postgres/contrib/pg_stat_statements/pg_stat_statements.c:949
#12 0x00007fed1ebec79a in ybpgm_ExecutorRun (queryDesc=0x2183f80,
    direction=ForwardScanDirection, count=0, execute_once=true)
    at ../../../../../src/postgres/contrib/yb_pg_metrics/yb_pg_metrics.c:503
#13 0x00000000006aa3a1 in ExplainOnePlan (plannedstmt=0x20d49d8, into=0x0,
    es=0x2346438,
    queryString=0x1ec2118 "explain analyze select data from t1 where cid = 'user-0-0';", params=<optimized out>, queryEnv=<optimized out>,
    planduration=0x7fffac2cd038)
    at /nfusr/alma8-gcp-cloud/jenkins-worker-ky8wmc/jenkins/jenkins-github-yugabyte-db-alma8-master-clang12-release-435/src/postgres/src/backend/executor/../../../../../../src/postgres/src/backend/executor/execMain.c:307
#14 0x00000000006a9e81 in ExplainOneQuery (query=<optimized out>,
    cursorOptions=<optimized out>, into=0x0, es=0x2346438,
    queryString=0x4545c4 <strategy_default_data> " ", params=0x0, queryEnv=0x0)
    at ../../../../../../src/postgres/src/backend/commands/explain.c:372
#15 0x00000000006a984f in ExplainQuery (pstate=<optimized out>,
    stmt=0x37a08c0,
    queryString=0x1ec2118 "explain analyze select data from t1 where cid = 'user-0-0';", params=0x0, queryEnv=0x0, dest=0x376f640)
    at ../../../../../../src/postgres/src/backend/commands/explain.c:255
#16 0x000000000095849a in standard_ProcessUtility (pstmt=0x37a0a10,
    queryString=0x1ec2118 "explain analyze select data from t1 where cid = 'user-0-0';", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
    dest=0x376f640, completionTag=0x7fffac2cd640 "")
    at ../../../../../../src/postgres/src/backend/tcop/utility.c:699
#17 0x00007fed1ebf5f30 in pgss_ProcessUtility (pstmt=0x2a1c03c,
    queryString=0x1ec2118 "explain analyze select data from t1 where cid = 'user-0-0';", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
    dest=0x376f640, completionTag=0x7fffac2cd640 "")
    at ../../../../../src/postgres/contrib/pg_stat_statements/pg_stat_statements.c:1058
#18 0x00007fed1ebecc17 in ybpgm_ProcessUtility (pstmt=0x2a1c03c,
    queryString=0x1ec2118 "explain analyze select data from t1 where cid = 'user-0-0';", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
    dest=0x376f640, completionTag=0x7fffac2cd640 "")
    at ../../../../../src/postgres/contrib/yb_pg_metrics/yb_pg_metrics.c:689
#19 0x00007fed1ebd162e in pg_hint_plan_ProcessUtility (pstmt=0x2a1c03c,
    queryString=0x1b44 <Address 0x1b44 out of bounds>, context=57942080,
    params=0x4545c4 <strategy_default_data>, queryEnv=0x156f760, dest=0x20,
    completionTag=0x7fffac2cd640 "")
    at ../../../../../src/postgres/third-party-extensions/pg_hint_plan/pg_hint_plan.c:3033
#20 0x0000000000b1fddf in YBTxnDdlProcessUtility (pstmt=0x37a0a10,
    queryString=0x1ec2118 "explain analyze select data from t1 where cid = 'user-0-0';", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
    dest=0x376f640, completionTag=0x7fffac2cd640 "")
    at ../../../../../../../src/postgres/src/backend/utils/misc/pg_yb_utils.c:1443
#21 0x00000000009572b4 in PortalRunUtility (portal=0x1fa2118, pstmt=0x37a0a10,
    isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=0x20,
    at ../../../../../../src/postgres/src/backend/tcop/utility.c:374
#22 0x0000000000956a1d in FillPortalStore (portal=0x1fa2118, isTopLevel=true)
    at ../../../../../../src/postgres/src/backend/tcop/pquery.c:1062
#23 0x0000000000956614 in PortalRun (portal=0x1fa2118,
    count=9223372036854775807, isTopLevel=true, run_once=<optimized out>,
    dest=0x1ec3f10, altdest=0x1ec3f10, completionTag=0x7fffac2cd820 "")
    at ../../../../../../src/postgres/src/backend/tcop/pquery.c:780
#24 0x0000000000952446 in yb_exec_simple_query_impl (query_string=0x1ec2118)
    at ../../../../../../src/postgres/src/backend/tcop/postgres.c:1170
#25 0x0000000000952b2e in yb_exec_query_wrapper (exec_context=0x1ec2000,
    restart_data=0x7fffac2cda50, functor=0x951b90 <yb_exec_simple_query_impl>,
    functor_context=0x1ec2118)
    at ../../../../../../src/postgres/src/backend/tcop/postgres.c:4557
#26 0x000000000094f298 in PostgresMain (argc=1, argv=<optimized out>,
    dbname=<optimized out>, username=0x1eaf378 "yugabyte")
    at ../../../../../../src/postgres/src/backend/tcop/postgres.c:4591
#27 0x00000000008aab9b in BackendRun (port=0x1d361e0)
    at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4555
#28 0x00000000008aa2aa in ServerLoop ()
    at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4198
#29 0x00000000008a6901 in PostmasterMain (argc=<optimized out>, argv=0x1d4c780)
    at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1423
#30 0x00000000007c7233 in PostgresServerProcessMain (argc=23, argv=0x1d4c780)
    at ../../../../../../src/postgres/src/backend/main/main.c:234
#31 0x00000000004f4dd2 in main ()

That is a lot of information. For better understanding, it's probably helpful to manually show it in a different format:

pglz_compress<-toast_compress_datum<-index_form_tuple<-ybcFetchNextIndexTuple<-ybc_getnext_index_type<-ybcingettuple<-index_getnext_tid<-IndexOnlyNext<-ExecScan

When execution the 'fast' query, gdb doesn't break. So the function isn't triggered with the 'fast' query. When looking at the backtrace, you'll see that a little further up the stack, the function index_form_tuple is called. When studying the execution plans, we saw that the 'fast' scan uses the primary key structure, which doesn't contain the 'data' field, which is obtained from the table, and the 'slow' scan uses the index only.

With the following command we can find all functions that match the name specified:

(gdb) info function form_tuple
All functions matching regular expression "form_tuple":

File ../../../../../../../src/postgres/src/backend/access/common/indextuple.c:
void index_deform_tuple(IndexTuple, TupleDesc, Datum *, _Bool *);
IndexTuple index_form_tuple(TupleDesc, Datum *, _Bool *);

File ../../../../../../../src/postgres/src/backend/access/common/heaptuple.c:
void heap_deform_tuple(HeapTuple, TupleDesc, Datum *, _Bool *);
HeapTuple heap_form_tuple(TupleDesc, Datum *, _Bool *);
static void slot_deform_tuple(TupleTableSlot *, int);

File ../../../../../../../src/postgres/src/backend/access/brin/brin_tuple.c:
BrinMemTuple *brin_deform_tuple(BrinDesc *, BrinTuple *, BrinMemTuple *);
BrinTuple *brin_form_tuple(BrinDesc *, BlockNumber, BrinMemTuple *, Size *);

Which shows there is a 'heap_form_tuple' function. If I disable the breakpoint for pglz_compress, and then break for heap_form_tuple, and execute the fast query, the debugger breaks indicating usage of it:

(gdb) i b
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x0000000000b56e08 in pglz_compress
                                                   at ../../../../../src/postgres/src/common/pg_lzcompress.c:530
    breakpoint already hit 1 time
(gdb) dis 1
(gdb) break heap_form_tuple
Breakpoint 2 at 0x500b31: file ../../../../../../../src/postgres/src/backend/access/common/heaptuple.c, line 1085.
(gdb) c
Continuing.

Breakpoint 2, heap_form_tuple (tupleDescriptor=0x1f9f9f8, values=0x2349838,
    isnull=0x2349fa0)
    at ../../../../../../../src/postgres/src/backend/access/common/heaptuple.c:1085
1085    ../../../../../../../src/postgres/src/backend/access/common/heaptuple.c: No such file or directory.
(gdb) bt
#0  heap_form_tuple (tupleDescriptor=0x1f9f9f8, values=0x2349838,
    isnull=0x2349fa0)
    at ../../../../../../../src/postgres/src/backend/access/common/heaptuple.c:1085
#1  0x00000000005bd585 in ybc_getnext_heaptuple (ybScan=0x2349e78,
    is_forward_scan=<optimized out>, recheck=<optimized out>)
    at ../../../../../../../src/postgres/src/backend/access/yb_access/yb_scan.c:309
#2  0x00000000005c0334 in ybcingettuple (scan=0x2349cd8,
    dir=ForwardScanDirection)
    at ../../../../../../../src/postgres/src/backend/access/yb_access/yb_lsm.c:450
#3  0x0000000000559e55 in index_getnext_tid (scan=0x2349cd8,
    direction=37001272)
    at ../../../../../../../src/postgres/src/backend/access/index/indexam.c:594
#4  0x000000000077f462 in index_getnext (scan=<optimized out>,
    direction=<optimized out>)
    at /nfusr/alma8-gcp-cloud/jenkins-worker-ky8wmc/jenkins/jenkins-github-yugabyte-db-alma8-master-clang12-release-435/src/postgres/src/backend/access/index/../../../../../../../src/postgres/src/backend/access/index/indexam.c:744
#5  IndexNext (node=<optimized out>)
    at ../../../../../../src/postgres/src/backend/executor/nodeIndexscan.c:171
#6  0x0000000000760287 in ExecScan (node=0x2348328,
    accessMtd=0x77f2b0 <IndexNext>, recheckMtd=0x77f590 <IndexRecheck>)
 at ../../../../../../src/postgres/src/backend/executor/execScan.c:171
...etc...

Okay. Let's look at the source of index_form_tuple. At line 98 it shows the code that makes the decision, inside a macro named 'TOAST_INDEX_HACK'.

When looking at the heap_form_tuple function, there is no mentioning of compression.

When asking around, it turns out YugabyteDB doesn't perform TOASTing. We can deal with 'oversized' attributes well at the DocDB layer, so nothing is TOASTed.

...so why are large entries obtained via an index compressed in YugabyteDB then??

I decided to now look at how the stack looks like in native postgres for both queries. So I created the schema in a postgres test instance, and had a postgres session setup with gdb attached to the backend, with breaks on index_form_tuple and heap_form_tuple and tested both queries. These did NOT trigger a break in the debugger!

So in postgres natively, the index_form_tuple and heap_form_tuple function are not used in postgres when a query obtains an attribute from an index (index_form_tuple) and from the table (heap_form_tuple), whilst in YugabyteDB I do.

It turns out I can trigger index_form_tuple and heap_form_tuple in postgres, in the following way; first create a simple dummy table:

create table t ( id int);
create index tix on t(id);

Then attach a debugger to that backend, and break on index_form_tuple and heap_form_tuple:

break index_form_tuple
break heap_form_tuple

Now in psql again, perform an insert into table t:

insert into t values (1);

First heap_form_tuple is broken on:

Breakpoint 2, heap_form_tuple (tupleDescriptor=0x25d21a0, values=0x25d1fb0,
    isnull=0x25d1fb8) at heaptuple.c:1067
1067    {
(gdb) bt
#0  heap_form_tuple (tupleDescriptor=0x25d21a0, values=0x25d1fb0,
    isnull=0x25d1fb8) at heaptuple.c:1067
#1  0x00000000006131b5 in ExecMaterializeSlot (slot=slot@entry=0x25d1f50)
    at execTuples.c:806
#2  0x000000000062bcb5 in ExecInsert (mtstate=mtstate@entry=0x25d1b20,
    slot=0x25d1f50, planSlot=0x25d1f50, srcSlot=srcSlot@entry=0x0,
    returningRelInfo=0x25d1a10, estate=estate@entry=0x25d17d0, canSetTag=true)
    at nodeModifyTable.c:301
#3  0x000000000062d28b in ExecModifyTable (pstate=0x25d1b20)
    at nodeModifyTable.c:2243
#4  0x0000000000608e23 in ExecProcNode (node=0x25d1b20)
    at ../../../src/include/executor/executor.h:248
...snipped...

...and when I continue the debugger, and then index_form_tuple:

Breakpoint 1, index_form_tuple (tupleDescriptor=0x7fcf19f786e8,
    values=0x7ffe6341d3a0, isnull=0x7ffe6341d380) at indextuple.c:43
43  {
(gdb) bt
#0  index_form_tuple (tupleDescriptor=0x7fcf19f786e8, values=0x7ffe6341d3a0,
    isnull=0x7ffe6341d380) at indextuple.c:43
#1  0x00000000004c98db in btinsert (rel=0x7fcf19f783c8,
    values=<optimized out>, isnull=<optimized out>, ht_ctid=0x25d1d64,
    heapRel=0x7fcf19f65698, checkUnique=UNIQUE_CHECK_NO, indexInfo=0x25d2090)
    at nbtree.c:200
#2  0x0000000000607f55 in ExecInsertIndexTuples (slot=slot@entry=0x25d1f50,
    tupleid=tupleid@entry=0x25d1d64, estate=estate@entry=0x25d17d0,
    noDupErr=noDupErr@entry=false, specConflict=specConflict@entry=0x0,
    arbiterIndexes=arbiterIndexes@entry=0x0) at execIndexing.c:386
#3  0x000000000062c3bd in ExecInsert (mtstate=mtstate@entry=0x25d1b20,
    slot=0x25d1f50, planSlot=0x25d1f50, srcSlot=srcSlot@entry=0x0,
    returningRelInfo=0x25d1a10, estate=estate@entry=0x25d17d0, canSetTag=true)
    at nodeModifyTable.c:553
#4  0x000000000062d28b in ExecModifyTable (pstate=0x25d1b20)
    at nodeModifyTable.c:2243
#5  0x0000000000608e23 in ExecProcNode (node=0x25d1b20)
    at ../../../src/include/executor/executor.h:248
... snipped ...

Let's look into detail what is happening here. For heap_form_tuple, one frame up there is the function ExecMaterializeSlot. When reading that function I cannot find the heap_form_tuple function, but it turns heap_form_tuple is called inside the function ExecCopySlotTuple, which the compiler seemed to have optimised away.

For index_form_tuple, one frame up there is the function btinsert. Inside that function, index_form_tuple is called.

If you look at the source code of where both heap_form_tuple and index_form_tuple are called, it becomes apparent that these functions are used to create a tuple for storing it in a heap (table) or index, and transform a memory representation to a storage presentation of the tuple.

If we then look back to the YugabyteDB case, index_form_tuple is called as part of READING(!). Why would we call it in YugabyteDB for reading?? Well, "use the source"! So let's look at the source code of ybcFetchNextIndexTuple: we read data from DocDB (YBCPgDmlFetch), and then we create a tuple. So we use index_form_tuple to create a tuple for the postgres layer based on DocDB data.

We do this too for reading from a heap/table, but in the heap_form_tuple function there is no function to compress the data, so we cannot run into pglz_compress there, which is why we only run into the increased time when we obtain the row via the index.

The postgres hackers list shows the reason for the TOAST_INDEXHACK macro: it is to prevent a certain combination of factors to cause an issue.

Conclusion

This journey through debuggers and the source code showed that in YugabyteDB, we use the function index_form_tuple in a different way than PostgreSQL is using it, which means that we use it during reading in YugabyteDB, whilst in PostgreSQL it is used for writing an index entry. In that function, there is a macro that checks for the type and size of an index entry, and will compress it if it's bigger than TOAST_INDEX_TARGET, as a workaround for an issue. In YugabyteDB we don't have that issue, because we are not limited to the block/page size, but still get large index entries compressed.

The workaround is to exclude this field from the index, and let it be fetched from the table. Obviously, we are going to solve this in the YugabyteDB version of index_form_tuple.

DEV Community

Postgres, toast and YugabyteDB

Conclusion

Top comments (0)