Whilst the Rolling Stones are rambling on about having time on their side, the previous post was about measuring idle and active time in the database. But why stop there? Postgres has an implementation of a wait interface, but sadly lacks accumulation of the events and of the time spent in them. What if we use perf probe to measure the occurrence of wait events and the time spent in them?
Wait events (in postgres) are an indicator of database activity that is suspected to cause execution to halt running on CPU, such as performing a disk IO. Wait events in postgres do not provide any timing information: a wait event name is set independently from the session state (pg_stat_activity.state) and from time indicators such as pg_stat_activity.state_change, which is set in pgstat_report_activity when the state changes. A wait event is set before a database activity suspected to halt on-CPU execution, and reset after the activity. Examples of activities that cause waiting are disk IO, locks, waiting for a client response, etc.
The function to set a wait event is pgstat_report_wait_start, and the function to reset a wait event is pgstat_report_wait_end. However, there is something funny about these functions: when I look them up in the postgres executable, they do not exist:
[root@alma-84 ~]# perf probe -x /usr/pgsql-11/bin/postgres --funcs=pgstat_report*
pgstat_report_activity
pgstat_report_analyze
pgstat_report_appname
pgstat_report_autovac
pgstat_report_deadlock
pgstat_report_recovery_conflict
pgstat_report_stat
pgstat_report_tempfile
pgstat_report_vacuum
pgstat_report_xact_timestamp
You can see pgstat_report_activity, which we used in the first blog post about active time measurement to understand state changes, but there is no pgstat_report_wait_start or pgstat_report_wait_end, whilst the source code clearly shows they exist.
It took me a while to realize these functions are being 'inlined' by the compiler, which means the compiler optimized execution by copying the function body into every place where the function is called, so that it doesn't need to jump to the function and return from it. So effectively, in the compiled executable, these functions no longer exist as independent functions.
Does that mean it's impossible to use the vanished wait event functions? No. You can use the debuginfo package for postgres:
yum debuginfo-install postgresql11-server
This adds DWARF debug symbols, which do contain the original functions, and their addresses in the executable.
There is a caveat, however: comparing the postgres development group (PGDG) RPM version with the debuginfo package installed against a version self-compiled from the source code of the same version with the --enable-debug flag, I found a significantly lower number of probe-able instances of the pgstat_report_wait_start and pgstat_report_wait_end functions in the PGDG RPM version plus debuginfo. The reason for this is currently unclear to me.
Can you install debuginfo on a production database? Purely technically, yes: the executable currently in use is not changed, so there is no behavioral or performance change. What debuginfo packages do is install separate files that provide more details about the executable, which tools like gdb (the GNU debugger) and perf can use.
If we take the pgstat_report_activity function for example, that function now can be probed in the following way:
perf probe -x /usr/pgsql-11/bin/postgres 'pgstat_report_activity state:u32 cmd_str:string'
The outcome of running commands that trigger this probe now looks like this:
postmaster 13203  17176.638103: probe_postgres:pgstat_report_activity: (6eea90) state_u32=1 cmd_str_string="^T"
postmaster 13203  17182.623807: probe_postgres:pgstat_report_activity: (6eea90) state_u32=2 cmd_str_string="SELECT n.nspname as "Schema", c.relname as "Name", CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'm' THEN 'materialized view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN 'special' WHEN 'f' THEN 'foreign table' WHEN 'p' THEN 'table' WHEN 'I' THEN 'index' END as "Ty> pg_catalog.pg_get_userbyid(c.relowner) as "Owner" FROM pg_catalog.pg_class c LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE c.relkind IN ('r','p','v','m','S','f','') AND n.nspname <> 'pg_catalog' AND n.nspname <> 'information_schema' AND n.nspname !~ '^pg_toast' AND pg_catalog.pg_table_is_visible(c.oid) ORDER BY 1,2;"
postmaster 13203  17182.655834: probe_postgres:pgstat_report_activity: (6eea90) state_u32=1 cmd_str_string="^T"
postmaster 13203  17187.596582: probe_postgres:pgstat_report_activity: (6eea90) state_u32=2 cmd_str_string="select * from t;"
postmaster 13203  17187.597860: probe_postgres:pgstat_report_activity: (6eea90) state_u32=1 cmd_str_string="^T"
Every time the state is set to 2 (running), cmd_str contains the command the client executed; the elaborate command is psql's '\d' command. Every time the state is set to 1 (idle), there is no command, and '^T' apparently is postgres' way of indicating 'nothing'.
Now with the debuginfo package (or a self-compiled version configured with --enable-debug), we can probe the start and end of wait events:
perf probe -x /usr/pgsql-11/bin/postgres 'pgstat_report_wait_start $params'
perf probe -x /usr/pgsql-11/bin/postgres pgstat_report_wait_end
Together with the probe on pgstat_report_activity, we can determine idle time and running time, and we can measure the time spent waiting. Obviously with the caveat that not all wait events are visible in the PGDG packaged version with the debuginfo addition, making it unpredictable. However, it can be used with a postgres version that is compiled from source with the --enable-debug flag.
[root@alma-84 ~]# perf record -e 'probe_postgres:*' -p 13841
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.033 MB perf.data (14 samples) ]
[root@alma-84 ~]# perf script
postgres 16154  96119.505818: probe_postgres:pgstat_report_wait_end_77: (710e00)
postgres 16154  96119.505835: probe_postgres:pgstat_report_activity: (6bf5a0) state_u32=2 cmd_str_string="create table t (t int);"
postgres 16154  96119.506233: probe_postgres:pgstat_report_wait_start_33: (4f6117)
postgres 16154  96119.506278: probe_postgres:pgstat_report_wait_end_42: (4f6144)
postgres 16154  96119.507787: probe_postgres:pgstat_report_activity: (6bf5a0) state_u32=1 cmd_str_string="^T"
postgres 16154  96119.507799: probe_postgres:pgstat_report_wait_start_59: (710c46) wait_event_info=0x6000000
postgres 16154  96121.418192: probe_postgres:pgstat_report_wait_end_77: (710e00)
postgres 16154  96121.418209: probe_postgres:pgstat_report_activity: (6bf5a0) state_u32=2 cmd_str_string="insert into t select i from generate_series(1,100) i;"
postgres 16154  96121.418428: probe_postgres:pgstat_report_wait_start_52: (708f7f) wait_event_info=0xa000009
postgres 16154  96121.418463: probe_postgres:pgstat_report_wait_end_69: (708fb8)
postgres 16154  96121.418528: probe_postgres:pgstat_report_wait_start_33: (4f6117)
postgres 16154  96121.418541: probe_postgres:pgstat_report_wait_end_42: (4f6144)
postgres 16154  96121.419979: probe_postgres:pgstat_report_activity: (6bf5a0) state_u32=1 cmd_str_string="^T"
postgres 16154  96121.419993: probe_postgres:pgstat_report_wait_start_59: (710c46) wait_event_info=0x6000000
Let me walk you through it (numbers counted starting from the first line after the perf script output):
- At line 1, a wait_end probe fires, indicating a registered wait event is cleared. Because at line 2 we see pgstat_report_activity setting state to 2 (running), it's a fair guess that we were in state 1 (idle), and that the wait event being cleared is 'client read'. We can't know for certain, because pgstat_report_wait_end does not take an argument; it simply clears the wait event.
- At line 2, activity is reported using pgstat_report_activity, and the state is set to 2 (running). The reason for doing so is the execution of the SQL command 'create table t (t int)'.
- At line 3, a wait event is started. Probably because of a compiler optimization, this instance of the wait event cannot be probed with the wait event number carried in the wait_event_info variable.
- At line 4, the wait event is ended. Ending a wait event never carries a wait event number.
- At line 5, pgstat_report_activity sets state to 1 (idle) again.
- At line 6, pgstat_report_wait_start sets the wait state to 0x6000000. That is wait event group number 6 (client), wait number 0 (client read). This is the general indicator of waiting for the client, which we could also infer from the state being set to 1 (idle).
- At line 7, pgstat_report_wait_end indicates the wait ends, because the client sent another command.
- At line 8, pgstat_report_activity sets the state to 2 (running) again, for the execution of 'insert into t select i from generate_series(1,100) i;'.
- At line 9, pgstat_report_wait_start sets the wait state to 0xa000009. That is wait event group number 10 (IO), wait number 9 (data file extend). This indicates growing the data file on disk because of the added data.
- At line 10, pgstat_report_wait_end ends the wait state, indicating the work that this wait event times has been performed.
- At lines 11 and 12, another start and end of a wait event happen without the wait_event_info argument; however, the probe number shows it's the same wait as at line 3.
- At line 13, pgstat_report_activity sets the state to 1 (idle) again.
- At line 14, pgstat_report_wait_start sets the wait state to 0x6000000 (client:client read), indicating the backend is idle and waits for another command to execute.
This obviously is too much detail to be useful directly. However, this data provides a lot of information that is otherwise not available by default, such as the total running time of a SQL statement in the backend (pg_stat_statements only provides planner and executor latency, not full command latency; log_min_duration_statement does provide this time), and also a complete overview of all waits encountered, which allows a better understanding of running on CPU versus waiting.
Here is an awk script I created to profile this data, which groups it per process it finds in the file:
[root@alma-84 ~]# perf script | ./profile.awk
total time in file            1.914175 s  100.00 %
pid: 16154                    1.914175 s  100.00 %
event client:client read   1  1.910393 s   99.80 %
event io:data file extend  1  0.000035 s    0.00 %
event unknown:unknown      2  0.000058 s    0.00 %
state idle                 1  1.910422 s   99.80 %
state running              2  0.003722 s    0.19 %
This allows us to see that the majority of time is spent in the state idle, which aligns perfectly with the event client read, which aims to time the same thing: both take 99.80% of the time of pid 16154, which is 1.914175 seconds.
The total time active, alias running, was 0.003722 seconds, of which 0.000035+0.000058=0.000093 seconds was spent waiting. In other words, 0.000093/0.003722*100≈2.5% of the time of this execution was spent waiting. So tuning IO will potentially get you at most 2.5% back.
Of course this overview is a bit silly, but I don't think it requires a lot of imagination to see how this can be used to profile an execution with problematic performance, to see where the time is spent.