pg_stat_statements can identify long queries, but it is limited when we need to go further: the queries are normalized, with values replaced by parameters, and the metrics gathered are cumulative timings (min, max, average, standard deviation), usually over a long time window. It is hard to catch the statements responsible for the activity at peak load, and once caught, it is not easy to reproduce the execution plan or to know which user or application ran them.
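For example, this is the typical starting point, and it shows the limitation (a sketch; total_time and mean_time are the column names of the PostgreSQL 11 pg_stat_statements that YugabyteDB ships, renamed total_exec_time and mean_exec_time in PostgreSQL 13):

-- cumulative, normalized view: no execution plan, no user or application context
select calls,
       round(total_time::numeric,2) total_ms,
       round(mean_time::numeric,2)  mean_ms,
       query
 from pg_stat_statements
 order by total_time desc
 limit 5;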
pg_stat_activity holds this information, and here is a small script to sample it. The goal of sampling the running statements is to start the investigation by looking at the top queries. However, they are all different, with different parameters, so what I'm doing here is aggregating the ones that share the same execution plan. For each active query in pg_stat_activity, I run an explain and remove the parts that contain values (Filter and Index Cond). Then I take an md5 of the result, and this becomes my "Plan Hash Value".
The function ignores the statements that cannot be explained, so this works only when the command is a single query.
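To illustrate the idea on a single statement, here is a standalone sketch (my own demo, not part of the installation below): two queries that differ only in their literal value produce the same hash once the value-bearing tags are stripped by the same regexp the function uses.

do $$
declare
 q text;
 plan_xml text;
begin
 foreach q in array array[
  $sql$select * from pg_class where relname='a'$sql$,
  $sql$select * from pg_class where relname='b'$sql$
 ] loop
  execute format('explain (format xml, costs false) %s',q) into plan_xml;
  -- same plan shape, different literal => same md5 after stripping
  raise notice '% -> %',q,md5(regexp_replace(plan_xml,
   '(<Item>|<Filter>|<Index-Cond>|<Recheck-Cond>|<Join-Filter>|<Merge-Cond>|<Hash-Cond>)[^<]*(</)',
   '\1\2','g'));
 end loop;
end $$;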
This installs my sampling function and the table to store the plans:
drop table if exists yb_ash_plans;
drop function if exists yb_ash_sample;

create table if not exists yb_ash_plans(
 plan_hash_value text primary key,
 samples int,
 running_for interval,
 last_plan text,
 last_pg_stat_activity jsonb
);

create or replace function yb_ash_sample() returns int as $$
declare
 psa record;
 plan_xml text;
 plan_nofilter text;
 plan_md5 text;
 last_running_for interval;
 count_updated int:=0;
begin
 for psa in (select * from pg_stat_activity where state='active' and pid!=pg_backend_pid()) loop
  plan_xml:=null; -- reset so that a failed explain does not reuse the previous plan
  begin
   --raise notice 'SQL %',psa.query;
   execute format('explain (format xml, costs false, buffers false) %s',psa.query) into plan_xml;
  exception when others then
   if sqlstate not like '42%' then
    raise warning '% % %',sqlstate,sqlerrm,psa.query;
   end if;
  end;
  if plan_xml is not null then
   --raise notice 'XML %',plan_xml;
   -- remove the parts that hold literal values, then hash the remaining plan shape
   plan_nofilter:=regexp_replace(plan_xml,'(<Item>|<Filter>|<Index-Cond>|<Recheck-Cond>|<Join-Filter>|<Merge-Cond>|<Hash-Cond>)[^<]*(</)','\1\2','g');
   plan_md5:=md5(plan_nofilter);
   --raise notice e'PHV %\n%\n%',plan_md5,plan_xml,plan_nofilter;
   update yb_ash_plans set samples=samples+1
    where plan_hash_value=plan_md5
    returning running_for into last_running_for;
   --raise notice e'LRF % <> %',last_running_for,clock_timestamp()-psa.query_start;
   if last_running_for is null then
    -- first time this plan is seen: record it
    insert into yb_ash_plans(plan_hash_value,samples,running_for,last_plan,last_pg_stat_activity)
     values(plan_md5,1,clock_timestamp()-psa.query_start,plan_xml,row_to_json(psa));
    count_updated:=count_updated+1;
   else
    if last_running_for < clock_timestamp()-psa.query_start then
     -- keep the sample with the longest observed running time for this plan
     update yb_ash_plans set
      running_for=clock_timestamp()-psa.query_start,
      last_plan=plan_xml,
      last_pg_stat_activity=row_to_json(psa)
      where plan_hash_value=plan_md5;
     count_updated:=count_updated+1;
    end if;
   end if;
  end if;
 end loop;
 return count_updated;
end;
$$ language plpgsql;
The yb_ash_sample() function reads pg_stat_activity. For each active query, it attempts an explain (format xml), removes the Filter and Index Cond parts with a regexp_replace, and calculates an md5 of the result, which identifies the execution plan.
Then, if the plan is a new one, it is inserted along with the pg_stat_activity information from this sample. If it already exists, the running time is compared: if it is higher, the pg_stat_activity information is updated so that we record the longest-running occurrence of the query; if not, only the sample count is incremented. A high count means that a query either takes long (active during many samples) or is short but frequent enough to be caught by the sampling. This is how I'll identify the statements for which an improvement can have a significant impact on the system.
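As a side note, the update-then-insert sequence in the function could be condensed into a single upsert. Here is a minimal sketch of the equivalent statement as it would appear inside the function (my rewrite, using the same plpgsql variables; greatest() keeps the longest observed running time):

insert into yb_ash_plans as p
 (plan_hash_value,samples,running_for,last_plan,last_pg_stat_activity)
 values(plan_md5,1,clock_timestamp()-psa.query_start,plan_xml,row_to_json(psa))
 on conflict (plan_hash_value) do update set
  samples=p.samples+1,
  running_for=greatest(p.running_for,excluded.running_for),
  last_plan=case when excluded.running_for>p.running_for
                 then excluded.last_plan else p.last_plan end,
  last_pg_stat_activity=case when excluded.running_for>p.running_for
                 then excluded.last_pg_stat_activity else p.last_pg_stat_activity end;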
To collect the statements, I run the sampling in a loop. Sampling every second can take some resources; every 10 seconds may be a safer frequency when running it for a long time.
Run the sampling from psql:

select yb_ash_sample();
\watch 1
or with a loop:
create or replace procedure yb_ash_loop(seconds int default 1) as $$
declare
 n int;
begin
 loop
  select yb_ash_sample() into n;
  raise notice '% Updated statements: %',clock_timestamp(),n;
  commit; -- make the new samples visible to other sessions
  perform pg_sleep(seconds);
 end loop;
end;
$$ language plpgsql;

call yb_ash_loop(1);
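The procedure loops forever, so it should run in a dedicated session. To stop it from another session, you can cancel its backend (a sketch, assuming the call is still visible in pg_stat_activity):

select pg_cancel_backend(pid)
 from pg_stat_activity
 where query like 'call yb_ash_loop%';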
In YugabyteDB, this catches the activity only on the node it is connected to. This may be sufficient when the load is distributed by the smart driver or a reverse proxy, but it is important to keep in mind.
yb_ash_plans stores the plans in XML format, but the most important part is the query text, which you can run and explain yourself. Tuning will probably start with the plans that have the most samples.
select samples,running_for,last_pg_stat_activity->>'query' query from yb_ash_plans order by samples desc limit 10;
Here is a small example:
yugabyte=> select samples,running_for,last_pg_stat_activity->>'query' query from yb_ash_plans order by samples desc limit 10;

 samples |   running_for   |                                    query
---------+-----------------+-----------------------------------------------------------------------------
      54 | 00:00:59.299518 | select pg_sleep(60);
      20 | 00:00:00.066935 | select count(*) from pgbench_history ;
       1 | 00:00:00.087005 | UPDATE pgbench_accounts SET abalance = abalance + -2743 WHERE aid = 42216;
       1 | 00:00:00.014626 | SELECT abalance FROM pgbench_accounts WHERE aid = 23344;
       1 | 00:00:00.01092  | UPDATE pgbench_branches SET bbalance = bbalance + -4999 WHERE bid = 1;
(5 rows)
As a very rough approximation, if you divide the number of samples by the number of sampling runs, you get the average number of active sessions running this statement. Here, pg_sleep(60) has been run once during one-second sampling, so we get approximately 60 samples. The "running for" value is the longest running time that has been observed, and it helps differentiate long-running queries (like this pg_sleep(60), which was sampled after 59.29 seconds of run time) from frequently run short queries (like this select count(*) that was seen 20 times but never observed running for longer than 0.07 seconds).
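For example, if you know the sampler ran every second for 10 minutes, a quick estimate of the average active sessions per plan could look like this (a sketch; the 600 is the assumed number of samples taken, which the table does not store):

select round(samples/600.0,2) avg_active_sessions,
       running_for,
       last_pg_stat_activity->>'query' query
 from yb_ash_plans
 order by 1 desc limit 10;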
I have left the XML plan as text so that we can extract some information from it. For example, you can try this:
with top_plans as (
 select samples,running_for,last_pg_stat_activity->>'query' query,last_plan
 from yb_ash_plans order by samples desc limit 5
), top_plan_elements as (
 select samples,running_for
 ,json_build_object(
   'tables',json_agg(distinct regexp_replace(plan_relation,'.*<Relation-Name>(.*)</Relation-Name>.*','\1')),
   'aliases',json_agg(distinct regexp_replace(plan_alias,'.*<Alias>(.*)</Alias>.*','\1')),
   'nodes',json_agg(distinct regexp_replace(plan_node,'.*<Node-Type>(.*)</Node-Type>.*','\1')),
   'conditions',json_agg(distinct regexp_replace(plan_cond,'.*<.*-Cond>(.*)</.*-Cond>.*','\1')),
   'filters',json_agg(distinct regexp_replace(plan_filter,'.*<Filter>(.*)</Filter>.*','\1'))
  ) plan_elements
 ,query
 from top_plans
 ,lateral (select * from (select regexp_split_to_table(last_plan,'^ *','n') plan_relation) plan_relation
   where plan_relation like '<Relation-Name>%') plan_relations
 ,lateral (select * from (select regexp_split_to_table(last_plan,'^ *','n') plan_alias) plan_alias
   where plan_alias like '<Alias>%') plan_aliases
 ,lateral (select * from (select regexp_split_to_table(last_plan,'^ *','n') plan_node) plan_nodes
   where plan_node like '<Node-Type>%') plan_nodes
 ,lateral (select * from (select regexp_split_to_table(last_plan,'^ *','n') plan_filter) plan_filter
   where plan_filter like '<Filter>%') plan_filters
 ,lateral (select * from (select regexp_split_to_table(last_plan,'^ *','n') plan_cond) plan_cond
   where plan_cond like '<%-Cond>%') plan_conds
 group by samples,running_for,query,last_plan
)
select * from top_plan_elements;
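Since last_plan is valid XML, the same elements can also be extracted with xpath() rather than regular expressions. Here is a minimal sketch (assuming the plans parse with the http://www.postgresql.org/2009/explain namespace that EXPLAIN declares):

select samples,
       xpath('//e:Node-Type/text()', last_plan::xml,
             array[array['e','http://www.postgresql.org/2009/explain']]) nodes,
       last_pg_stat_activity->>'query' query
 from yb_ash_plans
 order by samples desc limit 5;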
Try it during your stress tests and it should show the most active queries, which are those where an optimization can have a significant impact on the system. Any comments welcome.