Frits Hoogland for YugabyteDB Distributed PostgreSQL Database

Posted on Aug 24, 2021

Postgres query execution: JDBC

#postgres #internals #database #performance

This is a series of blogposts intended to look into how postgres query execution works, so you can get a better understanding, and understand what is happening and what you see. This post is looking at JDBC.

recap

In the previous posts I described how query execution works when postgres is used with the psql (Yugabyte: ysqlsh) CLI tool. These tools use the postgres 'simple query protocol'. The simple query protocol uses a single method of communicating with a postgres backend, and always performs the mandatory steps of parsing, rewriting, planning and execution for the execution of a statement.

jdbc. which jdbc?

The purists will point out that JDBC is not a product but a standard. And indeed, outside of the official postgres driver "PGJDBC" there are other drivers, such as PGJDBC-NG.

I simply do not know what exactly everyone is using, and I think it's impossible to know, since it's freely available, and thus anyone can download a driver from anywhere and use it. However, based on what I encounter with clients and how easy it is to obtain and use it, I do assume the majority of people are using the standard postgres JDBC driver.

I will refer to PGJDBC as 'jdbc' in this article.

jdbc

How does jdbc use postgres, and more importantly: does that provide any optimizations? It does, because by default jdbc uses the extended query protocol, which allows jdbc as a client to very specifically execute the steps of parsing, binding (which includes the planning step) and execution.

tcpdump: simple query protocol

A really simple way to see this is using tcpdump. If you invoke tcpdump in the following way: tcpdump port 5432 -i lo -A, and connect to postgres using localhost, it will print the packet information (which is what tcpdump does in general), and it will print the packet as ASCII.

This is how a packet flow of a query of select 'a' looks like when executed via psql:

08:45:52.630251 IP6 localhost.46454 > localhost.postgres: Flags [P.], seq 160:177, ack 883, win 359, options [nop,nop,TS val 1685027797 ecr 1685023140], length 17
`....1.@.................................v.8...@.......g.9.....
do{.doi.Q....select 'a';.
08:45:52.630433 IP6 localhost.postgres > localhost.46454: Flags [P.], seq 883:949, ack 177, win 342, options [nop,nop,TS val 1685027797 ecr 1685027797], length 66
`..a.b.@.................................8.v.......Q...V.j.....
do{.do{.T...!..?column?...................D..........aC....SELECT 1.Z....I
08:45:52.630440 IP6 localhost.46454 > localhost.postgres: Flags [.], ack 949, win 359, options [nop,nop,TS val 1685027797 ecr 1685027797], length 0
`.... .@.................................v.8...Q.......g.(.....
do{.do{.

This shows 3 packets. The first packet is the client sending a packet to the postgres backend (server). The most important thing to spot is the Q indicating the simple query protocol to the backend, and the query itself: select 'a'.

The second packet comes from the backend and is the response to the query, and sends back multiple flags: T: row description, D: copy data, C: close, Z: ready for query, I: empty query response (commandcomplete).

The third packet doesn't seem to contain anything, and indeed it doesn't. The reason for the packet is the TCP protocol requires a packet to be acknowledged, which is what this packet does.

tcpdump: extended query protocol

If the same exercise is done using a java class via JDBC:

10:12:43.358478 IP localhost.34296 > localhost.postgres: Flags [P.], seq 266:320, ack 415, win 350, options [nop,nop,TS val 4123921496 ecr 4123916705], length 54
E..j    s@.@.3............8...........^.^.....
...X....P.....select 'a'...B............D....P.E... .....S....
10:12:43.358723 IP localhost.postgres > localhost.34296: Flags [P.], seq 415:491, ack 320, win 342, options [nop,nop,TS val 4123921496 ecr 4123921496], length 76
E.....@.@.j\.........8.............V.t.....
...X...X1....2....T...!..?column?...................D..........aC....SELECT 1.Z....I
10:12:43.358728 IP localhost.34296 > localhost.postgres: Flags [.], ack 491, win 350, options [nop,nop,TS val 4123921496 ecr 4123921496], length 0
E..4    t@.@.3N...........8......."...^.(.....
...X...X

This again shows 3 packets. The first packet is the client sending a packet to the postgres backend (server). The packet payload shows much more is going on here: P: the parse command and the query, B: bind, D: message is data row, E: execute, S: synchronize. This shows the different stages of execution being requested. The most important thing here is to see that despite the execution is broken down into multiple commands, it's still sent inside a single packet.

The second packet comes from the backend and is the response to the the query. The response is pretty much identical to the simple query, because the result of the query is the same: T: row description, D: copy data, C: close, Z: ready for query, I: empty query response (commandcomplete).

The third packet is the acknowledgement of the client for the server packet.

traffic cop

The idea of showing the tcpdump contents is to show the difference between the simple query protocol request (Q), and the extended query protocol request (P,B,E). These individual requests for part of query processing can be found with the letters indicating these steps in the source code:

traffic cop: parse
traffic cop: bind
traffic cop: execute

If you then take the required steps, this is where these are executed:

P / parse
exec_parse_message
ㄴ pg_parse_query
  ㄴ raw_parser                  Parse (syntactic)
ㄴ parse_analyze_varparams       Parse (semantic)
ㄴ pg_rewrite_query              Rewrite
B / bind
exec_bind_message
ㄴ GetCachedPlan
  ㄴ BuildCachedPlan
    ㄴ pg_plan_queries
      ㄴ pg_plan_query
        ㄴ planner               Planner
E / execute
exec_execute_message
ㄴ PortalRun
  ㄴ PortalRunSelect
    ㄴ ExecutorRun               Executor

performance

It has been mentioned that execution via the extended protocol allows optimization because it can specifically address the different stages of query execution. But how does that work? Will you get instant performance benefits?

The answer to that is no: if you execute SQL via a regular Statement objects in jdbc, you will do the parse, rewrite, planning and execution steps, performing identical steps as executing it via the simple query protocol does, and thus spend identical time on it.

This changes when prepared statements are used. When prepared statements are prepared once and executed multiple times, then after the first execution of a prepared statement the execution can take advantage of using an already prepared statement on the client/java side and potentially on the server side. This is the topic of the next blogpost.

pg_stat_statements

pg_stat_statements shows the execution time only, or the planner and execution time in newer versions.

log_min_duration_statement

log_min_duration_statement, when configured and if the set time limit is exceeded, shows the time spent in exec_parse_message, exec_bind_message and exec_execute_message individually:

2021-08-17 12:44:02.766 UTC [9813] LOG:  duration: 0.110 ms  parse <unnamed>: select 'a'
2021-08-17 12:44:02.767 UTC [9813] LOG:  duration: 0.040 ms  bind <unnamed>: select 'a'
2021-08-17 12:44:02.767 UTC [9813] LOG:  duration: 0.012 ms  execute <unnamed>: select 'a'

conclusion

Jdbc, or what this article is specifically looking at: PGJDBC allows java to communicate with postgres. There are more postgres jdbc projects.

Jdbc implements the extended query protocol by default. If regular execution is performed, there is no significant difference between psql/ysqlsh simple query execution or jdbc extended protocol usage.

The extended protocol does request the different steps of parse, bind and execute independently, but these are inside a single network packet, so they don't require additional network roundtrips.

With log_min_duration_statement the individual steps are timed as such, so it allows you to build a timing profile of parse, bind (plan) and execute for a statement. pg_stat_statements does keep timing information, but within its limitations.

Top comments (2)

Tatsuo Ishii • Sep 18 '22

Thanks for great article.

The second packet comes from the backend and is the response to the query, and sends back multiple flags: T: row description, D: copy data, C: close, Z: ready for query, I: empty query response (commandcomplete).

I think 'C' means "Command Complete", rather than close. 'C' message means "Close" when the message is sent from client, but it means "Command Complete" when the message is sent from backend.

Also "I" does not mean empty query response here. It's part of the ready for query message and 'I' means now backend is in idle state.

Frits Hoogland • Sep 18 '22

Thank you for the additions. I made some guesses, and guess I should have spent more time on it and read the source code to get the true meaning.

DEV Community