A Tale of Two Connection Pools

#cockroachdb #connectionpooling #distributedsql

It was the best of pools; it was the worst of pools...
Ok, enough with the Dickensian references.

Connection Pools

I have recently had two contrasting experiences with application-side, connection pooling frameworks when connecting to CockroachDB, and I thought I would share some of my findings.

I am a Solutions Engineer at Cockroach Labs, so I often work with customers and prospects on testing out their codebase running against a CockroachDB cluster. CockroachDB is a distributed SQL database (meaning a CockroachDB cluster is made up of many nodes). This distributed DNA comes with obvious advantages like being about to do a rolling restart of the cluster nodes and not having to take the database down. However, it's a good idea to make sure that your application is configured to handle a node restarts.

For you application to gracefully handle node restarts, we typically recommend:

Load Balancer - Point to a load balancer that is round-robin-ing traffic among the CRDB nodes and monitoring for DB node health (and taking dead nodes out of the spray). This can be a simple L4, TCP load balancer. If you're running CockroachDB in the our managed service offering, this load balancer is provided for you.
Connection Pool - Use a connection pool in your code (HikariCP is great for Java). We would recommend a few settings:
- max pool size (recommendations on size here: https://www.cockroachlabs.com/docs/stable/connection-pooling.html#sizing-connection-pools)
- max connection lifetime - set to maybe 8 hours and tune it down as necessary; this will make sure that your pool has a balanced number of connections within 8 hours of a rolling restart event.
Include Retry Logic in your code - We recommend catching and retrying serialization retry (i.e., 40001) errors in code. You can add a similar class of errors for 57* errors and 08* errors (see: https://www.postgresql.org/docs/current/errcodes-appendix.html)

Here's an example of a project I wrote in C# that shows this retry logic in action.

These three elements (LB + CP + Retry Logic) interact with each other, and with all three in place, you will have a fool-proof approach to handling restarts in your code with zero impact.

Here's a diagram to shows how this interaction works.

How to handle leaking state across session

Connection pools are great in that they save us the latencies associated with opening new connections to our database, and they allow us to multiplex lots of queries across a fewer number of connections to the database.

Now, let's talk about one risk associated with leveraging them.

Each database connection instance has various metadata associated with it. In the CockroachDB world (which is Postgres compatible), one such example of state is "session variables." For instance, there is a session variable called statement_timeout which, by default, has a value of 0 (which means unlimited). Suppose you set this variable to 60s, do something with the connection and then return it to the pool. Then, the next process to retrieve that particular connection from the pool will have the statement_timeout value set to 60s and will have no notice that they're not getting a "clean" session/connection.

To battle this idea of state leakage, some connection pool instances automatically run commands on connections when they're returned to the pool whose purpose is to restore all the state back to defaults.

I was working on a proof-of-concept recently with a company who writes their code in C#/.NET and they were using NPGSQL (which is a Postgres driver library for .NET) to connect to CockroachDB. When you use the connection pool builtin to this project and close a connection, NPGSQL runs several commands against the connection:

CLOSE ALL;
UNLISTEN *;
SELECT pg_advisory_unlock_all();
DISCARD SEQUENCES;
DISCARD TEMP;

It makes sense that a Postgres-specific driver would run Postgres-specific commands against a data source that is known to be Postgres (or Postgres-compatible).

However, in the Java world, developers have many connection pooling toolsets from which to choose. A popular choice is HikariCP. HikariCP is not specific to a particular DB (like Postgres), but can be used against any JDBC Data Source (Oracle, MySQL, etc.). I have always assumed that Hikari does similar "resetting" logic but I tested that assumption this week and learned that it does not. I suppose this makes sense since it would have to have knowledge about the "right" commands for each DB platform in order to reset state properly.

So, I went about trying to make Hikari do this resetting for me.

Code Testing

First, I setup some simple code to show the problem:



    public void showBroken() throws SQLException  {

        try {
            HikariDataSource ds = new HikariDataSource();
            ds.setJdbcUrl("jdbc:postgresql://localhost:26257/default?sslmode=require");
            ds.setUsername("roach");
            ds.setPassword("roach");
            ds.setMaximumPoolSize(1);

            Connection conn1 = ds.getConnection();
            Statement stmt1 = conn1.createStatement();
            ResultSet rs1 = stmt1.executeQuery("show statement_timeout;");
            System.out.println("value of statement_timeout before messing with it: ");
            while (rs1.next()) {
                System.out.println(rs1.getString(1));
            }
            Statement stmt2 = conn1.createStatement();
            stmt2.execute("set statement_timeout = '60s';");
            ResultSet rs2 = stmt1.executeQuery("show statement_timeout;");
            System.out.println("value of statement_timeout after setting it: ");
            while (rs2.next()) {
                System.out.println(rs2.getString(1));
            }
            conn1.close();

            Connection conn2 = ds.getConnection();
            Statement stmt3 = conn2.createStatement();
            ResultSet rs3 = stmt3.executeQuery("show statement_timeout;");
            System.out.println("value of statement_timeout in new connection without changing anything: ");
            while (rs3.next()) {
                System.out.println(rs3.getString(1));
            }
            conn2.close();

            ds.close();
        }
        catch(Exception ex) {
            System.out.println(ex.toString());
        }


    }

When I run this code, I get this output:



value of statement_timeout before messing with it: 
0
value of statement_timeout after setting it: 
60000
value of statement_timeout in new connection without changing anything: 
60000

As you can see, we're "leaking" the 60s value between sessions. Not good!

I found one suggestion from the author of HikariCP on how to address this, which I implemented and it worked. However, there are additional classes involved, and it feels a little clunky and hard to follow.

So, I reached out to my colleague, Kai Niemi, who is a seriously smart dude, and he gave me two other suggestions which I found to be much cleaner.

Option 1 - Use a proxy

In this example, we use some Java Proxy + Reflection magic to create a HikariCP pool instance but we give ourselves a "hook" into the pooling events where we can insert a call to "DISCARD ALL;"



    public void runTest2() throws SQLException  {

        try {
            HikariDataSource ds = new HikariDataSource() {
                @Override
                public Connection getConnection() throws SQLException {
                    Connection delegate = super.getConnection();
                    return (Connection) Proxy.newProxyInstance(
                            DataSource.class.getClassLoader(),
                            new Class[] {Connection.class},
                            (proxy, method, args) -> {
                                if (method.getName().equals("close")) {
                                    Connection c = (Connection) proxy;
                                    try (Statement s = c.createStatement()) {
                                        s.execute("DISCARD ALL;");
                                    } catch (SQLException e) {
                                        // not much to do, proceed with close
                                    }
                                }
                                return method.invoke(delegate, args);
                            });
                }
            };
            ds.setJdbcUrl("jdbc:postgresql://localhost:26257/default?sslmode=require");
            ds.setUsername("roach");
            ds.setPassword("roach");
            ds.setMaximumPoolSize(1);

            Connection conn1 = ds.getConnection();
            Statement stmt1 = conn1.createStatement();
            ResultSet rs1 = stmt1.executeQuery("show statement_timeout;");
            System.out.println("value of statement_timeout before messing with it: ");
            while (rs1.next()) {
                System.out.println(rs1.getString(1));
            }
            Statement stmt2 = conn1.createStatement();
            stmt2.execute("set statement_timeout = '60s';");
            ResultSet rs2 = stmt1.executeQuery("show statement_timeout;");
            System.out.println("value of statement_timeout after setting it: ");
            while (rs2.next()) {
                System.out.println(rs2.getString(1));
            }
            conn1.close();

            Connection conn2 = ds.getConnection();
            Statement stmt3 = conn2.createStatement();
            ResultSet rs3 = stmt3.executeQuery("show statement_timeout;");
            System.out.println("value of statement_timeout in new connection without changing anything: ");
            while (rs3.next()) {
                System.out.println(rs3.getString(1));
            }
            conn2.close();

            ds.close();
        }
        catch(Exception ex) {
            System.out.println(ex.toString());
        }


    }

The output we get from this code shows that we're not leaking anymore - yay!



value of statement_timeout before messing with it: 
0
value of statement_timeout after setting it: 
60000
value of statement_timeout in new connection without changing anything: 
0

Option 2 - Use a library

In this option, we wrap our HikariCP instance in another object. The nice thing about this option is that it can be used with any connection pooling library (not just HikariCP); but, it does require that you add an additional dependency to your project: https://github.com/jdbc-observations/datasource-proxy



    private void runTest3() {

        try {

            HikariConfig config = new HikariConfig();
            config.setJdbcUrl("jdbc:postgresql://localhost:26257/default?sslmode=require");
            config.setUsername("roach");
            config.setPassword("roach");
            config.setMaximumPoolSize(1);

            DataSource ds = ProxyDataSourceBuilder
                    .create(new HikariDataSource(config))
                    .listener(new JdbcLifecycleEventListenerAdapter() {
                        @Override
                        public void afterGetConnection(MethodExecutionContext executionContext) {
                            Connection c = (Connection) executionContext.getResult();
                            try (Statement s = c.createStatement()) {
                                s.execute("DISCARD ALL;");
                            } catch (SQLException e) {
                                // whatever
                            }
                        }
                    })
                    .build();

            Connection conn1 = ds.getConnection();
            Statement stmt1 = conn1.createStatement();
            ResultSet rs1 = stmt1.executeQuery("show statement_timeout;");
            System.out.println("value of statement_timeout before messing with it: ");
            while (rs1.next()) {
                System.out.println(rs1.getString(1));
            }
            Statement stmt2 = conn1.createStatement();
            stmt2.execute("set statement_timeout = '60s';");
            ResultSet rs2 = stmt1.executeQuery("show statement_timeout;");
            System.out.println("value of statement_timeout after setting it: ");
            while (rs2.next()) {
                System.out.println(rs2.getString(1));
            }
            conn1.close();

            Connection conn2 = ds.getConnection();
            Statement stmt3 = conn2.createStatement();
            ResultSet rs3 = stmt3.executeQuery("show statement_timeout;");
            System.out.println("value of statement_timeout in new connection without changing anything: ");
            while (rs3.next()) {
                System.out.println(rs3.getString(1));
            }
            conn2.close();

        }
        catch(Exception ex) {
            System.out.println(ex.toString());
        }

    }

Again, we see that the output proves that our leak has been plugged.



value of statement_timeout before messing with it: 
0
value of statement_timeout after setting it: 
60000
value of statement_timeout in new connection without changing anything: 
0