ScaleGrid for ScaleGrid

Originally published at scalegrid.io

PyMongo Tutorial: Testing MongoDB Failover in Your Python App

Python is a powerful and flexible programming language used by millions of developers around the world to build their applications. It comes as no surprise that Python developers commonly use MongoDB, the most popular NoSQL database, for their deployments because of its flexible nature and lack of schema requirements.

So, what's the best way to use MongoDB with Python? PyMongo is a Python distribution containing tools for working with MongoDB, and the recommended Python MongoDB driver. It is a fairly mature driver that supports most of the common operations with the database, and you can check out this tutorial for an introduction to the PyMongo driver.

When deploying in production, it's highly recommended to set up MongoDB in a replica set configuration so your data is geographically distributed for high availability. It is also recommended that SSL connections be enabled to encrypt the client-database traffic. We often test the failover characteristics of various MongoDB drivers to qualify them for production use cases, or when our customers ask us for advice. In this post, we show you how to connect to an SSL-enabled MongoDB replica set configured with self-signed certificates using PyMongo, and how to test MongoDB failover behavior in your code.

Connecting to MongoDB SSL Using Self-Signed Certificates

The first step is to ensure that the right versions of PyMongo and its dependencies are installed. This guide helps you in sorting out the dependencies, and the driver compatibility matrix can be found here.

The mongo_client.MongoClient parameters of interest here are ssl and ssl_ca_certs. To connect to an SSL-enabled MongoDB endpoint that uses a self-signed certificate, ssl must be set to True and ssl_ca_certs must point to the CA certificate file.

If you are a ScaleGrid customer, you can download the CA certificate file for your MongoDB clusters from the ScaleGrid console as shown here:

Download the CA Certificate File for Your MongoDB Clusters From the ScaleGrid Console

So, a connection snippet would look like:

>>> import pymongo
>>> MONGO_URI = 'mongodb://rwuser:@SG-example-0.servers.mongodirector.com:27017,SG-example-1.servers.mongodirector.com:27017,SG-example-2.servers.mongodirector.com:27017/admin?replicaSet=RS-example&ssl=true'
>>> client = pymongo.MongoClient(MONGO_URI, ssl = True, ssl_ca_certs = '')
>>> print("Databases - " + str(client.list_database_names()))
Databases - ['admin', 'local', 'test']
>>> client.close()
>>>

If you are using your own self-signed certificates where hostname verification might fail, you will also have to set the ssl_match_hostname parameter to False. As the driver documentation notes, this is not recommended because it makes the connection susceptible to man-in-the-middle attacks.
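The option names above belong to the PyMongo 3.x series used in this post. PyMongo 4.x removed ssl_ca_certs and ssl_match_hostname in favor of tlsCAFile and tlsAllowInvalidHostnames. As a minimal sketch, assuming you want a single code path for both series, a small helper could build the keyword arguments. The name tls_client_options and its parameters are hypothetical and not part of PyMongo:

```python
def tls_client_options(ca_file, pymongo_major=3, verify_hostname=True):
    """Return MongoClient keyword arguments for TLS with a private CA.

    Hypothetical helper for illustration; pymongo_major would typically
    come from pymongo.version_tuple[0].
    """
    if pymongo_major >= 4:
        # PyMongo 4.x option names.
        options = {"tls": True, "tlsCAFile": ca_file}
        if not verify_hostname:
            # Not recommended: disables protection against MITM attacks.
            options["tlsAllowInvalidHostnames"] = True
    else:
        # PyMongo 3.x option names, as used in this post.
        options = {"ssl": True, "ssl_ca_certs": ca_file}
        if not verify_hostname:
            # Not recommended: disables protection against MITM attacks.
            options["ssl_match_hostname"] = False
    return options
```

It would be used as pymongo.MongoClient(MONGO_URI, **tls_client_options('path-to-cacert.pem')).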

Testing Failover Behavior

With MongoDB deployments, failovers aren't considered major events as they were with traditional database management systems. Although most MongoDB drivers try to abstract this event, developers should understand and design their applications for such behavior, as applications should expect transient network errors and retry before percolating errors up.

You can test the resilience of your applications by inducing failovers while your workload runs. The easiest way to induce failover is to run the rs.stepDown() command:

RS-example-0:PRIMARY> rs.stepDown()
2019-04-18T19:44:42.257+0530 E QUERY [thread1] Error: error doing query: failed: network error while attempting to run command 'replSetStepDown' on host 'SG-example-1.servers.mongodirector.com:27017' :
DB.prototype.runCommand@src/mongo/shell/db.js:168:1
DB.prototype.adminCommand@src/mongo/shell/db.js:185:1
rs.stepDown@src/mongo/shell/utils.js:1305:12
@(shell):1:1
2019-04-18T19:44:42.261+0530 I NETWORK [thread1] trying reconnect to SG-example-1.servers.mongodirector.com:27017 (X.X.X.X) failed
2019-04-18T19:44:43.267+0530 I NETWORK [thread1] reconnect SG-example-1.servers.mongodirector.com:27017 (X.X.X.X) ok
RS-example-0:SECONDARY>

One of the ways I like to test the behavior of drivers is by writing a simple 'perpetual' writer app. This would be simple code that keeps writing to the database unless interrupted by the user, and would print all exceptions it encounters to help us understand the driver and database behavior. I also keep track of the data it writes to ensure that there's no unreported data loss in the test. Here's the relevant part of test code we will use to test our MongoDB failover behavior:

import logging
import random
import string
import sys
import traceback
from datetime import datetime
from time import sleep

import pymongo
...
logger = logging.getLogger("test")

MONGO_URI = 'mongodb://rwuser:@SG-example-0.servers.mongodirector.com:48273,SG-example-1.servers.mongodirector.com:27017,SG-example-2.servers.mongodirector.com:27017/admin?replicaSet=RS-example-0&ssl=true'

client = None
try:
    logger.info("Attempting to connect...")
    client = pymongo.MongoClient(MONGO_URI, ssl = True, ssl_ca_certs = 'path-to-cacert.pem')
    db = client['test']
    collection = db['test']
    i = 0
    while True:
        try:
            text = ''.join(random.choices(string.ascii_uppercase + string.digits, k = 3))
            doc = { "idx": i, "date" : datetime.utcnow(), "text" : text}
            # Increment before the insert so a failed write leaves a visible gap in idx.
            i += 1
            id = collection.insert_one(doc).inserted_id
            logger.info("Record inserted - id: " + str(id))
            sleep(3)
        except pymongo.errors.ConnectionFailure as e:
            # Network errors and failovers land here; log them and retry on the next iteration.
            logger.error("ConnectionFailure seen: " + str(e))
            traceback.print_exc(file = sys.stdout)
            logger.info("Retrying...")

    logger.info("Done...")
except Exception as e:
    logger.error("Exception seen: " + str(e))
    traceback.print_exc(file = sys.stdout)
finally:
    # client is still None if MongoClient() itself raised.
    if client is not None:
        client.close()

The sort of entries that this writes look like:

RS-example-0:PRIMARY> db.test.find()
{ "_id" : ObjectId("5cb6d6269ece140f18d05438"), "idx" : 0, "date" : ISODate("2019-04-17T07:30:46.533Z"), "text" : "400" }
{ "_id" : ObjectId("5cb6d6299ece140f18d05439"), "idx" : 1, "date" : ISODate("2019-04-17T07:30:49.755Z"), "text" : "X63" }
{ "_id" : ObjectId("5cb6d62c9ece140f18d0543a"), "idx" : 2, "date" : ISODate("2019-04-17T07:30:52.976Z"), "text" : "5BX" }
{ "_id" : ObjectId("5cb6d6329ece140f18d0543c"), "idx" : 4, "date" : ISODate("2019-04-17T07:30:58.001Z"), "text" : "TGQ" }
{ "_id" : ObjectId("5cb6d63f9ece140f18d0543d"), "idx" : 5, "date" : ISODate("2019-04-17T07:31:11.417Z"), "text" : "ZWA" }
{ "_id" : ObjectId("5cb6d6429ece140f18d0543e"), "idx" : 6, "date" : ISODate("2019-04-17T07:31:14.654Z"), "text" : "WSR" }
..
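In the sample output above, idx 3 is absent: the counter is incremented before each insert, so a write that fails leaves a gap. To check for unreported data loss, every gap should line up with an exception in the application log. A minimal sketch of that check follows. The function name missing_indexes is hypothetical, and it assumes idx values start at 0 as in the test code:

```python
def missing_indexes(stored_idx_values):
    """Return the idx values absent from 0..max(stored), in ascending order."""
    present = set(stored_idx_values)
    if not present:
        return []
    return [i for i in range(max(present) + 1) if i not in present]


# idx values taken from the db.test.find() output shown above.
print(missing_indexes([0, 1, 2, 4, 5, 6]))  # prints [3]
```

Against a live collection, the input could be gathered with something like (doc["idx"] for doc in collection.find({}, {"idx": 1})).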

Handling the ConnectionFailure Exception

Notice that we catch the ConnectionFailure exception to deal with all network-related issues we may encounter due to failovers - we print the exception and continue to attempt to write to the database. The driver documentation recommends that:

If an operation fails because of a network error, ConnectionFailure is raised and the client reconnects in the background. Application code should handle this exception (recognizing that the operation failed) and then continue to execute.

Let's run this and do a database failover while it executes. Here's what happens:

04/17/2019 12:49:17 PM INFO Attempting to connect...
04/17/2019 12:49:20 PM INFO Record inserted - id: 5cb6d3789ece145a2408cbc7
04/17/2019 12:49:23 PM INFO Record inserted - id: 5cb6d37b9ece145a2408cbc8
04/17/2019 12:49:27 PM INFO Record inserted - id: 5cb6d37e9ece145a2408cbc9
04/17/2019 12:49:30 PM ERROR PyMongoError seen: connection closed
Traceback (most recent call last):
    id = collection.insert_one(doc).inserted_id
  File "C:\Users\Random\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pymongo\collection.py", line 693, in insert_one
    session=session),
...
  File "C:\Users\Random\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pymongo\network.py", line 173, in receive_message
    _receive_data_on_socket(sock, 16))
  File "C:\Users\Random\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pymongo\network.py", line 238, in _receive_data_on_socket
    raise AutoReconnect("connection closed")
pymongo.errors.AutoReconnect: connection closed
04/17/2019 12:49:30 PM INFO Retrying...
04/17/2019 12:49:42 PM INFO Record inserted - id: 5cb6d3829ece145a2408cbcb
04/17/2019 12:49:45 PM INFO Record inserted - id: 5cb6d3919ece145a2408cbcc
04/17/2019 12:49:49 PM INFO Record inserted - id: 5cb6d3949ece145a2408cbcd
04/17/2019 12:49:52 PM INFO Record inserted - id: 5cb6d3989ece145a2408cbce

Notice that the driver takes about 12 seconds to understand the new topology, connect to the new primary, and continue writing. The exception raised is errors.AutoReconnect, which is a subclass of ConnectionFailure.

You could do a few more runs to see what other exceptions are seen. For example, here's another exception trace I encountered:

    id = collection.insert_one(doc).inserted_id
  File "C:\Users\Random\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pymongo\collection.py", line 693, in insert_one
    session=session),
...
  File "C:\Users\Random\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pymongo\network.py", line 150, in command
    parse_write_concern_error=parse_write_concern_error)
  File "C:\Users\Random\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pymongo\helpers.py", line 132, in _check_command_response
    raise NotMasterError(errmsg, response)
pymongo.errors.NotMasterError: not master

This exception is also a subclass of ConnectionFailure (it derives from AutoReconnect).
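Because both exceptions derive from ConnectionFailure, one handler can cover them. The test writer above simply loops forever. Application code would more likely bound its retries. Here is a minimal sketch of a bounded retry with exponential backoff. The helper call_with_retry and its parameters are hypothetical, not a PyMongo API, and the exception class is passed in so the helper stays driver-agnostic:

```python
import time


def call_with_retry(operation, retry_on, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on retry_on errors, back off and try again.

    Re-raises the last error once `attempts` calls have failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

With PyMongo this might be invoked as call_with_retry(lambda: collection.insert_one(doc), pymongo.errors.ConnectionFailure). Note that a retried insert can be applied twice if the first attempt reached the server before the connection dropped, so this pattern is safest for idempotent operations.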

'retryWrites' Parameter

Another area to test MongoDB failover behavior would be seeing how other parameter variations affect the results. One parameter that is relevant is 'retryWrites':

retryWrites: (boolean) Whether supported write operations executed within this MongoClient will be retried once after a network error on MongoDB 3.6+. Defaults to False.

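Let's see how this parameter works with a failover. (The default quoted above applies to the driver version used in this test; PyMongo 3.9 and later enable retryWrites by default.) The only change made to the code is: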

client = pymongo.MongoClient(MONGO_URI, ssl = True, ssl_ca_certs = 'path-to-cacert.pem', retryWrites = True)

Let's run it now, and then do a database system failover:

04/18/2019 08:49:30 PM INFO Attempting to connect...
04/18/2019 08:49:35 PM INFO Record inserted - id: 5cb895869ece146554010c77
04/18/2019 08:49:38 PM INFO Record inserted - id: 5cb8958a9ece146554010c78
04/18/2019 08:49:41 PM INFO Record inserted - id: 5cb8958d9ece146554010c79
04/18/2019 08:49:44 PM INFO Record inserted - id: 5cb895909ece146554010c7a
04/18/2019 08:49:48 PM INFO Record inserted - id: 5cb895939ece146554010c7b <<< Failover around this time
04/18/2019 08:50:04 PM INFO Record inserted - id: 5cb895979ece146554010c7c
04/18/2019 08:50:07 PM INFO Record inserted - id: 5cb895a79ece146554010c7d
04/18/2019 08:50:10 PM INFO Record inserted - id: 5cb895aa9ece146554010c7e
04/18/2019 08:50:14 PM INFO Record inserted - id: 5cb895ad9ece146554010c7f
...

Notice how the insert after the failover takes about 12 seconds, but goes through successfully as the retryWrites parameter ensures the failed write is retried. Remember that setting this parameter doesn't absolve you from handling the ConnectionFailure exception - you need to worry about reads and other operations whose behavior is not affected by this parameter. It also doesn't completely solve the issue, even for supported operations - sometimes failovers can take longer to complete and retryWrites alone will not be enough.
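To compare runs, it can help to measure the pause from the application log rather than estimate it by eye. The sketch below assumes the timestamp layout seen in the logs above (MM/DD/YYYY hh:mm:ss AM/PM in the first 22 characters) and an English locale for the AM/PM marker. The function name longest_insert_gap is hypothetical:

```python
from datetime import datetime

LOG_TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def longest_insert_gap(log_lines):
    """Return the longest gap, in seconds, between consecutive successful inserts."""
    times = [
        datetime.strptime(line[:22], LOG_TIME_FORMAT)
        for line in log_lines
        if "Record inserted" in line
    ]
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
    return max(gaps, default=0.0)


# Lines copied from the retryWrites run above.
sample = [
    "04/18/2019 08:49:44 PM INFO Record inserted - id: 5cb895909ece146554010c7a",
    "04/18/2019 08:49:48 PM INFO Record inserted - id: 5cb895939ece146554010c7b",
    "04/18/2019 08:50:04 PM INFO Record inserted - id: 5cb895979ece146554010c7c",
    "04/18/2019 08:50:07 PM INFO Record inserted - id: 5cb895a79ece146554010c7d",
]
print(longest_insert_gap(sample))  # prints 16.0
```

The 16-second gap includes the 3-second sleep between writes, which leaves roughly 12 to 13 seconds for the retried insert itself.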

Configuring the Network Timeout Values

rs.stepDown() induces a rather quick failover, as the replica set primary is instructed to become a secondary, and the secondaries hold an election to determine the new primary. In production deployments, network load, partitions, and other such issues delay the detection of an unavailable primary, which prolongs your failover time. You will also often run into PyMongo errors such as errors.ServerSelectionTimeoutError and errors.NetworkTimeout during network issues and failovers.

If this occurs very often, consider tuning the timeout parameters. The related MongoClient timeout parameters are serverSelectionTimeoutMS, connectTimeoutMS, and socketTimeoutMS. Of these, selecting a larger value for serverSelectionTimeoutMS most often helps in dealing with errors during failovers:

serverSelectionTimeoutMS: (integer) Controls how long (in milliseconds) the driver will wait to find an available, appropriate server to carry out a database operation; while it is waiting, multiple server monitoring operations may be carried out, each controlled by connectTimeoutMS. Defaults to 30000 (30 seconds).
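As a rough sketch, the three timeouts can be collected in one place and passed to MongoClient as keyword arguments. The helper timeout_options is hypothetical. Its defaults mirror the PyMongo 3.x driver defaults (30 seconds for server selection, 20 seconds for connect, no socket timeout), and suitable values for a given deployment are an assumption you would need to validate by testing:

```python
def timeout_options(server_selection_s=30.0, connect_s=20.0, socket_s=None):
    """Return MongoClient timeout keyword arguments, converted to milliseconds."""
    options = {
        "serverSelectionTimeoutMS": int(server_selection_s * 1000),
        "connectTimeoutMS": int(connect_s * 1000),
    }
    if socket_s is not None:
        # Leaving this out keeps the driver default of no socket timeout.
        options["socketTimeoutMS"] = int(socket_s * 1000)
    return options
```

For example, pymongo.MongoClient(MONGO_URI, **timeout_options(server_selection_s=60)) would let the driver wait up to a minute for a new primary before raising ServerSelectionTimeoutError.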

Ready to use MongoDB in your Python application? Check out our Getting Started with Python and MongoDB article to see how you can get up and running in just 5 easy steps. ScaleGrid is the only MongoDB DBaaS provider that gives you full SSH access to your instances so you can run your Python server on the same machine as your MongoDB server. Automate your MongoDB cloud deployments on AWS, Azure, or DigitalOcean with dedicated servers, high availability, and disaster recovery so you can focus on developing your Python application.
