Correlation IDs in Django SQL

#django #microservices #sql #python

If you have more than one service in any of your projects, you should have probably heard of Correlation IDs already. You may know it under other names, but I am sure the concept sounds pretty familiar. Anyway, let me repeat it one more time for everyone to be on the same page.

Each good technical project writes all kind of operational logs. When there are several independent services, each service sends logs to some log aggregation system. The more services there are and the higher the load, the harder it becomes to figure out meaningful sequences of actions. We need some simple pattern to help us operate this mess. This pattern is called Distributed Tracing and its core component is Correlation ID.

Correlation ID (or Trace ID) is a unique identifier for a chain of events within a distributed system. When some distributed flow is started, it acquires a unique ID which is later propagated to all of the actors within the flow. As we are going to talk about Django Web Framework, we will assume that our main communication protocol is HTTP(s), which means that correlation ID is passed in some header. If you want to learn more on the topic, please, refer to the trace context specification by W3C.

Anyway, passing Correlation ID between services is great, but when it comes down to real problem solving, too often we find out that we need to map business flows to the exact technical details under the hood. In that case we need to pull the trace ID from headers and pass it to all the underlying layers, like service layer or ORM. For this article our key goal is to propagate this ID to the deepest possible layer — DB queries. We want SQL to contain this ID, as one API call may trigger several SQL queries and we want to correlate them also.

Let’s look which options we have for Django:

Django-Cid Library (seems to have needed features, but is not widely used and tests are a big question)
Django Log Request Id (only logs, no SQL)
Some DB specific things like Django-MySQL

Hm, seems pretty scarce… What a perfect opportunity for writing code (🚲‍)!

We need to agree that we want correlated SQL only in request-response cycle. There are ways to pass correlation IDs to background processes like celery tasks which also use Django ORM, but we will leave this out of scope. Also we want the dumbest possible implementation to illustrate the idea (KISS).

Thus, the best option is to use Django Middleware to wrap all operations in one place. As we want to affect SQL we need to use Django DB instrumentation. Here is how our wrapper will look like:

We are lucky that main SQL operators have the same length (6), so it’s easy to distinguish them from auxiliary things like SET SESSION TRANSACTION. Our correlation id will be placed right after the operator in SQL comment block /* correlation_id */.

⚠️ This code doesn’t seem like a production ready solution. Extend and test it OR use with caution.

Now let’s write the middleware:

Simple as that. In this example we extracted correlation ID from OpenCensus context, but you could use any other library or simply generate uuid on the fly. In the next step we wrapped all our SQL operations with the above DB context wrapper.

OpenCensus is a set of libraries for various languages that allow you to collect application metrics and distributed traces, then transfer the data to a backend of your choice in real time. This data can be analyzed by developers and admins to understand the health of the application and debug problems.

Don’t forget to add this middleware to Django MIDDLEWARE setting (lower than your tracing middleware (!) if you have one). To test how it works, we can set DEBUG level for django logs:

'loggers': {
    'django': {
        'handlers': ['console'],
        'level': 'DEBUG',
    },
}

And we will get the following result, when we go to Django admin:

2020-06-06 07:43:04,756 django.db.backends DEBUG (0.005) SELECT /* 89fc0ff588659981acaa114a743785d8 */ `django_admin_log`.`id`, `django_admin_log`.`action_time`, `django_admin_log`.`user_id`, `django_admin_log`.`content_type_id`, `django_admin_log`.`object_id`, `django_admin_log`.`object_repr`,....

And that’s it! Our SQLs have the same correlation ID that is used across all of our distributed services. Now it will be much easier to find out bugs and debug performance issues.

I guess, this is also a great example on why we ❤ Python and Django. Complex and useful things can be done in 10 minutes and 10 lines of code.

Thank you for you time and here is the link to the whole code snippet. Stay healthy, inspired and 🤙 me if you have any questions.