Why Can Distributed Transactions Be Slow?

#systemdesign #backend #database #microservices

Some problems require updating multiple services, resources, or shards together as a single atomic unit of work. If any of the updates fails, the whole unit must roll back, because a partially applied change can leave the system as a whole in an inconsistent state. Distributed transactions are one viable solution for this scenario. Transactions are often a great choice in a monolith or a single non-sharded resource, but they can carry a serious performance cost when executed within a distributed or sharded system.

Behind the scenes: 2-Phase Commit

It would be impractical to cover every protocol used by every system, so for the sake of simplicity this text focuses on one of the most popular protocols for distributed transactions: the 2-Phase Commit protocol (2PC).

As the name suggests, 2PC carries out a distributed transaction in two steps:

Step 1: Prepare

The orchestrator sends a “prepare” message, along with the instructions in the given transaction, to all participants. Each participant logs these instructions and responds with an acknowledgement if it is ready to execute them.

Step 2: Commit

Once the orchestrator has received acknowledgements from all participants, it sends a “commit” message to each of them, asking them to change their state as per the instructions in the given transaction. However, if any participant fails to acknowledge the “prepare” message, the orchestrator tells all of the remaining participants to roll back.

In addition, participants hold locks on the affected resources while the transaction is in progress to ensure serializability.
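The two steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production implementation: the `Participant` class, its `can_prepare` flag, and the `two_phase_commit` function are hypothetical names introduced here, and a real participant would write its log to durable storage and talk to the orchestrator over the network.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical in-memory participant used only for illustration."""
    name: str
    can_prepare: bool = True      # set to False to simulate a failed prepare
    state: str = "idle"           # idle -> prepared -> committed / rolled_back
    log: list = field(default_factory=list)
    locked: bool = False

    def prepare(self, instructions: str) -> bool:
        if not self.can_prepare:
            return False          # no acknowledgement is sent
        self.log.append(instructions)  # record the instructions first
        self.locked = True        # lock is held until the outcome is known
        self.state = "prepared"
        return True

    def commit(self) -> None:
        self.state = "committed"
        self.locked = False

    def rollback(self) -> None:
        self.state = "rolled_back"
        self.locked = False


def two_phase_commit(participants, instructions: str) -> bool:
    # Step 1: ask every participant to prepare and collect acknowledgements.
    acks = [p.prepare(instructions) for p in participants]
    if all(acks):
        # Step 2: everyone acknowledged, so tell everyone to commit.
        for p in participants:
            p.commit()
        return True
    # At least one acknowledgement is missing, so roll everyone back.
    for p in participants:
        p.rollback()
    return False
```

Calling `two_phase_commit([Participant("a"), Participant("b", can_prepare=False)], "debit 10")` returns `False` and leaves both participants rolled back, which mirrors the rollback path described above.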

Fig. 2PC: Successful commit

Fig. 2PC: Rollback (unsuccessful commit)

Slow? Where?

By now, anyone familiar with distributed systems will have a fair idea of the bottlenecks that can slow down distributed transactions.

Resource Failure

In 2PC, the transaction succeeds only if every participant acknowledges the prepare phase. If any participant or resource crashes or loses connectivity, the orchestrator waits until a timeout expires, which adds latency. It then triggers a rollback, after which the transaction has to be restarted.
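To make the cost of an unresponsive participant concrete, here is a small simulation using Python's `asyncio`. It is only a sketch under stated assumptions: `participant_prepare`, `prepare_phase`, and `run_prepare` are hypothetical helpers, and the network call is replaced by a sleep whose length stands in for the participant's response time.

```python
import asyncio


async def participant_prepare(delay_s: float) -> bool:
    # Stand-in for a network call; delay_s models how long the reply takes.
    await asyncio.sleep(delay_s)
    return True


async def prepare_phase(delays_s, timeout_s: float) -> bool:
    """Return True only if every participant acknowledges within timeout_s."""
    async def ask(delay_s: float) -> bool:
        try:
            return await asyncio.wait_for(participant_prepare(delay_s), timeout_s)
        except asyncio.TimeoutError:
            return False          # treated as a missing acknowledgement

    acks = await asyncio.gather(*(ask(d) for d in delays_s))
    return all(acks)


def run_prepare(delays_s, timeout_s: float):
    """Run the prepare phase and return (all_acknowledged, elapsed_seconds)."""
    async def timed():
        loop = asyncio.get_running_loop()
        start = loop.time()
        ok = await prepare_phase(delays_s, timeout_s)
        return ok, loop.time() - start

    return asyncio.run(timed())
```

With `run_prepare([0.01, 5.0], timeout_s=0.1)`, the healthy participant answers almost immediately, yet the phase still takes roughly the full timeout before it reports failure. That waiting time is paid before the rollback and the retry even begin.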

Network Latency

Each phase of 2PC requires network round trips, plus any retries needed to reach unresponsive participants, so any increase in network latency can significantly delay the transaction or even cause it to time out. Adding participants to the system also increases latency, because the orchestrator must wait for every participant to acknowledge the prepare phase.
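A rough back-of-the-envelope model shows why the slowest participant dominates. The function below is a deliberately simplified estimate, assuming the orchestrator sends each phase's messages to all participants in parallel, waits for every reply in both phases, and never retries; `two_pc_latency_ms` is a hypothetical name, and logging and lock acquisition time are ignored.

```python
def two_pc_latency_ms(round_trip_ms):
    """Estimate 2PC latency from per-participant round-trip times (in ms)."""
    slowest = max(round_trip_ms)  # each phase waits for the slowest reply
    return 2 * slowest            # one round trip for prepare, one for commit


fast_cluster = two_pc_latency_ms([5, 8, 40])        # three participants
with_far_node = two_pc_latency_ms([5, 8, 40, 120])  # add a distant participant
```

Under these assumptions the first call gives 80 ms and the second 240 ms: a single far-away participant triples the latency of the whole transaction, even though the other participants are unchanged.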

Orchestrator Failure

If the orchestrator fails before notifying the participants about the commit, the participants may keep waiting (until a timeout) for it to recover. Because they keep holding their locks in the meantime, this can block the rest of the system from proceeding.
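The blocking effect can be illustrated with a toy lock table. This is a sketch, not how any particular database implements locking: `LockTable`, the key `"account:42"`, and the transaction names are all made up for the example.

```python
class LockTable:
    """Hypothetical lock table: maps a resource key to the owning transaction."""

    def __init__(self):
        self.owner = {}

    def acquire(self, key: str, txn: str) -> bool:
        if self.owner.get(key, txn) != txn:
            return False          # another transaction holds the lock
        self.owner[key] = txn
        return True

    def release_all(self, txn: str) -> None:
        self.owner = {k: t for k, t in self.owner.items() if t != txn}


locks = LockTable()

# txn-1 has prepared: the participant acknowledged and holds the lock.
held = locks.acquire("account:42", "txn-1")

# The orchestrator crashes here, so no commit or rollback message arrives.
# A second transaction that needs the same resource cannot proceed.
blocked = not locks.acquire("account:42", "txn-2")

# Only when a decision arrives (or a timeout fires) is the lock released.
locks.release_all("txn-1")
unblocked = locks.acquire("account:42", "txn-2")
```

Here `blocked` is `True` for as long as `txn-1` stays in the prepared state, and `txn-2` gets the lock only after `release_all` runs. In a real system the gap between those two moments is however long the orchestrator takes to recover or the timeout takes to expire.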

This article was originally published in The Scalable Thread newsletter. If you enjoyed reading it, please consider subscribing to the newsletter for free to support my work and to receive the next post directly in your inbox.
