ChunTing Wu

Posted on Mar 14, 2022

Design Distributed Transaction with Practical Examples

#tutorial #distributedsystems #architecture #productivity

Last time, we discussed how to prepare a design review like an expert. Three items should be prepared:

C4 model
User stories and use cases
Design decisions

In this article, I will use a practical example to show what a design review looks like. Discussions that require too much detail are skipped, so only the critical design points are shown.

User Stories

First, we define the user story. As the previous article mentioned, the user story aligns with the context level of the C4 model, so we write the user stories down clearly.

This time we are going to build a gift-giving feature. The whole story is as follows.

A user can decide how many of the same gift they want to give to others at a time.
As long as the user has enough money, there is no limit on the number of receivers.
After the gifts have been given, the system must inform both the sender and the receivers that everything has been done.

Use cases

The user story describes the gift-giving scenario clearly, but some details are not sufficiently covered. For example,

If the user's balance is not enough, all gifts fail; partial success is not allowed.
Gift-giving is finished only when all receivers have received the gift.
The notification must contain the sender, the total amount of gifts, and all receivers.
No matter how many receivers there are, the entire process should finish within a few minutes.

As we can see, the use cases add details that were not in the story and present the whole scenario more precisely.

C4 Model

Now, we can draw the C4 model of the entire design.

Context

Based on the user story, we draw a context to describe the interaction between the user and the system. From the context, we know that there are several key points that must be fully discussed.

The line that contains the gift
Behavior of the Server itself
The line containing the notify

You may ask about the notification service and the receivers. The answer is simple: the notification service is a third-party service, and we cannot control its behavior. Therefore, there is no need to discuss it in a design review about giving gifts. Of course, if there is any concern about notifications, we can hold another design review to dig in.

Container

Once we have the context, we start to dive into the three points listed above and expand them in the form of a container.

Finally, we get this diagram, which shows the finished container view. There are many design decisions here, but in this section I will only explain what the diagram means. The in-depth analysis of the design decisions is left to the later sections.

Users send gift requests, including what to send and whom to send it to.
After receiving the request, the server first debits the user's wallet and immediately responds with a nak if the balance is not enough. Otherwise, it divides the request into batches of a fixed size and writes the batch information to the database. After that, the batch tasks are sent to the workers for asynchronous execution, together with the serial number of the transaction.
Although the process is not complete, the preparations have been done, so the server replies to the user with an ack to show that the whole process is in progress.
When a worker receives the command, it focuses on the batch task it is assigned, and when it finishes, it decrements the counter in the database. If the worker encounters any error, it simply retries the task.
When the counter reaches 0 after a decrement, every batch has finished, so the last worker notifies all the receivers.
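
The flow above can be sketched in a few lines of Python. This is only an illustrative, in-memory model, not the actual implementation: the class name GiftSystem, the method names, the batch size of 10, and the use of plain dictionaries and lists in place of the database and the task queue are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class GiftSystem:
    """In-memory stand-in for the server, its database, and the task queue."""
    balances: dict                                    # sender -> wallet balance
    batch_size: int = 10                              # fixed batch size (assumed value)
    transactions: dict = field(default_factory=dict)  # serial number -> bookkeeping
    queue: list = field(default_factory=list)         # pending (serial, batch) tasks
    notifications: list = field(default_factory=list)

    def send_gifts(self, sender, receivers, price):
        """Server: debit first, then split into batches and enqueue them."""
        total = price * len(receivers)
        if self.balances.get(sender, 0) < total:
            return "nak"                              # balance too low, nothing is sent
        self.balances[sender] -= total
        serial = len(self.transactions) + 1
        batches = [receivers[i:i + self.batch_size]
                   for i in range(0, len(receivers), self.batch_size)]
        self.transactions[serial] = {
            "sender": sender, "receivers": receivers,
            "total": total, "remaining": len(batches),
        }
        self.queue.extend((serial, batch) for batch in batches)
        return "ack"                                  # accepted, still in progress

    def run_one_task(self):
        """Worker: handle one batch, decrement the counter, notify at zero."""
        serial, batch = self.queue.pop(0)
        # Delivering the gift to each receiver in `batch` would happen here.
        txn = self.transactions[serial]
        txn["remaining"] -= 1
        if txn["remaining"] == 0:                     # this was the last batch
            self.notifications.append(
                (txn["sender"], txn["total"], txn["receivers"]))
```

In this sketch, a request for 25 receivers produces three queued batches, and the notification is recorded only after the third call to run_one_task. A real system would need the counter decrement to be atomic in the database, which the sketch does not model.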

Component and Code

After the container level, the next levels are component and code, but these already involve implementation details, and each system faces different problems, so I won't go into more detail here.

Design Decisions

We found many details in the container diagram. As in the previous article, we will use the "why do A instead of B" formula to ask a lot of questions.

Why do the sender and the server communicate in a semi-asynchronous way (between synchronous and asynchronous)?
Why is the server divided into an orchestration and workers?
Why is the communication between the orchestration and the workers completely asynchronous?
Why is the worker the one that notifies the receivers?
You may be able to think of some questions that I have not listed.

To answer these questions, we have to start from the original architecture.

At the beginning, we already had the feature of giving a gift to one person. So the simplest approach is a for-loop that sends the one-to-one gift to every receiver, regardless of how many receivers there are.

This is fine when there are only a few receivers, but as the number of receivers increases, performance becomes a serious challenge. In our measurements, delivering a gift to one person takes about 100-200 ms, not including notifications, which means that with only 10 receivers the request already takes on the order of seconds. This is obviously unacceptable.
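
The arithmetic behind this estimate can be written out as a small calculation. The function name sequential_latency_ms is made up for illustration; the 100-200 ms range is the measurement quoted above.

```python
def sequential_latency_ms(receivers, per_gift_ms):
    """Total time for a plain for-loop that delivers one gift at a time."""
    return receivers * per_gift_ms


for count in (1, 10, 1000):
    low = sequential_latency_ms(count, 100)
    high = sequential_latency_ms(count, 200)
    print(f"{count} receivers: {low / 1000:.1f}-{high / 1000:.1f} s")
```

With 10 receivers the loop already needs 1-2 seconds, and the cost grows linearly with the number of receivers.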

It seems that batching is inevitable, so someone drew up the first architecture diagram.

From the diagram, we can see that the orchestration has already appeared. However, the communication with the workers is still synchronous, and the notification is sent only when all the tasks are done. The server does not reply to the user until all the tasks are finished. This may seem to reduce the performance bottleneck significantly, but in fact it does not improve things at all.

Let's do some simple math. Suppose we want to send gifts to 1000 people. How do we set the batch size and the number of workers?

To finish within seconds, the maximum batch size is 10 (at 100-200 ms per gift, a batch of 10 takes 1-2 seconds), so 100 workers must be spawned at the same time to handle one gift request. This is a very demanding requirement for the system, and spawning 100 workers in an instant is not an easy task. Because of this, it is difficult to keep users waiting in a fully synchronous manner. So, what happens when it is completely asynchronous?
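
The same calculation as a short sketch, assuming every batch runs on its own worker in parallel. The function names are hypothetical; the numbers are the ones used in the text.

```python
import math


def workers_needed(receivers, batch_size):
    """Number of parallel workers when each worker handles exactly one batch."""
    return math.ceil(receivers / batch_size)


def batch_latency_ms(batch_size, per_gift_ms):
    """Time one worker needs to deliver every gift in its batch sequentially."""
    return batch_size * per_gift_ms


print(workers_needed(1000, 10))   # 100 workers for 1000 receivers
print(batch_latency_ms(10, 200))  # 2000 ms in the slow case
```

Raising the batch size lowers the number of workers but pushes the per-batch latency beyond a couple of seconds, which is the trade-off that makes the fully synchronous design hard to sustain.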

In the second attempt, we changed the orchestration to choreography, so that the user gets a response very quickly and the gifts are sent smoothly. But is it really so?

What happens if a worker in the middle of the chain fails? The whole chain is broken. The receivers may not notice, because no notification arrives, and that is tolerable. For the giver, however, the workers that succeeded before the failure have already deducted the user's money, yet no notification is sent. Moreover, going back to the first point of the use cases, partial success is not allowed.
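
The failure mode can be made concrete with a toy simulation. This is a deliberately simplified model, not the real system: the function run_chain and its fail_at parameter are invented for illustration, and each loop iteration stands in for one worker that debits its own batch and then triggers the next worker.

```python
def run_chain(balance, batches, price, fail_at=None):
    """Simulate a choreographed chain; returns (balance, delivered, notified)."""
    delivered = []
    for index, batch in enumerate(batches):
        if index == fail_at:
            # The chain stops here: later batches never run, nobody is notified.
            return balance, delivered, False
        balance -= price * len(batch)   # this worker deducts its own share
        delivered.extend(batch)
    return balance, delivered, True     # the last worker sends the notification


batches = [["a", "b"], ["c", "d"], ["e", "f"]]
print(run_chain(100, batches, 10, fail_at=1))
```

When the second worker fails, the money for the first batch is already gone, two of six receivers have their gift, and no notification is sent. That is exactly the partial success the use cases forbid.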

Compared to orchestration, choreography does offer better performance and scalability, but its workflow control is more complex. Therefore, in this use case, orchestration is more appropriate.

So let's make the whole flow fully asynchronous through orchestration.

This architecture encounters two problems immediately.

Who should send the notification?
What should we do if the money runs out halfway?

The problem of sending notifications is easy to solve; the container view in the C4 model above already answers who sends the notifications. Nevertheless, in a completely asynchronous architecture it is basically impossible to handle an error that occurs halfway.

For this reason, we finally adopted a semi-asynchronous approach. First, the orchestration determines whether the balance is enough to send all the gifts, and to avoid a race condition, it deducts the money immediately. The worker then only needs to handle sending the gift; it neither deducts money from the giver nor checks the balance.
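
The point of deducting immediately is that the balance check and the debit happen as one atomic step. Below is a minimal sketch of that idea using an in-process lock; the Wallet class and try_debit method are hypothetical names, and a real system would more likely rely on the database (for example a conditional update or a transaction) rather than a Python lock.

```python
import threading


class Wallet:
    """Toy wallet whose check-and-debit is atomic within one process."""

    def __init__(self, balance):
        self._balance = balance
        self._lock = threading.Lock()

    def try_debit(self, amount):
        """Debit only if the balance covers the amount; never go negative."""
        with self._lock:                 # check and debit under the same lock
            if self._balance < amount:
                return False             # the caller replies with a nak
            self._balance -= amount
            return True                  # the caller can dispatch the batches

    @property
    def balance(self):
        with self._lock:
            return self._balance
```

If the check and the debit were two separate steps, two concurrent gift requests could both pass the check and together overdraw the wallet. With the combined step, at most as many requests succeed as the balance can actually pay for.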

Error Handling and Disaster Recovery

Under this architecture, there is still one big problem.

What if a worker fails?

If the failure is caused by database congestion, retrying a few times should be enough. If, however, the cause is a bug in the implementation, the task cannot succeed no matter how many times it is retried.

As a result, an additional monitoring system is needed to periodically check which asynchronous tasks have failed, retry those that can be retried, and request human intervention for those that cannot recover through retries. The applicable solution was introduced in a previous article, so I won't explain much here.
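
As a rough illustration of what such a monitor does on each pass, here is a minimal sketch. It is not the solution from the earlier article; the Task fields, the sweep function, the status strings, and the retry limit of 3 are all assumptions chosen for the example.

```python
from dataclasses import dataclass

MAX_RETRIES = 3  # assumed limit before a human is asked to step in


@dataclass
class Task:
    task_id: int
    status: str = "failed"   # "failed", "done", or "needs_human"
    attempts: int = 0        # retries made by the monitor so far


def sweep(tasks, run, alert, max_retries=MAX_RETRIES):
    """One monitoring pass: retry failed tasks, escalate the hopeless ones."""
    for task in tasks:
        if task.status != "failed":
            continue
        if task.attempts >= max_retries:
            task.status = "needs_human"  # retries did not help, likely a bug
            alert(task)
            continue
        task.attempts += 1
        try:
            run(task)                    # re-execute the batch task
            task.status = "done"
        except Exception:
            task.status = "failed"       # stays failed until the next pass
```

A scheduler would call sweep periodically. Transient failures such as database congestion clear up after a retry or two, while a task that keeps failing is handed to a person after the limit is reached. The batch task itself has to be safe to run more than once for this to work.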

Conclusion

To sum up, the system breaks down gift-giving into several steps.

The user sends the request synchronously, and the balance is deducted in advance.
All gift-giving work is performed asynchronously.
The gift-giving process is eventually consistent: in the end, the money deducted equals the money issued.

In this article, we used a practical example to discuss the challenges of designing a distributed system.

Synchronous vs. Asynchronous
Orchestration vs. Choreography
Atomic vs. Eventual consistency

In fact, these items are also the trade-offs in a typical distributed transaction, and they can significantly affect the overall distributed transaction model. I introduced distributed transactions in a previous article. Next time, I'll take a closer look at the challenges and issues faced when designing a distributed system.

DEV Community