DEV Community

Stephany Henrique A


Setting a timeout in distributed services is a good practice to avoid side effects

Even if you work with a monolithic system, you probably rely on some distributed service that helps it with certain tasks. My tip: take care of that integration, because at some point it will hurt your system. In this article, I will explain why.

Last week, my team and I had a problem with an integration between our system and another one. Before describing how I discovered the problem, I will explain what our system does.

The system I work on manages the client's documents. The client inserts documents, and at any time they can be searched or signed through a workflow built into the application. Basically, this is the core of the system, plus some more features 😃.

When a document is inserted, it needs to be stored somewhere, and for this we use min.io. With min.io, the application can store files anywhere: Google, Azure, AWS, a local disk, and others, without me writing vendor-specific code for each one. We have been using min.io for about a year, and I appreciate it because I do not waste time building something min.io already implements; I can focus on the application core.

But this week the client reported a problem viewing documents: it was very slow. My first thought was to check the min.io server, and to my surprise, it was working. So what was the problem? To analyze it better, I opened Microsoft's Application Insights monitor, and there I saw that the connection to min.io was slow.

Usually, the min.io connection responds in about 50 ms. I know this because Application Insights shows it. But on that day, it was taking 10 minutes to respond. What the fuck? In short, the problem was somewhere between my system and min.io. Was it the network? I think so.

Without a monitoring system, you would have a hard time discovering a problem like this.

[Image: example of a long-running min.io connection in Application Insights]

I started to look at all the connections to min.io and saw many long ones. The image above shows one example, and I saw many like it that day. My conclusion was that min.io was not the problem. The problem was the network, since everything was working on the min.io server itself.

What problem did that cause? My system was waiting for a response from min.io, so each request waited for many minutes. Every web server has a finite number of threads to process user requests, and with this problem many of those threads were stuck on min.io connections. The side effect was that my system could not serve other users. Do you see the problem?
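To make that side effect concrete, here is a minimal sketch, not the real web server: a SemaphoreSlim stands in for the finite pool of request threads, and the class and method names are hypothetical. Once every "worker" is held by a slow call, a new request cannot be served.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ThreadStarvationDemo
{
    // Models a server that can process at most `workers` requests at once.
    public static async Task<bool> FastRequestIsServedAsync(int workers, int slowRequests)
    {
        using var pool = new SemaphoreSlim(workers);

        // Each slow request grabs a worker and never gives it back,
        // like a call stuck waiting on the storage server.
        for (var i = 0; i < slowRequests; i++)
        {
            if (!await pool.WaitAsync(TimeSpan.Zero)) break;
        }

        // A new user request is only served if a worker is still free.
        return await pool.WaitAsync(TimeSpan.FromMilliseconds(100));
    }

    public static async Task Main()
    {
        Console.WriteLine(await FastRequestIsServedAsync(workers: 4, slowRequests: 2)); // True
        Console.WriteLine(await FastRequestIsServedAsync(workers: 4, slowRequests: 4)); // False
    }
}
```

With four workers and four stuck calls, the fifth request waits and gives up, which is what the other users of the system experienced.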

If you read the Fallacies of Distributed Computing, you will see that the network is not reliable, so we must guard against the problems that can happen. But how? Let's think together: I own neither min.io nor the network, but I do own my system, so we can work on making our system "smarter".

Set a timeout

You already know that any system will fail or get slow at some point, so we must be prepared for it. We can use the Timeout pattern to cut the connection and help the system fail "gracefully".

There are timeout tools for many languages. I work with .NET, so I can use Polly, http://www.thepollyproject.org/. My tip is: do not reinvent the wheel. Never try to create your own tool when one already exists. Repeat with me: "I will not create my own tool".

How can I use Polly?

First, install Polly in your project. Then create a "policy", as in the code below:

// Give up waiting after 10 seconds (requires: using Polly; using Polly.Timeout;)
var timeOutPolicy = Policy.TimeoutAsync(10, TimeoutStrategy.Pessimistic);

Now we execute our code wrapped in the timeout policy.

await timeOutPolicy.ExecuteAsync(() =>
    _clienteMinIo.GetObjectAsync(_nomeDoBucketNoMinIo,
        $"{_nomeDaPastaNoMinIo}/{nomeDoArquivoComCaminho}",
        arquivoRetornado => ObterDadosDo(arquivoRetornado)));

Note that I created a policy with a 10-second timeout. If our code, "GetObjectAsync", has not finished within 10 seconds, the policy kicks in, stops waiting for the task, and throws a TimeoutRejectedException.
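The caller then has to decide what to do when the timeout fires. The following is a minimal sketch, assuming Polly 7.x; the class and method names are hypothetical, and Task.Delay stands in for the slow min.io call. It catches the exception Polly raises and turns it into a result the application can handle.

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.Timeout;

public static class TimeoutDemo
{
    // Runs `call` under a pessimistic timeout and reports what happened.
    public static async Task<string> CallWithTimeoutAsync(Func<Task> call, int seconds)
    {
        var policy = Policy.TimeoutAsync(seconds, TimeoutStrategy.Pessimistic);
        try
        {
            await policy.ExecuteAsync(call);
            return "completed";
        }
        catch (TimeoutRejectedException)
        {
            // Polly stopped waiting; fail fast instead of holding the request open.
            return "timed out";
        }
    }

    public static async Task Main()
    {
        // Task.Delay stands in for a slow storage call.
        Console.WriteLine(await CallWithTimeoutAsync(() => Task.Delay(TimeSpan.FromSeconds(5)), 1)); // timed out
        Console.WriteLine(await CallWithTimeoutAsync(() => Task.CompletedTask, 1)); // completed
    }
}
```

In a web application, the "timed out" branch is where you would return an error to the user or fall back to another source.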

Is my job done? No, there is still a problem. The min.io call runs asynchronously, so even after the policy stops waiting, the connection to the min.io server stays open. So I wrote the code below to cancel the connection using a CancellationToken. With a Polly timeout, we can pass an "onTimeoutAsync" delegate that runs some code when the timeout occurs.

// The token is handed to min.io so the pending request can be cancelled.
var cancellationTokenSource = new CancellationTokenSource();
var cancellationToken = cancellationTokenSource.Token;
var timeOutPolicy = Policy.TimeoutAsync(10, TimeoutStrategy.Pessimistic,
    async (context, time, task) =>
    {
        // Runs when the 10-second timeout fires: cancel the min.io connection...
        await Task.Run(() => cancellationTokenSource.Cancel());
        // ...and report the failure to the caller.
        throw new Exception("Min.io is not responding");
    });
await timeOutPolicy.ExecuteAsync(() =>
    _clienteMinIo.GetObjectAsync(_nomeDoBucketNoMinIo,
        $"{_nomeDaPastaNoMinIo}/{nomeDoArquivoComCaminho}",
        arquivoRetornado => ObterDadosDo(arquivoRetornado), null, cancellationToken));

The CancellationToken type is part of .NET, and it is the normal way to cancel async methods. With the code above, the connection is canceled when I call "cancellationTokenSource.Cancel()".
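As an alternative to wiring the CancellationTokenSource by hand, Polly also offers TimeoutStrategy.Optimistic, where the policy itself hands a token to the delegate and cancels it on timeout. This is a minimal sketch of that variation, assuming Polly 7.x, and not the code from my project: SlowDownloadAsync is a hypothetical stand-in for any call that accepts a CancellationToken, such as GetObjectAsync.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.Timeout;

public static class OptimisticTimeoutDemo
{
    // Hypothetical stand-in for a storage call that honors a CancellationToken.
    public static Task SlowDownloadAsync(CancellationToken token) =>
        Task.Delay(TimeSpan.FromSeconds(30), token);

    // Returns true if Polly cancelled the token it gave to the delegate.
    public static async Task<bool> RunAsync(int timeoutSeconds)
    {
        var tokenWasCancelled = false;
        var policy = Policy.TimeoutAsync(timeoutSeconds, TimeoutStrategy.Optimistic);
        try
        {
            await policy.ExecuteAsync(async token =>
            {
                // Observe the cancellation that Polly triggers on timeout.
                using var registration = token.Register(() => tokenWasCancelled = true);
                await SlowDownloadAsync(token);
            }, CancellationToken.None);
        }
        catch (TimeoutRejectedException)
        {
            // Expected: the call exceeded the timeout and was cancelled.
        }
        return tokenWasCancelled;
    }

    public static async Task Main()
    {
        Console.WriteLine(await RunAsync(1)); // True
    }
}
```

The optimistic strategy only works when the wrapped call really honors the token. If it ignores cancellation, the pessimistic strategy is the one that still lets the caller walk away.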

How did I test my code on my machine?

On my machine, min.io works perfectly, but I wanted to test what happens with a long-running connection or a bandwidth problem. For this, I used https://github.com/Shopify/toxiproxy, a nice tool that helped me simulate a connection problem with min.io. With Toxiproxy, you can test your application under many chaos scenarios.
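A setup like this can be scripted through Toxiproxy's HTTP API. The sketch below is not the exact setup I used, and every address and name in it is an assumption: the Toxiproxy API on its default port 8474, min.io on 127.0.0.1:9000, and a proxy called "minio" listening on 127.0.0.1:19000. It creates the proxy and adds a "latency" toxic so that every response is delayed.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class ToxiproxySetup
{
    // Assumed address of the Toxiproxy HTTP API (8474 is its default port).
    private const string ToxiproxyApi = "http://localhost:8474";

    // Proxy definition: the application talks to `listen`, Toxiproxy forwards to `upstream`.
    public static string ProxyPayload() =>
        JsonSerializer.Serialize(new { name = "minio", listen = "127.0.0.1:19000", upstream = "127.0.0.1:9000" });

    // A "latency" toxic delays the data passing through the proxy by `latencyMs` milliseconds.
    public static string LatencyToxicPayload(int latencyMs) =>
        JsonSerializer.Serialize(new { type = "latency", attributes = new { latency = latencyMs } });

    public static async Task AddLatencyAsync(int latencyMs)
    {
        using var http = new HttpClient();

        var proxy = new StringContent(ProxyPayload(), Encoding.UTF8, "application/json");
        (await http.PostAsync($"{ToxiproxyApi}/proxies", proxy)).EnsureSuccessStatusCode();

        var toxic = new StringContent(LatencyToxicPayload(latencyMs), Encoding.UTF8, "application/json");
        (await http.PostAsync($"{ToxiproxyApi}/proxies/minio/toxics", toxic)).EnsureSuccessStatusCode();
    }
}
```

After calling AddLatencyAsync, point the application's min.io endpoint at 127.0.0.1:19000 instead of the real server, and every request goes through the slowed-down proxy.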

[Image: a document upload on my machine taking 15 minutes]

In this image, you can see a document upload on my machine that took 15 minutes. I ran this test before adding Polly to my project.

Does Min.io still have a network problem?

As I said before, I do not own the network, so I opened a ticket with the team that takes care of it to investigate the problem. But now my application no longer hangs on slow connections, and the threads stay free.

I hope I helped you with a problem that happened to me this week. See you later!

Top comments (1)

Illia Ratkevych

Wow, thanks for sharing. I already use Polly, but mostly for retries. I had similar issues with long requests, but in that case we faced ThreadPool starvation, since threads could not be released when a task took more than a couple dozen seconds.