DEV Community

loading...
Cover image for Adding Fault Tolerance to your Quarkus Microservice with SmallRye

Adding Fault Tolerance to your Quarkus Microservice with SmallRye

Anca
Looking for interesting reads and giving writing a try. I like to demo and try out stuff.
Updated on ・10 min read

Introduction: Initial problem...

As we're gravitating more and more towards microservice architectures we notice that they tend to be inherently unreliable, and depending on the use case, even small outages could potentially make users impatient or determine them to seek better service somewhere else.
Although microservices bring their advantages to the game, they tend to rely a lot on each other and have interdependencies, so the issue of one of them potentially plummeting needs to be addressed and handled.

... and solution

It is becoming increasingly important that we build fault tolerant microservices, but we would also like tools that can make dealing with unreliability a little easier, all while allowing us to focus on the business logic, rather than ending up with stressful, intricate if-then-else statements.
This is where SmallRye Fault Tolerance comes into play by providing a standardized set of common failure-handling patterns via annotations that can make code readable and maintainable.
SmallRye Fault Tolerance is an implementation of Eclipse MicroProfile Fault Tolerance. It was originally based on Hystrix, the Netflix library for latency and fault tolerance in distributed systems, which heavily promoted the concepts of fault tolerance and isolation for microservices.
Hystrix has long been a very popular library in the industry, but the community activity has been declining and recently it came to a stop by announcing it would stop maintenance.
SmallRye Fault Tolerance is an easy-to-use extension that provides us with retry policies, timeouts, and circuit breakers that dictate whether and when executions should take place. It also includes fallbacks, which offer an alternative implementation when an execution does not complete successfully.

Let's have a look at it

What I'm using

  • Java 11
  • Apache Maven 3.8.1
  • IntelliJ IDEA
  • macOS Catalina

We'll simulate a little business logic in a delivery/logistics company that is dealing with managing and tracking transports. This is an overly simplified backend example, while also keeping itself realistic. The reason I chose a more complex model, rather than just a basic entity to pass around, is that I would like it to reflect the day-to-day image of an enterprise project.
I've talked about spinning up a Quarkus application here. You can add your extensions wherever you decide to create your project, be it directly on quarkus.io, in your IDE or via Maven command, like so:

mvn io.quarkus:quarkus-maven-plugin:1.13.4.Final:create \
                                       -DprojectGroupId=org.tinyg \
                                       -DprojectArtifactId=microprofile-fault-tolerance-quickstart \
                                       -DclassName="org.tinyg.TransportRest" \
                                       -Dpath="/transports" \
                                       -Dextensions="resteasy,smallrye-fault-tolerance,resteasy-jackson"
Enter fullscreen mode Exit fullscreen mode

In an already existing Quarkus project you can just add the extension by running the following in your base directory:

./mvnw quarkus:add-extension -Dextensions="smallrye-fault-tolerance"
Enter fullscreen mode Exit fullscreen mode

The full project can be found here, in the "initial" folder there will be a clean setup that needs to be enhanced and in the "final" folder we'll see the end version, with the desired improvements.
Let's have a quick look at the classes. For readability reasons I decided to make these sections collapsible so that they can be extended as needed.
Let's run the development server and see if it works:

./mvnw compile quarkus:dev
Enter fullscreen mode Exit fullscreen mode

Model
import lombok.Getter;
import lombok.Setter;

import java.time.LocalDate;
import java.util.HashSet;
import java.util.Set;

@Setter
@Getter
public class Transport {

    private Long id;

    private String trackingId;
    private Set<Container> containers;

    private LocalDate departureDate;
    private LocalDate arrivalDate;

    public Transport(final Long id, final String trackingId, final LocalDate departureDate, final LocalDate arrivalDate) {
        this.containers = new HashSet<>();
        this.id = id;
        this.trackingId = trackingId;
        this.departureDate = departureDate;
        this.arrivalDate = arrivalDate;
    }

    public void addContainer(final Container container) {
        containers.add(container);
    }

}
Enter fullscreen mode Exit fullscreen mode
import lombok.Getter;
import lombok.Setter;

import java.util.ArrayList;
import java.util.List;

@Setter
@Getter
public class Container {

    private Long id;

    private Transport transport;
    private List<ContainerLoad> containerLoad = new ArrayList<>();
    private ContainerType type;
    private String serialNumber;

    public Container(Long id, ContainerType type, String serialNumber) {
        this.id = id;
        this.type = type;
        this.serialNumber = serialNumber;
    }

    public void addContainerLoad(final ContainerLoad containerLoad) {
        if (ContainerType.PARTIAL_LOAD.equals(this.getType()) || this.containerLoad.isEmpty()) {
            this.containerLoad.add(containerLoad);
        }
    }
}
Enter fullscreen mode Exit fullscreen mode
import lombok.Getter;
import lombok.Setter;

@Setter
@Getter
public class ContainerLoad {

    private Long id;

    private Container container;
    private Double weight;
    private Double height;
    private Double width;

    public ContainerLoad(Long id, Double weight, Double height, Double width) {
        this.id = id;
        this.weight = weight;
        this.height = height;
        this.width = width;
    }
}
Enter fullscreen mode Exit fullscreen mode
public enum ContainerType {
    FULL_LOAD,
    PARTIAL_LOAD;
}
Enter fullscreen mode Exit fullscreen mode

Service

import org.jboss.logging.Logger;
import org.tinyg.dao.TransportRepo;
import org.tinyg.exception.KnownProcessingException;
import org.tinyg.model.Transport;

import javax.enterprise.context.ApplicationScoped;
import javax.xml.ws.http.HTTPException;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;

@ApplicationScoped
public class TransportService {

    private static final Logger LOGGER = Logger.getLogger(TransportService.class);
   private static final Random RANDOM = new Random();

    private TransportRepo transports;

    private AtomicLong counter = new AtomicLong(0);

    public TransportService() {
        transports = new TransportRepo();
    }

    public List<Transport> getAllTransports() {
        extraValidationOrProcessing();
        LOGGER.info("Information fetched successfully.");
        return new ArrayList<>(transports.values());
    }

    public List<String> getValidTrackingIds() {
        possibleFail();
        return transports.values().stream().filter(transport ->
                LocalDate.now().compareTo(transport.getArrivalDate()) <= 0).map(Transport::getTrackingId).collect(Collectors.toList());
    }

    private void possibleFail() {
        final Long invocationNumber = counter.getAndIncrement();
        if (invocationNumber % 4 > 1) { // alternate 2 successful and 2 failing invocations
            throw new RuntimeException("Service failed.");
        }
    }

    private void extraValidationOrProcessing() {
        // let's pretend there's a call to a different user service to check status/association with transports'
        // companies, etc
        if (RANDOM.nextInt(50) % 2 == 0) {
            LOGGER.error("The required information could not be retrieved right now.Try again.");
            throw new HTTPException(408);  //408 request timeout
        }
        if (40 <= RANDOM.nextInt(50)) {
            LOGGER.error(String.format("Something went wrong here at %s", LOGGER.getName()));
            throw new KnownProcessingException();
        }
    }

}

Enter fullscreen mode Exit fullscreen mode

Rest

import org.jboss.logging.Logger;
import org.tinyg.exception.KnownProcessingException;
import org.tinyg.model.Transport;
import org.tinyg.service.TransportService;

import javax.inject.Inject;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import java.util.Collections;
import java.util.List;

@Path("/transports")
public class TransportRest {

    private static final Logger LOGGER = Logger.getLogger(TransportRest.class);

    @Inject
    TransportService transportService;

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public List<Transport> transports() {
        try {
            return transportService.getAllTransports();
        } catch (KnownProcessingException exception) { //we'll catch it like a bunch of responsible devs
            LOGGER.error(exception.getMessage());
            return Collections.emptyList();
        }
    }

    @GET
    @Path("/valid")
    @Produces(MediaType.TEXT_PLAIN)
    public List<String> getValidTransports() {
        return transportService.getValidTrackingIds();
    }

}
Enter fullscreen mode Exit fullscreen mode

The model consists of 3 entities: Transport, which contains a set of Containers, which in turn, depending on their types, can contain one or more ContainerLoads.
The simple CDI bean will contain a data repository object. In real life, this would be handling queries to a database, but since that is not the main objective of this post, we'll do it like this for now. Finally, the TransportRest class is where we'll expose our endpoints, which we can call via Postman (or CURL command, or a browser) to get information about the transports. One of the REST methods is meant to be faulty, causing issues when trying to retrieve the list of available transports.

Retries

Sometimes one of our microservices may not behave as expected and has the potential to impact others if it becomes faulty. But not all situations are showstoppers and failures can be easily avoided just by calling the same service one more time. To prevent disturbing the upstream or the downstream, we can use the Retry mechanism, which will recover an operation from failure by invoking it again until it reaches its stopping criteria. To use this feature we just add the @Retry annotation to the faulty method, and by default we're introduced to some of the configuration parameters: maxRetries, retryOn, abortOn, delay. Let's explore these a little bit.
In the initial setup, an exception would be thrown some of the times when we try to retrieve a list of transports. We can annotate that method with @Retry, and based on what needs to be addressed, we can also configure it:

@Retry(
            retryOn = HTTPException.class, 
            maxRetries = 5,
            abortOn = KnownProcessingException.class,
            delay = 500
    )
Enter fullscreen mode Exit fullscreen mode

What we're saying here is that we want to retry calling this method in case an HTTPException occurs. This one is pretty generic, but one can choose a more specific exception depending on their use case. The maximum times to retry this call will be 5, the delay between calls is 500 milliseconds, and under no circumstance should it be done again when an KnownProcessingException arises. The last exception is a custom one, just for demonstrating the functionality, but it highlights that there are ways of dealing with known, intentional failures as well (such as maintenance, for example). So now things will be different, upon trying to retrieve our list of transports, the console will look different and although communication with another service is not perfectly smooth, the user will not be aware of it. And if something happens that our functionality cannot continue, then that exception will be handled accordingly and the return value will be an empty list.
Retry

Timeouts

Let's imagine we have a functionality in the application that could display how much CO2 a means of transport is producing from point A to point B, based on the current load of containers and some other regional environmental factors. This is definitely not a crucial feature, it's mainly there for the users' information and also to show a commitment that future improvements are coming. So when the transport is being tracked, an updated panel of emissions will also be displayed.
In case the system is overloaded and the logic behind obtaining this data is too costly in terms of time, we can just go ahead without it and render the UI components.
We enhance our application with a new endpoint and a service class that will take care of that:

EmissionsCalculatorRest

import org.eclipse.microprofile.faulttolerance.Timeout;
import org.jboss.logging.Logger;
import org.jboss.resteasy.annotations.jaxrs.PathParam;
import org.tinyg.service.EmissionsCalculatorService;

import javax.inject.Inject;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

@Path("/emissions")
public class EmissionsCalculatorRest {

    @Inject
    EmissionsCalculatorService emissionsCalculatorService;

    private static final Logger LOGGER = Logger.getLogger(EmissionsCalculatorRest.class);
    private static final Random RANDOM = new Random();

    AtomicLong attempts = new AtomicLong(1);

    @GET
    @Path("/{trackingId}")
    @Timeout(2500)
    public String calculateEmissions(@PathParam String trackingId) {
        long started = System.currentTimeMillis();
        final long invocation = attempts.getAndIncrement();

        try {
            randomDelay();
            LOGGER.infof("EmissionsCalculatorService invocation #%d returning successfully", invocation);
            return emissionsCalculatorService.calculate(trackingId);
        } catch (InterruptedException e) {
            LOGGER.errorf("EmissionsCalculatorService invocation #%d timed out after %d ms",
                    invocation, System.currentTimeMillis() - started);
            return null;
        }
    }

    private void randomDelay() throws InterruptedException {
        Thread.sleep(RANDOM.nextInt(5000));
    }
}


Enter fullscreen mode Exit fullscreen mode

EmissionsCalculatorService

import javax.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class EmissionsCalculatorService {

    public String calculate(final String trackingId) {
        //Fetch transport based on trackingId.
        //Figure out the load.
        //Try to locate it based on last checkpoint.
        //Retrieve weather conditions, etc
        //Actual calculations...
        return "CO2 report has been computed.";

    }
}

Enter fullscreen mode Exit fullscreen mode

An intentional delay has been set to occur somewhere between 0 and 5 seconds (normally we would not want to have such long waiting times, this is just easier to perceive as a human) and the timeout has been configured to 2.5 seconds.
Upon requesting this information a few times, we can see in the logs that some of them will time out with an org.eclipse.microprofile.faulttolerance.exceptions.TimeoutException. The successful ones will fetch us the information: "CO2 report has been computed.".
Timeout

Fallbacks

Now is the best time to bring in a fix for the previous org.eclipse.microprofile.faulttolerance.exceptions.TimeoutException. Let's annotate the method with a @Fallback annotation and provide another function that could take over in case of failure. In our case, it can provide a simpler and faster calculation of emissions, or it could return a generic number that generally falls somewhere in the average values.

@GET
    @Path("/{trackingId}")
    @Timeout(2500)
    @Fallback(fallbackMethod = "genericInformationAboutCO2")
    public String calculateEmissions(@PathParam final String trackingId) {
...

}
Enter fullscreen mode Exit fullscreen mode

Keep in mind that the fallback method needs to match the original one in terms of parameters, it would look something like:

private String genericInformationAboutCO2(final String trackingId) {
        return "An average consumption of 4,2 kg / 100 km corresponds to 4,2 kg x 2666 g/kg = 112 g of CO2/km.";
    }
Enter fullscreen mode Exit fullscreen mode

By requesting this emissions data a few more times, we can see how well the fallback is handling it:

Fallback

Circuit Breakers

If a part of our system becomes temporarily unstable we might want to cut access to it for a while in order to limit the number of failures. The job of a circuit breaker is to record successful and failed invocations of a method, and when the number of fails reaches a defined threshold, it cuts off invocations to that area. In the TransportService class we will have:

@CircuitBreaker(
            requestVolumeThreshold = 4,
            failureRatio = 1 / 2,
            delay = 10000
    )
    public List<String> getValidTrackingIds() {
        possibleFail();
        return transports.values().stream().filter(transport ->
                LocalDate.now().compareTo(transport.getArrivalDate()) <= 0).map(Transport::getTrackingId).collect(Collectors.toList());
    }

    private void possibleFail() {
        final Long invocationNumber = counter.getAndIncrement();
        if (invocationNumber % 4 > 1) { // alternate 2 successful and 2 failing invocations
            throw new RuntimeException("Service failed.");
        }
    }
Enter fullscreen mode Exit fullscreen mode

By checking out the configuration parameters, we will see that requestVolumeThreshold is 4, meaning that the last 4 invocations will be checked. The failureRatio is 1/2 (default value), which indicates that a circuit breaker will open when 2 consecutive calls of the last 4 invocations fail. We set the delay to 10 seconds, which is how long the circuit breaker will stay open. Ten seconds was the value chosen so it can be easily perceived.
We can see in the logs that in this 10-second interval the circuit breaker is open and blocking any execution from happening.
Circuit Breaker

When services are collaborating synchronously, there is a possibility that the called service may exhibit unavailability or high latency to the point that it becomes unusable. This can lead to precious resources such as threads and time being consumed for no reason, sometimes even going as far as exhaustion. A failing service in the application could potentially cascade into the failure of other services throughout the system, that's why it is important to be aware of possible downfalls and be prepared.

Adding a Fallback to your circuit breakers

Once we implement a circuit breaker, it might be desirable to also provide an alternative for the downtime. That can be achieved by supplying a fallback class. We once more use the @Fallback annotation:

@CircuitBreaker(
            requestVolumeThreshold = 4,
            failureRatio = 1 / 2,
            delay = 10000
    )
    @Fallback(CircuitBreakerFallback.class)

    public List<String> getValidTrackingIds() {

   ...
Enter fullscreen mode Exit fullscreen mode

We then implement the FallbackHandler interface in our fallback class, where we override the "handle" method. Keep in mind that "handle" has to have the same return type as the annotated method.


import org.eclipse.microprofile.faulttolerance.ExecutionContext;
import org.eclipse.microprofile.faulttolerance.FallbackHandler;

import java.util.Collections;
import java.util.List;

public class CircuitBreakerFallback implements FallbackHandler<List<String>> {

    @Override
    public List<String> handle(ExecutionContext executionContext) {
        return getValidTrackingIds();
    }

    public List<String> getValidTrackingIds() {
        return Collections.singletonList("Has reached fallback for circuit breaker. You can provide and alternative " +
                "here.");
    }
}
Enter fullscreen mode Exit fullscreen mode

We can now see that although the circuit breaker is open, an alternative implementation can take over and deal with the request.
Circuit breaker fallback

Conclusions

And there you have it, we have seen some of the most useful and easy-to-use ways of keeping our applications afloat, even when it keeps running into trouble.

Resources

Discussion (0)