Skopos: Monitor Critical API Workflows

What is Skopos?

Skopos is an open-source API monitoring tool designed for testing multi-step API workflows and running groups of tests in parallel.

Meet the Team that built Skopos

Nykaela Dodson
Hans Elde
Katherine Ebel
Gagan Sapkota

What is an API?

APIs are at the heart of every piece of modern software in use, and they act as the connective tissue that binds together application services.

Consider a weather application: when a user clicks a button to check the weather in Seattle, the weather app sends a GET request to the weather API. Next, the weather API sends back a response with the requested data. The weather app then uses that data to display the current weather to the user.
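
To make that request/response cycle concrete, here is a minimal TypeScript sketch of what the weather app's call might look like. The endpoint URL and response fields are hypothetical, not a real weather API.

```typescript
// A sketch of the request/response cycle described above.
// The endpoint URL and response fields are hypothetical, not a real weather API.
async function getSeattleWeather(): Promise<void> {
  const response = await fetch("https://api.example-weather.com/current?city=Seattle");
  if (!response.ok) {
    throw new Error(`Weather API responded with status ${response.status}`);
  }
  // e.g. { city: "Seattle", tempF: 54, conditions: "Rain" }
  const data = await response.json();
  console.log(`Current weather in ${data.city}: ${data.tempF}°F, ${data.conditions}`);
}

getSeattleWeather().catch(console.error);
```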

API Example

It is common for an application to rely on multiple API endpoints.

What happens when APIs fail?

When one API fails, other dependent APIs may fail too with cascading failures that can have unfortunate consequences. Let’s take a look at one example where an API failure had a noticeable impact.

In 2019, UberEats customers were able to order huge amounts of free food because PayTm, one of the payment services UberEats integrates with, changed one of its API endpoints. Before the change, the endpoint returned the same response every time a customer placed an order with insufficient funds to pay for it. After the change, the endpoint was no longer idempotent [1]: after the first attempt to place an order with insufficient funds, subsequent attempts received a new, unexpected error message from the PayTm endpoint.

Because of the way UberEats had integrated their application with PayTm, this unexpected error message allowed orders to go through, even though payment had not actually been processed successfully. Before UberEats realized the error, many customers had already ordered thousands of dollars worth of food for free.

UberEats Example

It is important to identify issues like this as early as possible when they occur in production. Fortunately, API failures can be detected, because they tend to manifest in a few different ways:

  1. Data Payload: As we just saw with UberEats, the data payload sent back in the API's response might be unexpected.

  2. Response Time: The failure might show up as an unacceptable response time.

  3. Status Code: The HTTP status code might be wrong.

To catch failures as quickly as possible, companies invest in API monitoring tools that track the performance of API endpoints and look for these signs of failure. API monitoring tools detect API failures by making requests to endpoints at specified intervals and checking the validity of their response data [2].

Whenever there is a mismatch between the expected response data and the actual response data, the endpoint is considered to have failed.

What is API Monitoring?

API monitoring is the process of making requests to API endpoints at set intervals and comparing the response to expected values to check both the availability of API endpoints and the validity of their response data. The goal of API monitoring is to spot issues that may affect users as early as possible.

We think it's helpful to view how API monitoring tools work in terms of a set of core functionalities shared by nearly all of them. At a broad level, these break down into four distinct steps: Definition, Execution, Scheduling, and Notification. An API monitoring tool allows users to define tests, execute those tests on a schedule, and notify various targets when the tests fail.

Definition

Definition

To define a test, a user provides the information necessary for communicating with the API endpoint. This might include the HTTP method, endpoint URL, headers, and request body. The user then defines assertions that compare the expected and actual responses: an assertion is how the user specifies what status code they expect, what response time is acceptable, or what they expect in the response body. Definition functionality is generally offered through either a graphical user interface or a command-line tool.
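
To make this concrete, here is a minimal sketch of what a test definition with assertions could look like as a plain TypeScript object. The field names and structure are illustrative assumptions, not Skopos' actual test schema.

```typescript
// Illustrative shape only -- not Skopos' actual test schema.
interface Assertion {
  property: "status" | "latency" | "body"; // which part of the response to check
  comparison: "equals" | "lessThan" | "contains";
  expected: string | number;
  path?: string; // e.g. a path into the response body
}

interface TestDefinition {
  name: string;
  method: "GET" | "POST" | "PUT" | "DELETE";
  url: string;
  headers?: Record<string, string>;
  body?: unknown;
  assertions: Assertion[];
}

const checkoutTest: TestDefinition = {
  name: "Create order",
  method: "POST",
  url: "https://api.example.com/orders",
  headers: { "Content-Type": "application/json" },
  body: { itemId: 42, quantity: 1 },
  assertions: [
    { property: "status", comparison: "equals", expected: 201 },
    { property: "latency", comparison: "lessThan", expected: 500 }, // milliseconds
    { property: "body", comparison: "equals", path: "order.status", expected: "created" },
  ],
};
```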

Execution

Execute

An API monitoring tool then has to execute the tests. This means the tool sends requests to the specified API endpoint, receives a response, and checks assertions associated with that test against the received response.

Scheduling

Schedule

The tool must also be able to schedule the tests to run. In this case, scheduling is not restricted to time-based scheduling but also refers to arranging for tests to execute from different geographical locations or setting a deployment trigger to execute tests as part of a CI/CD pipeline. For example, tests scheduled based on time can be set to execute every minute, every 15 minutes, or every hour depending on the use case and how quickly API failures should be responded to.

Notification

Notify

When APIs return unexpected responses, the monitoring tool must be able to alert interested parties about the failures. This could include internal notifications that alert self-healing processes, or external notifications to users of the monitoring tool through integrations with PagerDuty or Slack.

While API monitoring tools have these core commonalities, they are not generally one-size-fits-all products, and these functionalities – definition, execution, scheduling, and notification – can be fine-tuned to suit different use cases. Knowing which features to target with an API monitoring tool depends on which approach to API monitoring makes the most sense for that specific use case.

Introducing Skopos

Skopos Logo
Here are some of the key functionalities we hoped to provide with Skopos to simulate workflows that rely on multiple APIs.

Multi-Step Tests

Multi-Step

Multi-step tests are designed to simulate complex workflows that consume multiple APIs. They are typically used when different services of an application need to communicate with one another in sequence over API calls to complete common functionality, such as an API endpoint that requires authentication through a token [3]. API call chaining like this is particularly prevalent in a microservices architecture. Consider a user workflow that includes adding an item to a cart, making a payment, scheduling a delivery, and updating the database.

When all of these steps work as expected, the test passes. However, when one or more of these steps do not work as expected, for example, if the payment step fails, the test would fail.

Parallel Test Execution

Parallel

Another approach is parallel test execution. Simulating multi-step workflows requires running tests sequentially, but the disadvantage is that it takes extra time. For example, executing three tests that take 200 milliseconds each would take 600 milliseconds sequentially, while the same three tests sent in parallel would take only 200 milliseconds. When the tests target different endpoints, this feature can save time when running a large number of tests [4].

Executing tests in parallel can save computing time when the requests do not depend on each other. Furthermore, making parallel requests can also be used for load testing because you can configure multiple tests to make a request to the same API endpoint in parallel to see how the endpoint performs.
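
Here is a minimal sketch of that timing difference, where runTest is a hypothetical stand-in for sending one test's request and checking its assertions:

```typescript
// runTest is a hypothetical stand-in for "send the request and check assertions".
async function runTest(name: string): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 200)); // simulate a ~200 ms API call
  console.log(`${name} finished`);
}

// Sequential: roughly 600 ms for three 200 ms tests.
async function runSequentially(): Promise<void> {
  await runTest("test 1");
  await runTest("test 2");
  await runTest("test 3");
}

// Parallel: roughly 200 ms, since the requests overlap.
async function runInParallel(): Promise<void> {
  await Promise.all([runTest("test 1"), runTest("test 2"), runTest("test 3")]);
}
```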

multi step collections run in parallel


While collections could be executed in parallel,
multi-step tests within each collection should be executed sequentially.

Skopos is open source, has multi-step and parallel test functionality, and offers a user-friendly GUI that meets the needs of our use case.

Skopos’ Core Application Components

For Skopos’ core application, we knew we needed to build the definition functionality and provide a way to set up multi-step tests. An important part of this is referencing values from previous tests, which was challenging because those values are not accessible until the previous test has completed.

We also noticed we were working with a large amount of data, and storing and retrieving that data quickly became a complex challenge. Let's look at the first challenge: how we reference values downstream.

Creating Reference Flags

We decided to group tests that reference values from other tests into what we call a collection. This way, when the tests within a collection are run sequentially, the values needed downstream will be available by the time they are needed.

Collections in Parallel


A value from the response of test 1 can be accessed by test 2 or 3,
but not by tests in another collection, such as tests 4, 5, and 6.

How does a test know that the user wants to interpolate a previous value? We created a reference flag, @{{}}, to mark where values should be interpolated. This is how we solved the challenge of letting a user specify which previous values should be inserted into a test later on.
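
Here is an illustrative sketch of the idea: a test references a value from an earlier test with the @{{}} flag, and a small interpolation step substitutes the real value once it is available. The path syntax inside the flag and the helper function are simplified assumptions, not Skopos' actual implementation.

```typescript
// Illustrative only: the path syntax inside @{{...}} and this helper are simplified
// stand-ins for how a test might refer to a previous test's response.
const profileTest = {
  name: "Fetch profile",
  method: "GET",
  url: "https://api.example.com/profile",
  // The token is not known until the "Log in" test has run.
  headers: { Authorization: "Bearer @{{Log in.body.token}}" },
};

// Replace each @{{...}} flag with a value looked up from previously completed tests.
function interpolate(template: string, previousResponses: Record<string, unknown>): string {
  return template.replace(/@\{\{(.+?)\}\}/g, (_match, path: string) => {
    const value = path
      .trim()
      .split(".")
      .reduce<unknown>(
        (acc, key) => (acc as Record<string, unknown> | undefined)?.[key],
        previousResponses
      );
    return String(value ?? "");
  });
}

// interpolate(profileTest.headers.Authorization, { "Log in": { body: { token: "abc123" } } })
// => "Bearer abc123"
```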

Data Storage and Fetching

During development, we started by working with REST endpoints; however, this became limiting. Sometimes we could not fetch all of the data we needed from one endpoint (under-fetching), and other times we fetched more data than we needed (over-fetching).

First, we tried to adjust our queries to target the specific data we needed; however, the queries grew in complexity, and the custom endpoints we added started to drift away from a proper REST implementation.

Apollo Stack

We decided to use GraphQL because it allowed us to retrieve the precise data we needed. In particular, we added Apollo Server to the backend of our application. Any components that would need to communicate with the database could then use Apollo Client, and the backend running Apollo Server would act as the single gateway to the database.
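
As a rough sketch, a frontend component could ask Apollo Server for exactly the fields it needs with a query like the one below. The types and field names here are hypothetical, not Skopos' actual GraphQL schema.

```typescript
import { ApolloClient, InMemoryCache, gql } from "@apollo/client";

// The type and field names are hypothetical, not Skopos' actual GraphQL schema.
const client = new ApolloClient({
  uri: "http://localhost:4000/graphql", // assumed Apollo Server endpoint
  cache: new InMemoryCache(),
});

// Ask for exactly the fields the UI needs -- no more, no less.
const GET_COLLECTION_TESTS = gql`
  query GetCollectionTests($collectionId: ID!) {
    collection(id: $collectionId) {
      id
      title
      tests {
        id
        name
        url
      }
    }
  }
`;

async function loadCollection(collectionId: string) {
  const { data } = await client.query({
    query: GET_COLLECTION_TESTS,
    variables: { collectionId },
  });
  return data.collection;
}
```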

This was a great start, however, while implementing our data model, we ended up making frequent updates to our schema. To mitigate this issue, we decided to use Prisma. Prisma is an object-relational mapper, or ORM. It allowed us to not only interact with the database as if it were an object but also update and migrate our schema with ease.
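
For example, with Prisma Client, fetching a collection together with its tests becomes a typed method call rather than hand-written SQL. The model and field names below are assumptions for illustration, not necessarily Skopos' actual Prisma schema.

```typescript
import { PrismaClient } from "@prisma/client";

// The Collection and Test models (and their fields) are assumptions for illustration.
const prisma = new PrismaClient();

async function getTestsForCollection(collectionId: number) {
  // Fetch a collection and its related tests without writing SQL by hand.
  return prisma.collection.findUnique({
    where: { id: collectionId },
    include: { tests: true },
  });
}
```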

Full Data Stack

Here is our final data stack: Apollo Client communicating from the frontend, and a backend running Apollo Server with GraphQL and Prisma, which then communicates with our PostgreSQL database.

Building Execution Functionality

One core problem we faced when implementing the execution functionality was handling the complexity of the code for making API calls. We decided to group this functionality into what we call the collection runner.

Here are the steps for making this possible:

  1. A POST request is sent to an Express endpoint on the collection runner (a minimal sketch of this entry point follows the list).
  2. Data for the tests that belong to the collection is fetched from the database.
  3. Requests are processed by interpolating the values the user has referenced in the test with our reference flag, @{{}}.
  4. The first request is ready to be sent to the specified API endpoint.
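
Here is a minimal sketch of what that entry point could look like; the route, request shape, and runCollection helper are illustrative assumptions rather than Skopos' actual code.

```typescript
import express from "express";

// A sketch of the collection runner's entry point. The route and the
// runCollection helper are illustrative assumptions, not Skopos' actual code.
const app = express();
app.use(express.json());

// Hypothetical helper: fetch the collection's tests, interpolate @{{}} reference
// flags, send each request in order, and check its assertions.
async function runCollection(collectionId: number): Promise<{ collectionId: number; passed: boolean }> {
  return { collectionId, passed: true }; // placeholder result
}

app.post("/run-collection/:collectionId", async (req, res) => {
  const collectionId = Number(req.params.collectionId);
  try {
    const results = await runCollection(collectionId);
    res.status(200).json(results);
  } catch (error) {
    res.status(500).json({ error: String(error) });
  }
});

app.listen(3003, () => console.log("Collection runner listening on port 3003"));
```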

Collection Runner Logic

Once a response is received from the API, the assertions for that test are checked. If they fail, interested parties are notified. If the assertions pass, the process is repeated with the next request.

After this process has been repeated for each request in the collection, the execution phase is complete.

As you can see, this requires complex logic and keeping track of variables that change at different parts of this process. We started looking at implementing a state machine to run collections and keep track of this logic.

A state machine transitions through predefined states;
context values can be updated and used throughout these states.

State Machine

A state machine helps declaratively model application logic. It defines the states an application can exist in and the actions that take the machine from one state to another [5]. State machines also keep track of context: values saved to the machine's context can be read and updated from different states. We decided to take this approach and began moving our complex execution logic to XState, a library for creating and running state machines.

Here is how we implemented it:

  1. A POST request is sent to an Express endpoint on the collection runner.
  2. The Collection Runner Machine moves into the initializing state, where a collectionRunId is generated that is later used to save results to the database. For each test in the collection, child state machines handle the logic of processing the request, sending the request, and making assertions on the response (see the simplified sketch after this list). Each child state machine has its own states and context, which are sent back to the parent machine in an event.
  3. The Request Processor Machine is invoked, and values for tests that reference previous requests are interpolated.
  4. The Request Runner Machine is invoked; it makes requests to the specified API endpoint, waits for the response, and passes the data to the parent machine to be saved as responses.
  5. The Assertion Runner Machine is invoked, which uses the API's response to evaluate the assertions defined for the current test. Assertion results are then saved to the collection runner's context.
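
The sketch below shows a heavily simplified version of this pattern with XState: a parent machine that initializes a run, invokes a service for each test, accumulates results in context, and ends in a complete or failed state. The state names, context fields, and inline service are illustrative, not Skopos' actual machines.

```typescript
import { createMachine, assign, interpret } from "xstate";

// A heavily simplified sketch of a collection runner machine (XState v4).
// State names, context fields, and the inline service are illustrative.
interface RunnerContext {
  collectionRunId: string | null;
  remainingTests: string[];
  results: boolean[];
}

const collectionRunnerMachine = createMachine<RunnerContext>({
  id: "collectionRunner",
  initial: "initializing",
  context: { collectionRunId: null, remainingTests: ["test 1", "test 2"], results: [] },
  states: {
    initializing: {
      // Generate an id used later to save results to the database.
      entry: assign<RunnerContext>({ collectionRunId: () => `run-${Date.now()}` }),
      always: "running",
    },
    running: {
      // In the real tool, child machines process the request, send it, and check
      // assertions; here a single promise stands in for all three.
      invoke: {
        src: (ctx: RunnerContext) => {
          console.log(`Running ${ctx.remainingTests[0]} for ${ctx.collectionRunId}`);
          return Promise.resolve(true); // pretend the test passed
        },
        onDone: {
          target: "checkingProgress",
          actions: assign<RunnerContext, any>({
            results: (ctx, event) => [...ctx.results, event.data as boolean],
            remainingTests: (ctx) => ctx.remainingTests.slice(1),
          }),
        },
        onError: "failed",
      },
    },
    checkingProgress: {
      always: [
        { target: "complete", cond: (ctx: RunnerContext) => ctx.remainingTests.length === 0 },
        { target: "running" },
      ],
    },
    complete: { type: "final" },
    failed: { type: "final" },
  },
});

interpret(collectionRunnerMachine)
  .onTransition((state) => console.log(state.value))
  .start();
```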

XState Walkthrough

This process repeats until all tests in a collection have been completed. At that point, the collection runner machine moves to the complete state and the collection run is done.

XState

Let's take a look at how we handled failures. If an assertion does not pass, if an error occurs at any point during a state machine's invocation, or if the parent Collection Runner Machine does not receive an event from a child state machine, the parent machine enters the failed state. At that point, the remaining tests in the collection do not run, and an appropriate error message is sent to any interested parties.

In summary, the collection runner receives a POST request with a collectionId and invokes multiple state machines that move through a sequence of states to complete the tests in a collection. This approach makes multi-step tests possible while keeping the complex logic that comes with them organized.

At this point, we had implemented a large part of our API monitor's definition and execution components. Skopos could now define and execute multi-step tests, but it could not yet perform parallel testing, scheduling, or notification. We decided to solve most of these problems at the infrastructure level.

Building Skopos’ Cloud Infrastructure

Full Architecture

As you might remember, every API monitoring tool needs four key components: definition, execution, scheduling, and notification. Here we take a look at the infrastructure for each of those components and how they fit together in our final architecture.

Definition

definition architecture

Visualizing our cloud architecture in the context of our core API monitoring functionality: our definition component.

The frontend build file is stored in an S3 bucket. The backend is hosted in a Docker container on an EC2 instance, fronted by an Elastic Load Balancer. The database is hosted on RDS.

Execution

Collection Runner

Visualizing our cloud architecture in the context of our core API monitoring functionality: our execution component.

The collection runner was already able to execute multi-step tests; however, we also wanted to execute tests in parallel. To do this, we needed a way to run multiple instances of the collection runner at the same time. We explored two ways to accomplish this: a single-tenant solution and a multi-tenant solution.

For a single-tenant solution, we would provision multiple Virtual Machines, each hosting its own instance of the collection runner. Then we would add a load balancer to direct traffic to available collection runner instances. This approach would enable parallel testing, however, it has a few downsides. First, provisioning and managing each Virtual Machine adds complexity. Second, we would be wasting resources because the collection-runner is a relatively lightweight process. It would not utilize the full resources of even the cheapest Amazon EC2 instance.

For the multi-tenant solution, containers are a great fit. One benefit of cloud infrastructure is the ability to take advantage of multi-tenancy, where multiple instances of an application share the same computing resources [6]. To implement this approach, we could run multiple instances of the collection runner in Docker containers, with several containers residing on a single machine. A load balancer would then direct traffic to different containers to enable parallel testing. Because the single-tenant approach came with significant drawbacks, we decided on a multi-tenant solution.

We could implement the multi-tenant approach in two ways: AWS Fargate or ECS on EC2. Fargate is a serverless container service that abstracts away the complexity of managing Virtual Machines, and AWS dynamically scales Fargate tasks depending on demand. However, Fargate's autoscaling can take up to 15 seconds to spin up new containers, which would increase compute time. It is also a pay-as-you-go service, and because the collection runner makes multiple HTTP requests and waits in between to receive responses, using Fargate would mean paying for that wait time. For our needs, a better solution was to run containers on EC2 instances, so that we pay for the machine rather than for compute time.

To implement this, each EC2 instance houses multiple containers. Although we still need to provision and manage Virtual Machines, we can use their resources more efficiently. In summary, we host multiple instances of the collection runner in Docker containers on EC2 instances and use a load balancer to direct traffic to different containers.

At this point, we had both the definition and the execution component in the cloud.

Scheduling

Scheduling

Visualizing our cloud architecture in the context of our core API monitoring functionality: our scheduling component.

For scheduling, we considered two main options for implementation: cron jobs and AWS EventBridge.

The first option was to run cron jobs from a Node.js process. This approach would not require any additional infrastructure, because the cron job logic could be co-located with the backend. However, coupling the scheduling functionality to the backend introduces a vulnerability: if the node storing the cron jobs were to go down, the scheduled tasks stored on that node would be lost. Because of this, we then considered AWS EventBridge.

AWS EventBridge is a serverless event bus that can receive and route events based on user-defined rules, which can include cron expressions that trigger events on a schedule. Although using EventBridge adds a component to our architecture, it decouples the scheduling functionality from existing processes and prevents the loss of schedules from potential node failures. Because our aim with Skopos was to provide a reliable API monitoring tool, we decided to use EventBridge.

At this point, we started implementing the scheduling functionality with EventBridge. However, we found that EventBridge communicates over HTTPS, while the collection runner communicates over HTTP. To allow the two to communicate, we would either need to acquire and manage SSL certificates for the collection runner or place an intermediary between EventBridge and the collection runner. We opted to use a Lambda function as the intermediary: EventBridge invokes the Lambda function, which then sends a request to the collection runner.
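
A minimal sketch of such a Lambda intermediary is shown below, assuming the collection runner exposes an HTTP endpoint like POST /run-collection/:collectionId and the runner's address is provided through an environment variable; both names are hypothetical.

```typescript
// A sketch of the Lambda intermediary between EventBridge and the collection runner.
// COLLECTION_RUNNER_URL and the /run-collection/:collectionId route are hypothetical.
// Assumes a Node.js 18+ runtime, where fetch is available globally.
interface ScheduleEvent {
  collectionId: number; // passed in through the EventBridge rule's input
}

export const handler = async (event: ScheduleEvent): Promise<{ statusCode: number }> => {
  const runnerUrl = process.env.COLLECTION_RUNNER_URL; // e.g. the load balancer's HTTP address
  const response = await fetch(`${runnerUrl}/run-collection/${event.collectionId}`, {
    method: "POST",
  });
  return { statusCode: response.status };
};
```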

Notification

Notification

Visualizing our cloud architecture in the context of our core API monitoring functionality: our notification component.

For notifications, we chose AWS' Simple Notification Service, or SNS. SNS makes it straightforward to send notifications to various services, such as PagerDuty and email.

When setting up a schedule, the backend creates an SNS topic and its subscribers. The user can add PagerDuty and email endpoints as subscribers to the topic corresponding to the monitor. If a failure occurs, the execution component publishes a message to the SNS topic, and the subscribers receive the message.

We also wanted Skopos to send notifications through Slack; however, Slack requires a specific data payload that SNS could not accommodate. We therefore chose to send Slack notifications directly from the collection runner: when the execution component publishes a failure message to SNS, it also sends a notification to Slack if the user has added a Slack webhook to the monitor.
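
Here is a rough sketch of that notification path, assuming the topic ARN and optional Slack webhook URL are stored with the monitor; the names and payload shapes are illustrative.

```typescript
import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";

// A sketch of how the collection runner might report a failure.
// The topic ARN, webhook URL, and message shape are illustrative assumptions.
const sns = new SNSClient({ region: "us-east-1" });

async function notifyFailure(topicArn: string, message: string, slackWebhookUrl?: string) {
  // Publish to the monitor's SNS topic; PagerDuty and email subscribers receive it.
  await sns.send(new PublishCommand({ TopicArn: topicArn, Message: message }));

  // Slack needs its own payload shape, so it is called directly instead of through SNS.
  if (slackWebhookUrl) {
    await fetch(slackWebhookUrl, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text: message }),
    });
  }
}
```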

Conclusion

Full Architecture

Here is an overview of our AWS infrastructure. The frontend, built with React, is hosted in an S3 bucket. It communicates with the backend running Apollo Server, which is hosted in a Docker container and managed by ECS. The backend communicates with the PostgreSQL database, hosted on RDS.

The collection runner instances, which you can see at the bottom of the diagram, are hosted in Docker containers running on EC2 instances and managed by ECS. The frontend communicates with the collection runner to execute tests on demand, and the collection runner communicates with the backend to fetch collection data.

Scheduling is handled by AWS EventBridge. The backend server communicates with EventBridge to create a rule for each schedule, and EventBridge triggers a Lambda function that calls the collection runner to execute a collection of tests on schedule.

Finally, notifications are implemented with SNS. When creating a schedule, the backend creates an SNS topic and subscribers. When a failure occurs during execution, the collection runner publishes a message to the corresponding SNS topic and the subscribers are notified of the failure.

Thank you for taking the time to read about Skopos. I hope you enjoyed learning about API monitoring and building a cloud-based API monitoring tool!

We are looking for our next opportunity. If you like our project, have further questions, or think we might be a good fit for your team, please reach out!

Nykaela Dodson
Hans Elde
Katherine Ebel
Gagan Sapkota

Take a look at the full case study here or watch our presentation here

Footnotes


  1. https://twitter.com/GergelyOrosz/status/1502947315279187979 

  2. https://www.splunk.com/en_us/data-insider/what-is-api-monitoring.html 

  3. https://docs.datadoghq.com/synthetics/multistep?tab=requestoptions 

  4. https://www.techtarget.com/searchsoftwarequality/tip/How-and-why-to-do-parallel-testing 

  5. https://deepsource.io/blog/using-state-machine-to-write-bug-free-code/ 

  6. https://digitalguardian.com/blog/saas-single-tenant-vs-multi-tenant-whats-difference 
