AWS Step Functions, the serverless workflow orchestration service from AWS, has been around for several years now. Many blog posts (like Using AWS Step Functions State Machines to Handle Workflow-Driven AWS CodePipeline Actions), presentations and learning courses (e.g. Complete guide to AWS Step Functions) have been published that show its capabilities and rich feature set.
However, not many of them deal with DevOps tasks, maybe because until recently Step Functions offered only a limited set of direct service integrations, such as AWS Lambda. Accessing any other AWS API required, for instance, an AWS SDK or AWS CLI commands in a script or Lambda function. This changed a few months ago.
In September 2021, AWS added support for over 200 AWS services through AWS SDK integrations, making over 9,000 AWS API actions available. Only a few weeks earlier, another major enhancement had been released: the new Workflow Studio, a low-code visual tool for building state machines. It is now easier than ever to build workflows, from simple to complex.
Around the same time, we joined a migration project at a customer who was moving a large application, previously hosted on-premises, to AWS using services like EC2, RDS and ALB. Some typical operational tasks, like managing the database servers, are now gone because AWS takes care of the heavy lifting, but new ones have arrived and others stay the same.
As the project proceeded, we thought about how to automate as many operational tasks as possible using native AWS services. The AWS Step Functions service integrations arrived at just the right time and made our lives much easier. We were able to handle many recurring tasks by creating state machines, some of which are triggered by scheduled Amazon EventBridge rules while others are started manually via the CLI or console.
The simple one
The AWS Systems Manager service integration ssm:describeInstancePatches is used to get the list of all patches for an instance. The list is then sent to an Amazon SNS topic and delivered to the email inbox of the person in charge, who checks whether any patch might conflict with the application requirements.
The Workflow Studio editor makes it quite easy to assemble a workflow and to enrich every step with the required parameters and settings. All service integrations are based on the AWS SDK API calls so that the parameters can be retrieved from the SDK documentation (an example is shown for Systems Manager API).
Workflow Studio allows exporting the state machine definition to a JSON or YAML file so that it can be included in an infrastructure-as-code project using, for instance, Terraform.
Information like the EC2 instance ID or the SNS topic ARN can be injected at deploy time using, for instance, Terraform template variables, as shown in the example JSON state machine definition.
The big benefit of using Step Functions is that this task requires no custom code and no additional overhead for managing a Lambda function. Best of all, the state machine is quite intuitive to create, self-documenting and easy to follow and review.
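A minimal sketch of what such a definition could look like is given here, built as a Python dictionary and serialized to JSON. It is not the original definition from the project: the state names, the placeholder names ${instance_id} and ${sns_topic_arn}, and the use of States.JsonToString to turn the patch list into a message string are assumptions made for illustration.

```python
import json

# Hypothetical Terraform template placeholders; a templatefile() call
# would replace them with real values at deploy time.
INSTANCE_ID = "${instance_id}"
SNS_TOPIC_ARN = "${sns_topic_arn}"


def build_patch_report_definition(instance_id: str, topic_arn: str) -> dict:
    """Return a two-step state machine: list patches, then publish them."""
    return {
        "Comment": "Send the patch list of one instance to an SNS topic",
        "StartAt": "DescribeInstancePatches",
        "States": {
            "DescribeInstancePatches": {
                "Type": "Task",
                # AWS SDK service integration for Systems Manager
                "Resource": "arn:aws:states:::aws-sdk:ssm:describeInstancePatches",
                "Parameters": {"InstanceId": instance_id},
                "Next": "PublishPatchList",
            },
            "PublishPatchList": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {
                    "TopicArn": topic_arn,
                    # Serialize the Patches array of the previous result
                    "Message.$": "States.JsonToString($.Patches)",
                },
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    definition = build_patch_report_definition(INSTANCE_ID, SNS_TOPIC_ARN)
    print(json.dumps(definition, indent=2))
```

The printed JSON could be stored as a template file next to the Terraform code, with the two placeholders filled in when the state machine resource is created.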
The more complex process
Following the same principles, it is possible to create more complex workflows. The given example shows a workflow that restarts all servers of the web app tier, which sit behind an AWS Application Load Balancer, in a rolling manner. No application downtime is required because only a certain number of servers is restarted at once.
In the first step, the alarm actions of some CloudWatch alarms that should not fire during the restart process, as well as some Amazon EventBridge rules, are disabled. This is done with a Lambda function because the logic to filter these resources needs some custom code.
A property of the AWS Step Functions Map state, the maximum concurrency control (MaxConcurrency), restricts the number of instances processed at the same time. Each instance is deregistered from the ALB target group, rebooted, and finally checked to confirm that the application has launched successfully before it is brought back into the target group.
Rebooting only a limited number of instances at a time ensures that the application stays online and that enough servers are always available to handle user traffic without a significant impact on the user experience.
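The concurrency limit could be expressed roughly as follows. This is a sketch under assumptions: the input field $.Instances, the function name and the placeholder inner workflow are invented for illustration, and the inner workflow is attached under Iterator (newer definitions use ItemProcessor for the same purpose).

```python
import json


def build_rolling_restart_map(inner_workflow: dict, max_concurrency: int = 1) -> dict:
    """Wrap an inner workflow in a Map state that limits parallel iterations."""
    if max_concurrency < 0:
        raise ValueError("MaxConcurrency must be 0 (no limit) or a positive integer")
    return {
        "Type": "Map",
        # Assumed input shape: {"Instances": [{"InstanceId": "i-..."}, ...]}
        "ItemsPath": "$.Instances",
        # At most this many instances are restarted at the same time
        "MaxConcurrency": max_concurrency,
        "Iterator": inner_workflow,
        "ResultPath": None,  # discard iteration results, keep the input
        "End": True,
    }


# Placeholder for the real per-instance steps (deregister, reboot, check, register)
PLACEHOLDER_INNER_WORKFLOW = {
    "StartAt": "RestartOneInstance",
    "States": {"RestartOneInstance": {"Type": "Pass", "End": True}},
}

rolling_restart = build_rolling_restart_map(PLACEHOLDER_INNER_WORKFLOW, max_concurrency=1)
print(json.dumps(rolling_restart, indent=2))
```

With max_concurrency=1 the servers are restarted strictly one after another; a larger value trades restart time against remaining capacity behind the load balancer.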
The new AWS SDK service integrations help again to model the workflow, because a sequence of steps must be followed to reboot a running instance successfully. A server has to be deregistered from and later re-registered with the target group (using, among others, the elasticloadbalancingv2:registerTargets SDK action), and it has to be rebooted (ec2:rebootInstances).
After a certain wait period, a Lambda function performs an application startup check to make sure that everything is working correctly; a Lambda function is used because the check again requires some custom logic. Only a healthy, working server should be put back into the ALB target group.
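The per-instance sequence described above might be sketched like this. The state names, the placeholders ${target_group_arn} and ${startup_check_function_arn}, the input field $.InstanceId and the default wait time are assumptions; only the SDK actions registerTargets and rebootInstances are named in the text, and deregisterTargets is the assumed counterpart for the first step.

```python
import json

# Hypothetical deploy-time placeholders
TARGET_GROUP_ARN = "${target_group_arn}"
STARTUP_CHECK_FUNCTION = "${startup_check_function_arn}"


def sdk_task(service_action: str, parameters: dict, next_state=None) -> dict:
    """Build a Task state for an AWS SDK service integration."""
    state = {
        "Type": "Task",
        "Resource": f"arn:aws:states:::aws-sdk:{service_action}",
        "Parameters": parameters,
        "ResultPath": None,  # keep the original input (the instance ID)
    }
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


def build_restart_workflow(wait_seconds: int = 180) -> dict:
    """Return the inner workflow that restarts a single instance."""
    target = {
        "TargetGroupArn": TARGET_GROUP_ARN,
        "Targets": [{"Id.$": "$.InstanceId"}],
    }
    reboot = {"InstanceIds.$": "States.Array($.InstanceId)"}
    return {
        "StartAt": "DeregisterTarget",
        "States": {
            "DeregisterTarget": sdk_task(
                "elasticloadbalancingv2:deregisterTargets", target, "RebootInstance"),
            "RebootInstance": sdk_task("ec2:rebootInstances", reboot, "WaitForStartup"),
            "WaitForStartup": {"Type": "Wait", "Seconds": wait_seconds, "Next": "CheckStartup"},
            "CheckStartup": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": STARTUP_CHECK_FUNCTION, "Payload.$": "$"},
                "ResultPath": None,
                "Next": "RegisterTarget",
            },
            "RegisterTarget": sdk_task("elasticloadbalancingv2:registerTargets", target),
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_restart_workflow(), indent=2))
```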
The application requires some minutes until it is ready to serve, and the startup time varies depending on factors like external database connections. The Wait state helps in this case to pause the workflow for a certain time. Nevertheless, the subsequent startup check can still fail because the application is not yet ready, in which case another wait period is required.
A built-in "for-loop" feature for Step Functions would be quite helpful in this case to re-run the last two steps (wait + startup check). It is possible to model this construct using a Choice state which checks the result returned from the startup check Lambda function and acts upon it (i.e., goes back to the Wait state if the application is not ready yet).
However, this feels somewhat clumsy and more like a workaround. Additionally, a break condition (e.g., a maximum number of checks) is required, which introduces state that must be passed around or stored somewhere.
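To make the workaround concrete, here is a rough sketch of such a loop. It assumes a hypothetical startup check Lambda function that returns a payload like {"InstanceId": ..., "Ready": true/false, "Attempts": n} and increments Attempts itself on every call; the state names and the placeholder ${startup_check_function_arn} are invented as well.

```python
import json


def build_choice_loop(max_checks: int = 3, wait_seconds: int = 120) -> dict:
    """Return a Wait -> Check -> Choice loop with a break condition."""
    return {
        "StartAt": "WaitForStartup",
        "States": {
            "WaitForStartup": {"Type": "Wait", "Seconds": wait_seconds, "Next": "CheckStartup"},
            "CheckStartup": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {
                    "FunctionName": "${startup_check_function_arn}",
                    "Payload.$": "$",
                },
                # Pass only the function's return value on to the Choice state
                "OutputPath": "$.Payload",
                "Next": "IsApplicationReady",
            },
            "IsApplicationReady": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.Ready", "BooleanEquals": True, "Next": "ApplicationReady"},
                    # Break condition: the counter travels through the state data
                    {"Variable": "$.Attempts", "NumericGreaterThanEquals": max_checks,
                     "Next": "StartupFailed"},
                ],
                "Default": "WaitForStartup",  # not ready yet: loop back
            },
            "ApplicationReady": {"Type": "Succeed"},
            "StartupFailed": {
                "Type": "Fail",
                "Error": "StartupCheckFailed",
                "Cause": "Application not ready after the maximum number of checks",
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_choice_loop(), indent=2))
```

Three extra states and a counter that the function has to maintain are needed just to express "try again a few times", which is what makes this construct feel heavy.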
Custom Retry and Error Handling for Lambda functions, another cool feature of Step Functions, comes to our rescue. Custom errors which are thrown from a Lambda function can be handled. Depending on the use case, a Catcher or a Retrier for this custom error class might be defined to deal with this situation. The latter is used to simulate a "for-loop" without relying on the Choice state workaround.
The Lambda function raises a custom InstanceNotYetStartedException if the health check fails. This exception is handled by a specific Retrier which defines a longer wait interval (120 seconds) to give the application some additional time before the next check. The procedure is repeated up to three times; after that, it is assumed that something went wrong that must be handled differently (processing moves on to a States.ALL Catcher which calls an SNS integration step to publish an alarm).
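A sketch of both halves, the Lambda handler and the matching Retrier and Catcher, could look like this. The exception name, the 120 second interval and the three attempts come from the text; the input field HealthCheckUrl, the HTTP probe, the optional probe argument, the state names and the placeholder ${startup_check_function_arn} are assumptions for illustration.

```python
import urllib.request


class InstanceNotYetStartedException(Exception):
    """Raised while the application is still starting up."""


def is_application_ready(url: str, timeout: float = 5.0) -> bool:
    """Assumed health check: the application answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers connection errors, HTTP errors and timeouts
        return False


def lambda_handler(event, context, probe=None):
    # `probe` is a hypothetical hook for tests; Lambda itself only
    # passes `event` and `context`.
    check = probe or is_application_ready
    url = event["HealthCheckUrl"]  # assumed input field
    if not check(url):
        # Step Functions sees the class name as the error name
        raise InstanceNotYetStartedException(f"{url} is not ready yet")
    return {"InstanceId": event.get("InstanceId"), "Ready": True}


# Task state that calls the function and retries on the custom error
CHECK_STARTUP_STATE = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {"FunctionName": "${startup_check_function_arn}", "Payload.$": "$"},
    "Retry": [{
        "ErrorEquals": ["InstanceNotYetStartedException"],
        "IntervalSeconds": 120,  # extra time before the next check
        "MaxAttempts": 3,
        "BackoffRate": 1.0,      # keep the interval constant
    }],
    "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.Error",
        "Next": "PublishAlarm",  # hypothetical SNS publish state
    }],
    "ResultPath": None,
    "Next": "RegisterTarget",
}
```

The Retrier replaces the Wait, Choice and counter states of the loop workaround: the retry count is tracked by Step Functions itself, so nothing has to be passed through the state data.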
As a last note on this workflow: the Map state fails as soon as one of its iterations has failed. All running inner executions are aborted and all waiting ones are cancelled. Care should be taken for this scenario: adding a dead letter queue to the inner Map state workflow would be one option, defining a States.ALL Catcher at the Map state level another, or even failing the complete state machine execution on purpose. The best error handling method depends on the workflow requirements. The global Catcher is used in the presented case because some additional steps (putting the deactivated CloudWatch alarms back in place) must be executed in all cases.
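A global Catcher of this kind might be attached as sketched here. The helper name, the cleanup state name EnableAlarms and the result path $.Error are assumptions; the point is only that the success path and the error path both lead to the same cleanup step.

```python
import copy


def add_global_catcher(map_state: dict, cleanup_state: str) -> dict:
    """Return a copy of a Map state whose success and failure paths
    both continue with the given cleanup state."""
    guarded = copy.deepcopy(map_state)
    guarded.pop("End", None)
    guarded["Next"] = cleanup_state
    guarded["Catch"] = [{
        "ErrorEquals": ["States.ALL"],
        # Keep the error details so a later step can report them
        "ResultPath": "$.Error",
        "Next": cleanup_state,
    }]
    return guarded


# Minimal Map state used only to demonstrate the helper
EXAMPLE_MAP_STATE = {
    "Type": "Map",
    "ItemsPath": "$.Instances",
    "MaxConcurrency": 1,
    "Iterator": {
        "StartAt": "RestartOneInstance",
        "States": {"RestartOneInstance": {"Type": "Pass", "End": True}},
    },
    "End": True,
}

guarded_map = add_global_catcher(EXAMPLE_MAP_STATE, "EnableAlarms")
```

A state after the cleanup step could inspect $.Error to decide whether the execution should still be marked as failed.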
When not to use
Like every other AWS service, Step Functions has some limits which might prevent one from using it in some rare cases or which require a workaround. Furthermore, some properties of external APIs might not fit well with Step Functions. Two examples are briefly discussed:
Maximum input/output size for a task is 256 KB: AWS API calls might return a lot of JSON data, but there are various mechanisms like the filters parameter and pagination support in place to narrow down the scope of a request. Additionally, Step Functions provides output processing functions to extract the data of interest so that this limitation should not be a blocker for most use cases.
How to deal with API calls supporting pagination: many AWS API endpoints return a maximum number of items and an additional NextToken value which can be used to retrieve the next batch with a following call. The clumsy Choice-state construct mentioned above could be used to handle this, but this is not practical. A Lambda function is much better suited to this situation when a lot of data must be retrieved.
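As an illustration of both mechanisms, the following sketch narrows a describeInstancePatches call with a filter and keeps only the patch titles via ResultSelector. The filter value "Missing", the MaxResults value and the output field names are assumptions and should be checked against the Systems Manager API documentation; the small Python function only mimics locally what the ResultSelector path would extract.

```python
# Task state that requests less data and keeps only the fields of interest
DESCRIBE_MISSING_PATCHES = {
    "Type": "Task",
    "Resource": "arn:aws:states:::aws-sdk:ssm:describeInstancePatches",
    "Parameters": {
        "InstanceId.$": "$.InstanceId",
        # Request-side narrowing (assumed filter value)
        "Filters": [{"Key": "State", "Values": ["Missing"]}],
        "MaxResults": 50,
    },
    # Response-side narrowing: keep only the titles of the returned patches
    "ResultSelector": {"PatchTitles.$": "$.Patches[*].Title"},
    "ResultPath": "$.PatchReport",
    "End": True,
}


def select_patch_titles(api_response: dict) -> dict:
    """Local illustration of what the ResultSelector above extracts."""
    return {"PatchTitles": [patch["Title"] for patch in api_response.get("Patches", [])]}


sample_response = {
    "Patches": [
        {"Title": "Security Update A", "KBId": "KB0000001", "State": "Missing"},
        {"Title": "Security Update B", "KBId": "KB0000002", "State": "Missing"},
    ]
}
print(select_patch_titles(sample_response))
```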
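Inside a Lambda function, the NextToken handling reduces to a short loop. The helper below is a generic sketch, not part of any AWS library: fetch_page stands for any API call that accepts an optional NextToken argument and returns a dictionary, and result_key names the list field in the response.

```python
from typing import Any, Callable, Dict, List


def collect_all_pages(fetch_page: Callable[..., Dict[str, Any]],
                      result_key: str, **request: Any) -> List[Any]:
    """Call a paginated API until no NextToken is returned and
    return the items of all pages as one list."""
    items: List[Any] = []
    next_token = None
    while True:
        kwargs = dict(request)
        if next_token:
            kwargs["NextToken"] = next_token
        page = fetch_page(**kwargs)
        items.extend(page.get(result_key, []))
        next_token = page.get("NextToken")
        if not next_token:
            return items


# Possible use inside a Lambda function (requires boto3, not executed here):
#   ssm = boto3.client("ssm")
#   patches = collect_all_pages(ssm.describe_instance_patches, "Patches",
#                               InstanceId="i-0123456789abcdef0")
```

boto3 also ships its own paginators for many operations, which would be the more idiomatic choice where one is available; the loop above mainly shows why this logic is easier to express in code than in a state machine.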
This blog post presents use cases for Step Functions which might not be the most common ones out there but proved to be extremely useful. The new SDK integrations have opened a wide field of possibilities to model workflows visually without writing a lot of custom code (even though Lambda is always there if something cannot be solved by built-in mechanisms).
The Step Functions Workflow Studio makes it possible to design and build workflows, from simple to quite complex ones, in an intuitive and rapid way. The finished workflow can be exported as code (is JSON code?) so that a developer's heart does not need to cry, and it can be integrated into an infrastructure-as-code framework.
Some additional features like more intrinsic functions (e.g., string processing) to deal with the sometimes very large JSON results of AWS SDK calls would make working with Step Functions even easier (big point for #awswishlist).