AWS Step Functions - Refactor to native integrations with AWS API

#aws #stepfunctions #lowcode #lambda

In this article, I will describe how I got rid of two Lambda functions in my Data Lake loading process.

Thanks to a change announced in October 2021, Step Functions received native API integrations with over 200 AWS services. Yep, I call them native API integrations because, for me, it's a cloud-native low-code solution. AWS refers to them as AWS SDK service integrations, which is also fine 😉

Anyhow, I must admit that I had been waiting for this particular update since the summer.

What has changed in Step Functions?

A prelude to this update has been available for a long time: the newer way of calling Lambda functions from Step Functions. The difference is visible only in the state machine definition. In the old configuration, we defined a Lambda call like this:

StateName:
  Type: Task
  Resource: !GetAtt functionName.Arn

and in the new way like this:

StateName:
  Type: Task
  Resource: arn:aws:states:::lambda:invoke
  Parameters:
    FunctionName: !GetAtt functionName.Arn
    Payload.$: $

At first glance, you can notice the differences. Instead of our own resource, the Resource field now holds a universal client, and we provide the function name as a parameter.

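For us users, this change had little practical significance: both excerpts invoke the same function. The main visible difference is that the new form wraps the function's result in a Payload field alongside response metadata. However, I'm eager to speculate that it was a milestone on the way to native integration of Step Functions with other AWS services via their APIs.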
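One practical difference is worth a sketch. With the newer form, the function's return value arrives nested under Payload, next to response metadata. A minimal way to get back the old output shape, assuming the following states expect the function's raw result, is to add OutputPath (the End: true line is only there to make the fragment a complete state):

```yaml
StateName:
  Type: Task
  Resource: arn:aws:states:::lambda:invoke
  Parameters:
    FunctionName: !GetAtt functionName.Arn
    Payload.$: $
  # Keep only the function's return value and drop the response metadata
  # that the lambda:invoke integration adds around it.
  OutputPath: $.Payload
  End: true
```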

From now on, we have access to over 200 integrations, all called in a similar way:

StateName:
  Type: Task
  Resource: arn:aws:states:::aws-sdk:<service>:<action>
  Parameters:
    ApiParameter1: value
    ApiParameter2: value

AWS SDK - feels like home 😃

In this way, you can copy a file in S3 (arn:aws:states:::aws-sdk:s3:copyObject), start an EC2 instance (arn:aws:states:::aws-sdk:ec2:startInstances), and so on. The possibilities are endless.
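To make this more concrete, here is a rough sketch of what the S3 copy could look like. Bucket, Key and CopySource are the parameter names of the S3 CopyObject API; the state name and the input fields (sourceBucket, sourceKey, targetBucket, targetKey) are made up for this illustration. Keep in mind that the state machine's execution role needs IAM permissions for the call.

```yaml
CopyFile:
  Type: Task
  Resource: arn:aws:states:::aws-sdk:s3:copyObject
  Parameters:
    # Destination of the copy, read from the (hypothetical) state input.
    Bucket.$: $.targetBucket
    Key.$: $.targetKey
    # CopySource has the form "source-bucket/source-key";
    # the States.Format intrinsic function builds it from two input fields.
    CopySource.$: "States.Format('{}/{}', $.sourceBucket, $.sourceKey)"
  End: true
```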

But how much does it cost?

And here is the biggest deal: it costs nothing!

I repeat: it's free 😃

I'm talking about the SDK API call itself (provided the underlying API is free to call, and most of them are). For the state machine and the operation of the invoked service, of course, we pay as before.

If it costs nothing, then it is worth replacing Lambda functions that merely invoke other services with native integrations. Thanks to that, we save on Lambda costs, but also on implementation (developer's time).
Because, as always: no code is the best code.

And how much does it actually take to refactor?

The aforementioned savings in developer time, I must admit, were not so obvious. Refactoring two functions to native integrations took me 3 hours. That is a lot of time, but we all know how that goes.

That's how it looked in my case because the AWS API did not respond with exactly what I needed at the moment. The variable names returned by the AWS API have to be remapped, which is difficult when there is no Lambda function at your disposal. And I didn't read the documentation 🤣

Refactoring

In my case, I changed the implementation on the left (Lambda functions) to the one on the right (native integrations). As a reminder, the entire process loads data into a Data Lake, hence the references to the Glue service. In your case, it may be any other AWS service.

The entire state machine is invoked with a payload that contains the Glue Crawler name. The former Lambda function expected a variable named crawlerName, while the StartCrawler API method requires Name.

That was not so difficult to solve:

StartCrawler:
  Type: Task
  Resource: arn:aws:states:::aws-sdk:glue:startCrawler
  Parameters:
    Name.$: $.crawlerName

However, a big problem for me was handling the output of this step: startCrawler does not return any data, and by default Step Functions passes a task's result, not its input, on to the next state. Two steps further, GetCrawler needs the Crawler name to check a condition (Has Crawler finished running?).

I solved the problem by reaching into the state machine's global Context object ($$) and using the ResultSelector filter to build my own object at the state exit.

StartCrawler:
  Type: Task
  Resource: arn:aws:states:::aws-sdk:glue:startCrawler
  Parameters:
    Name.$: $.crawlerName
  ResultSelector:
    crawlerName.$: $$.Execution.Input.crawlerName

This is just an illustrative example; it does not include error handling or mapping errors to the exit state.
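For completeness, error handling could be sketched roughly as below. This is not what I run in production, only an outline under a few assumptions: the state names WaitForCrawler and CrawlerFailed are hypothetical, and the retry numbers are arbitrary. States.TaskFailed and States.ALL are the generic error names built into Step Functions.

```yaml
StartCrawler:
  Type: Task
  Resource: arn:aws:states:::aws-sdk:glue:startCrawler
  Parameters:
    Name.$: $.crawlerName
  ResultSelector:
    crawlerName.$: $$.Execution.Input.crawlerName
  Retry:
    # Retry a failed call a few times with exponential backoff.
    - ErrorEquals:
        - States.TaskFailed
      IntervalSeconds: 5
      MaxAttempts: 3
      BackoffRate: 2
  Catch:
    # Whatever still fails goes to a (hypothetical) failure state.
    # ResultPath keeps the original input and adds the error under "error".
    - ErrorEquals:
        - States.ALL
      ResultPath: $.error
      Next: CrawlerFailed
  Next: WaitForCrawler
CrawlerFailed:
  Type: Fail
  Error: CrawlerStartFailed
  Cause: The Glue crawler could not be started
```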

That mapping gave me exactly what I required: an exit state with a crawlerName variable and the proper value.
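A possible alternative, which I did not use here, is to tell Step Functions to throw away the empty API result and forward the state's input instead. Setting ResultPath to null does exactly that. The sketch assumes that the state's input still contains crawlerName, whereas $$.Execution.Input always reads the original input of the whole execution.

```yaml
StartCrawler:
  Type: Task
  Resource: arn:aws:states:::aws-sdk:glue:startCrawler
  Parameters:
    Name.$: $.crawlerName
  # null discards the task result and passes the state's input through,
  # so crawlerName stays available to the following states.
  ResultPath: null
```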

The next step was to get the current state of the Crawler by calling the GetCrawler method in a loop that runs until the Crawler is done. I used a similar solution here as before. The value of state is used by the condition (Has Crawler finished running?) in the loop, and crawlerName is used as an input parameter for the next GetCrawler call. Crawler.State comes from the result returned by the getCrawler API.

ResultSelector:
  crawlerName.$: $$.Execution.Input.crawlerName
  state.$: $.Crawler.State
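Put together, the polling loop could look roughly like the sketch below. The state names, the 30-second wait and the final Succeed state are assumptions made for this illustration. READY is the state a Glue Crawler reports when it is not running (the other values are RUNNING and STOPPING).

```yaml
WaitForCrawler:
  Type: Wait
  # A Wait state passes its input through unchanged.
  Seconds: 30
  Next: GetCrawler
GetCrawler:
  Type: Task
  Resource: arn:aws:states:::aws-sdk:glue:getCrawler
  Parameters:
    Name.$: $.crawlerName
  ResultSelector:
    crawlerName.$: $$.Execution.Input.crawlerName
    state.$: $.Crawler.State
  Next: HasCrawlerFinished
HasCrawlerFinished:
  Type: Choice
  Choices:
    # The crawler is done once it is back in the READY state.
    - Variable: $.state
      StringEquals: READY
      Next: CrawlerDone
  # Otherwise wait and ask again.
  Default: WaitForCrawler
CrawlerDone:
  Type: Succeed
```

Because the Choice and Wait states forward their input untouched, crawlerName survives every round of the loop.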

Was it worth changing?

Of course, it was!

Firstly, no code is the best code. Here, we replace a custom Lambda with a declarative definition of states. This means that anyone who knows Step Functions will understand how it works, without deep-diving into our beautiful Lambda code.

Secondly, practice makes perfect. ~~Juggling~~ Mapping values on the output is only difficult at the beginning. Once you figure out how this works, you get more efficient at writing the mappings. I hope this article helps you avoid many of the problems I encountered 🙂

Thirdly, native integrations are so versatile that from now on you will use them in practically every state machine. This means that the investment in mastering them will definitely pay off in future projects.

Fourthly, the declarative code of the states is easier to reuse (copy-paste) than the Lambda code.