AWS StepFunction Distributed Map

#watercooler

The last day of re:Invent 2022 still had some nice talks about the newly released features. This one covered an extension of the Map step that StepFunctions already has.

Why

The existing Map step has some limitations: input/output size (256KB), the length of the execution history (25000 events) and parallelism (max 40). It is good enough to execute smaller tasks in parallel. If you want larger tasks you can of course use S3 as storage and implement the loading and saving of data yourself in the Lambda function(s), but since this is quite a common pattern AWS decided to extend the functionality of the Map step.

What

When using the new distributed processing mode you can specify an S3 source (either one file, objects with a prefix, or the whole bucket). The internals of the Map step will then be treated as a separate workflow.

At the start of the execution of the parent StepFunction the information from S3 is gathered using ListObjectsV2 and then, using the parallelism and batching you specify, the child StepFunction is called. All attributes from ListObjectsV2 will be passed into the child StepFunction.
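To make the hand-over to the child a bit more concrete, here is a minimal sketch of a Lambda handler that receives one batch. It assumes batching is enabled, that the batch arrives under an "Items" key, and that each item carries the "Key" and "Size" attributes from ListObjectsV2. The sample event is made up for illustration.

```python
def handler(event, context):
    # Assumption: with batching enabled, the batch is delivered under "Items".
    items = event.get("Items", [])
    # Each item is assumed to mirror a ListObjectsV2 entry (Key, Size, ...).
    keys = [item["Key"] for item in items]
    total_bytes = sum(item.get("Size", 0) for item in items)
    return {"processed": len(keys), "total_bytes": total_bytes, "keys": keys}


# Hypothetical batch of two objects, only for trying the handler locally.
sample_event = {
    "Items": [
        {"Key": "data/a.json", "Size": 120},
        {"Key": "data/b.json", "Size": 80},
    ]
}
result = handler(sample_event, None)
```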

The maximum parallelism is increased to 10000 parallel executions of the child StepFunction. The input and output limitations are no longer there. And because the inner StepFunction has its own execution history, the history limitation is mitigated as well.
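As a rough sketch of what such a state could look like, the definition below is written as a Python dict and serialized to JSON. The bucket name, prefix, state name and function name are placeholders I made up, and the batch size and concurrency are just example values. Check the field names against the Amazon States Language documentation before relying on them.

```python
import json

# Sketch of a Map state in distributed mode; all names are hypothetical.
distributed_map_state = {
    "Type": "Map",
    "ItemReader": {
        # List the objects under a prefix as input items.
        "Resource": "arn:aws:states:::s3:listObjectsV2",
        "Parameters": {"Bucket": "my-input-bucket", "Prefix": "incoming/"},
    },
    "ItemBatcher": {"MaxItemsPerBatch": 250},  # example batch size
    "MaxConcurrency": 3000,  # example parallelism
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
        "StartAt": "ProcessBatch",
        "States": {
            "ProcessBatch": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": "process-batch", "Payload.$": "$"},
                "End": True,
            }
        },
    },
    "End": True,
}

asl_json = json.dumps(distributed_map_state, indent=2)
```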

Other sources than S3

Other sources are currently not supported. In order to use other sources you have to set up some steps in the parent StepFunction to prepare some S3 objects to use as input.

Features

Because the number of executions can be very large, you can specify an error threshold. If the number of errors is lower than the threshold, the Map step will still succeed.
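The idea behind the threshold can be modelled in a few lines. This is a simplified sketch of the percentage check, not the actual service logic. The function name is mine, and how the boundary case (exactly on the threshold) is treated is an assumption. In the state definition the threshold is set with the ToleratedFailurePercentage or ToleratedFailureCount field.

```python
def map_run_succeeds(failed: int, total: int, tolerated_percentage: int) -> bool:
    """Simplified model: the run succeeds while failures stay within the threshold."""
    if total == 0:
        return True
    # Integer comparison avoids floating point rounding on the percentage.
    # Assumption: hitting the threshold exactly still counts as success.
    return failed * 100 <= tolerated_percentage * total


within = map_run_succeeds(failed=40, total=1000, tolerated_percentage=5)   # 4% failed
beyond = map_run_succeeds(failed=60, total=1000, tolerated_percentage=5)   # 6% failed
```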

In the Map step you can also specify an export location in S3. The results of running the Map step will be reported there (manifest.json, Succeeded.json, Failed.json, Pending.json). You can use a step after the Map step in the parent StepFunction to process these output files to the result you require.
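A follow-up step could start from the manifest and collect the result files per status. The sketch below assumes the manifest has a "ResultFiles" object keyed by status, where each entry has a "Key". That layout and the sample file names are assumptions to verify against a real export. The helper name is mine.

```python
def result_keys(manifest: dict, status: str) -> list:
    """Return the S3 keys of the result files for one status, e.g. "FAILED"."""
    # Assumption: manifest["ResultFiles"][status] is a list of {"Key": ..., "Size": ...}.
    entries = manifest.get("ResultFiles", {}).get(status, [])
    return [entry["Key"] for entry in entries]


# Hypothetical manifest content, only to show the shape this sketch expects.
sample_manifest = {
    "DestinationBucket": "my-output-bucket",
    "ResultFiles": {
        "SUCCEEDED": [{"Key": "results/SUCCEEDED_0.json", "Size": 2048}],
        "FAILED": [],
        "PENDING": [],
    },
}

succeeded = result_keys(sample_manifest, "SUCCEEDED")
failed = result_keys(sample_manifest, "FAILED")
```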

In the console you can follow the progress of your distributed map step.

Use cases

When doing map-reduce-like operations on large sets of data this can be a very nice architecture. No need for Spark or Hadoop instances, just a completely serverless solution.

To get a grasp of the size of the workload you can process using this functionality: in the demo AWS presented a case where over 500000 S3 objects were processed in batches of 250 objects with a parallelism of 3000 parallel Lambda executions, where each Lambda execution would take 16 seconds to process. The total execution time was under 2:30 minutes.
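A quick back-of-the-envelope calculation with the demo numbers shows why this is so fast. It assumes the 16 seconds is per batch (one Lambda execution) and ignores listing, ramp-up and orchestration overhead, which presumably accounts for the difference between the ideal figure and the measured total.

```python
import math

total_objects = 500_000
batch_size = 250
max_concurrency = 3000
seconds_per_batch = 16  # assumption: 16 seconds per Lambda execution, i.e. per batch

batches = math.ceil(total_objects / batch_size)      # number of child executions
waves = math.ceil(batches / max_concurrency)         # rounds needed at full parallelism
ideal_seconds = waves * seconds_per_batch            # lower bound, no overhead
sequential_hours = batches * seconds_per_batch / 3600  # same work done one batch at a time

print(batches, waves, ideal_seconds, round(sequential_hours, 1))
```

So all batches fit in a single wave, and the same work done sequentially would take hours instead of minutes.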

Kinesis data streamed to S3 has a directory structure that is very suitable for processing in this way. If the S3 objects are too large to handle in the Lambda function you can use, for example, the AWS Lambda Powertools to have the Lambda process consecutive parts of the objects.
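To illustrate what processing consecutive parts means, here is a small sketch in plain Python that splits an object of a given size into inclusive byte ranges. It is not the Powertools API. The function name and chunk size are made up, and each range could be used in a ranged read such as "bytes=start-end".

```python
def byte_ranges(size: int, chunk_size: int):
    """Yield inclusive (start, end) byte offsets covering an object of `size` bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    start = 0
    while start < size:
        end = min(start + chunk_size, size) - 1  # inclusive end offset
        yield start, end
        start = end + 1


# Example: a 10-byte object read in chunks of 4 bytes.
ranges = list(byte_ranges(10, 4))
headers = [f"bytes={start}-{end}" for start, end in ranges]
```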

Picture is from a hike we took on the day of arrival in Las Vegas. The location is called Valley of Fire.