Frédéric Barthelet for Serverless By Theodo

Posted on Dec 1, 2022

9 Surprises using AWS EventBridge Scheduler

#serverless #scheduler #aws #task

AWS released its new EventBridge Scheduler service, dedicated to planning tasks in your application. The service is available in all regions through the SDK, the CDK, the CLI and the web management console.

Since the service was released, I've been thoroughly migrating existing workloads that relied on homemade scheduling features built on DynamoDB TTL or other contraptions. This article serves as a discovery report, describing the good, the bad and my general recommendations for using this new serverless managed scheduling service, in an effort to save others the trouble of working out its use cases and limits.

What is the Scheduler

The Scheduler is an AWS managed service dedicated to scheduling one-time or recurring triggers that target AWS service actions.

There are 3 types of schedules:

One-time schedules - December 21st at 7AM UTC
Rate-based schedules, allowing recurring tasks using frequency rate - every 2 hours
Cron-based schedules, allowing recurring tasks using a cron expression - every Friday at 4PM
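The three schedule types above translate into three schedule expression formats. Here is a minimal sketch that builds one expression of each kind for the examples in the list; the helper names are mine and purely illustrative, and times are assumed to be expressed in UTC.

```python
from datetime import datetime


def at_expression(when: datetime) -> str:
    """One-time schedule: at(yyyy-mm-ddThh:mm:ss), without any offset suffix."""
    return f"at({when.strftime('%Y-%m-%dT%H:%M:%S')})"


def rate_expression(value: int, unit: str) -> str:
    """Rate-based schedule: rate(value unit), e.g. rate(2 hours)."""
    if unit not in ("minutes", "hours", "days"):
        raise ValueError(f"unsupported unit: {unit}")
    return f"rate({value} {unit})"


# Cron-based schedule: cron(minutes hours day-of-month month day-of-week year)
EVERY_FRIDAY_4PM = "cron(0 16 ? * FRI *)"

print(at_expression(datetime(2022, 12, 21, 7, 0, 0)))  # December 21st at 7AM
print(rate_expression(2, "hours"))                     # every 2 hours
print(EVERY_FRIDAY_4PM)                                # every Friday at 4PM
```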

Schedules can be grouped in Schedule Groups. Both Schedules and Schedule Groups can be provisioned using a CRUD API on the service.

Rate-based and cron-based schedules can be restricted to a specific timeframe using a start date and an end date. One-time and cron-based schedules are sensitive to an optional timezone parameter, in which the schedule expression is evaluated.

Each schedule can trigger one target. There are 3 types of targets:

Templated Isomorphic targets
Templated Service-specific targets
Universal targets

Unlike Universal targets that require the Scheduler to bootstrap an execution environment to use the AWS SDK, both Templated Isomorphic and Service-specific targets are most likely leveraging EventBridge API Destination features to interact with the destination AWS services using their HTTP interface.

Templated Isomorphic targets

You can trigger the following services APIs using the same generic target definition at the creation of your schedule:

The target definition only requires an arn (like the one of the Lambda function to be invoked or the Step Functions state machine to be started). It allows the use of an optional input field.
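To make this concrete, here is a rough sketch of a CreateSchedule request for a Lambda target, built as a plain dictionary. The function name, ARNs and payload are hypothetical; the target also carries the execution role discussed in the authorization section further down.

```python
import json


def lambda_schedule(name: str, expression: str, function_arn: str,
                    role_arn: str, payload: dict) -> dict:
    """Build a CreateSchedule request for a templated Lambda target."""
    return {
        "Name": name,
        "ScheduleExpression": expression,
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {
            "Arn": function_arn,   # the Lambda function to invoke
            "RoleArn": role_arn,   # role assumed by the Scheduler
            "Input": json.dumps(payload),  # optional input field
        },
    }


request = lambda_schedule(
    name="send-reminder",  # hypothetical schedule name
    expression="at(2022-12-21T07:00:00)",
    function_arn="arn:aws:lambda:eu-west-1:123456789012:function:reminder",
    role_arn="arn:aws:iam::123456789012:role/reminder-schedule-role",
    payload={"userId": "42"},
)
print(json.dumps(request, indent=2))
```

With boto3, such a dictionary could be passed as keyword arguments to the Scheduler client's create_schedule call.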

Templated Service-specific targets

You can trigger the following services APIs using additional service-specific options at the creation of your schedule:

As with templated isomorphic targets, the target definition only requires an arn. In addition to the optional input field, one service-specific attribute named ${service}Parameters can be added to the schedule target definition to further configure the action. For example, you can provide an SqsParameters attribute to specify the MessageGroupId value for the SQS SendMessage target.
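The SQS example could look like the following sketch. The queue ARN, role ARN and group id are made up for illustration; only the SqsParameters and MessageGroupId names come from the service.

```python
def sqs_fifo_target(queue_arn: str, role_arn: str, body: str,
                    message_group_id: str) -> dict:
    """Build a templated SQS SendMessage target for a FIFO queue."""
    return {
        "Arn": queue_arn,
        "RoleArn": role_arn,
        "Input": body,  # ends up as the message body
        # Service-specific attribute, named ${service}Parameters
        "SqsParameters": {"MessageGroupId": message_group_id},
    }


target = sqs_fifo_target(
    queue_arn="arn:aws:sqs:eu-west-1:123456789012:reminders.fifo",
    role_arn="arn:aws:iam::123456789012:role/reminder-schedule-role",
    body='{"userId": "42"}',
    message_group_id="tenant-a",
)
print(target)
```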

Universal targets

Universal targets allow schedules to target any AWS service and any corresponding action using an abstracted compute environment with the SDK capabilities. Universal targets are defined using the magic string arn:aws:scheduler:::aws-sdk:${service}:${apiAction} as an arn. This feature is similar to the AWS Step Functions SDK service integrations released in September 2021.
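As an illustration, here is a small sketch that builds a universal target for SQS SendMessage. The helper names, queue URL and role ARN are assumptions of mine; the input is assumed to be the JSON-encoded request of the targeted API action.

```python
import json


def universal_target_arn(service: str, api_action: str) -> str:
    """Fill in the magic string: service in lowercase, action in camelCase."""
    return f"arn:aws:scheduler:::aws-sdk:{service}:{api_action}"


def universal_target(service: str, api_action: str, role_arn: str,
                     request: dict) -> dict:
    """Build a universal target whose input is the full API request."""
    return {
        "Arn": universal_target_arn(service, api_action),
        "RoleArn": role_arn,
        "Input": json.dumps(request),
    }


target = universal_target(
    "sqs",
    "sendMessage",
    "arn:aws:iam::123456789012:role/reminder-schedule-role",  # hypothetical
    {
        "QueueUrl": "https://sqs.eu-west-1.amazonaws.com/123456789012/reminders",
        "MessageBody": "hello",
    },
)
print(target["Arn"])
```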

The good surprises 🎉

The Scheduler is right on time!

It's not clearly written in the documentation, but Marcia Villalba's release post mentions a one-minute granularity. We can safely assume schedule precision is within a 60-second margin when the flexible time window is disabled. In practice, tests over 10,000 data points, using both Universal and Templated targets invoking the same Lambda function, show that both modes trigger within 50 seconds of the scheduled time.

In addition, the delta between scheduled time and invocation time is roughly uniformly distributed between 0 and 50 seconds.

Being right on time with a precision of a minute is a huge step forward compared to the 48-hour guarantee on DynamoDB TTL expiration. Of course, this DynamoDB garbage collection feature was never intended for precise scheduling, but measured results were encouraging and a lot of applications still relied on this mechanism. Lately, the delta has considerably increased and the Scheduler release felt like a blessing!

If you require a scheduling mechanism with a precision to the exact second, have a look at the CDK Scheduler.

Authorization has a per-Schedule granularity

The EventBridge Scheduler allows use of a different role for each Schedule, similar to EventBridge Rules current behavior. Granularity with minimal policy documents is therefore easily enforceable and misconfiguration can be avoided thanks to thorough permission management.

This Schedule specific role should include permissions for the targeted service and its corresponding action. This role should also be assumable by the Scheduler service.
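Here is a minimal sketch of the two policy documents such a role needs, assuming the Schedule targets a single Lambda function. The function ARN is hypothetical and lambda:InvokeFunction is only one example of a targeted action.

```python
import json

# Trust policy: lets the Scheduler service assume the role
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "scheduler.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def invoke_permissions(function_arn: str) -> dict:
    """Permissions policy scoped to the one action this Schedule triggers."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "lambda:InvokeFunction",
                "Resource": function_arn,
            }
        ],
    }


policy = invoke_permissions(
    "arn:aws:lambda:eu-west-1:123456789012:function:reminder"  # hypothetical
)
print(json.dumps(TRUST_POLICY, indent=2))
print(json.dumps(policy, indent=2))
```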

Since one-time Schedules will mostly be provisioned at runtime (for instance, to send a reminder email to a user 10 days after their first connection), please note that the role assumed by the compute unit in charge of creating the Schedule should:

allow schedule:CreateSchedule action
allow iam:PassRole action for the role to be used by the Schedule

Schedules access patterns are relevant

Schedules can be grouped in Schedule Groups (they are by default created in a group conveniently named default). Schedule ARNs are predictable, no technical IDs are involved in the management of Schedules and Schedule Groups. A Schedule ARN follows this syntax: arn:aws:scheduler:${region}:${accountId}:schedule/${scheduleGroupName}/${scheduleName}.
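Because the ARN is predictable, it can be computed without calling the service. A small sketch, with a hypothetical account id and schedule name:

```python
def schedule_arn(region: str, account_id: str, group_name: str,
                 schedule_name: str) -> str:
    """Build a Schedule ARN from its group name and schedule name."""
    return (
        f"arn:aws:scheduler:{region}:{account_id}"
        f":schedule/{group_name}/{schedule_name}"
    )


# A Schedule created without an explicit group lives in the "default" group
print(schedule_arn("eu-west-1", "123456789012", "default", "reminder-42"))
```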

Listing existing Schedules with the ListSchedules action allows multiple access patterns:

list all Schedules
list all Schedules in a specific Schedule Group
list all Schedules whose name has a specific prefix
list all Schedules in a specific Schedule Group and whose name has a specific prefix

This allows for clear business logic separation, as in multi-tenant applications, ensuring no collision occurs within code when handling Schedules dedicated to a single tenant. It strangely resembles a composite primary key access pattern on a DynamoDB table, where Schedule Group names act as separate partitions and Schedules as distinct items whose name is the sort key.
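The four access patterns boil down to two optional filters. The sketch below builds the request parameters for each of them; the helper name, the tenant group and the prefix are hypothetical, and GroupName and NamePrefix are the ListSchedules parameter names as I understand them.

```python
from typing import Optional


def list_schedules_params(group_name: Optional[str] = None,
                          name_prefix: Optional[str] = None) -> dict:
    """Build ListSchedules parameters for one of the four access patterns."""
    params = {}
    if group_name is not None:
        params["GroupName"] = group_name    # acts like a partition key
    if name_prefix is not None:
        params["NamePrefix"] = name_prefix  # acts like a sort key prefix
    return params


print(list_schedules_params())                          # all Schedules
print(list_schedules_params(group_name="tenant-a"))     # one Schedule Group
print(list_schedules_params(name_prefix="reminder-"))   # one name prefix
print(list_schedules_params("tenant-a", "reminder-"))   # group and prefix
```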

Get, Update and Delete actions, however, require an exact identifier - Name and GroupName (which is equivalent to providing the ARN).

Scheduler is protected against recursive calls

Universal targets leverage a dedicated compute unit to execute SDK actions for a specific Schedule. Not all SDK services and actions are included in this environment. For instance, all actions of the Scheduler are excluded, preventing unintended recursive calls that may break the bank! I did, however, have a lot of fun trying to create a recursive Schedule targeting arn:aws:scheduler:::aws-sdk:schedule:createSchedule with the same payload.

The not so good surprises 🤯

Schedules remain visible after their job is done

The Scheduler does not distinguish still-relevant and irrelevant Schedules. What I call irrelevant Schedules are:

one-time Schedules whose scheduled date has passed and whose target was successfully invoked
recurring Schedules whose end date has passed and whose target was successfully invoked on all occurrences
deactivated Schedules. Those are the only irrelevant Schedules that can be identified and filtered out of listing operations

Those Schedules are indeed irrelevant since there are no remaining tasks associated with them. Except for debugging purposes, they have no remaining impact on the overall application behavior.

Keeping those irrelevant Schedules around induces various problems.

Irrelevant Schedules count towards the per-region quota of 1 million Schedules. While this quota can be increased, any limit that applies to an application's entire history (that is, all Schedules ever created in the context of a specific application) is bound to become a critical problem at some point. Remember the disk space issues caused by endlessly writing application logs? We find ourselves in the exact same situation here.

In addition, no validation occurs at Schedule creation to ensure newly created ones are not already irrelevant at the time of creation.

Finally, there is no efficient way to list remaining relevant Schedules at any time on a given workload.

Templated Targets `input` field mapping is highly inconsistent

The optional input field that you can use on templated targets is highly inconsistent. I had to experiment quite a lot with schedules to be able to produce the following few mappings with API reference documentation:

EventBridge – PutEvents -> input will be mapped to the Entries[0].Detail field
Kinesis Data Firehose – PutRecord -> input will be mapped to the Record.Data field
Lambda – Invoke -> input will be mapped to the Payload field
Amazon SNS – Publish -> input will be mapped to the Message field
Amazon SQS – SendMessage -> input will be mapped to the MessageBody field
Step Functions – StartExecution -> input will be mapped to the input field
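To visualize these mappings, the sketch below expands an input value into the request fragment it ends up in for each templated target. The dictionary keys are labels I made up for this example, not identifiers used by the Scheduler.

```python
def expand_templated_input(target: str, input_value: str) -> dict:
    """Return the API request fragment the templated input is mapped to."""
    builders = {
        "events:PutEvents": lambda v: {"Entries": [{"Detail": v}]},
        "firehose:PutRecord": lambda v: {"Record": {"Data": v}},
        "lambda:Invoke": lambda v: {"Payload": v},
        "sns:Publish": lambda v: {"Message": v},
        "sqs:SendMessage": lambda v: {"MessageBody": v},
        "states:StartExecution": lambda v: {"input": v},
    }
    return builders[target](input_value)


print(expand_templated_input("events:PutEvents", '{"userId": "42"}'))
print(expand_templated_input("sqs:SendMessage", "hello"))
```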

The Scheduler supported action list is inconsistent

Actions relative to Schedules are referenced with schedule:${action}, while actions relative to Schedule Groups are referenced with scheduler:${action} in policy documents. It's a small detail, but it can be really troublesome the first time you write a policy document for the Scheduler. You can find the full list of actions in the Scheduler documentation.

Update action has a replace all strategy

Missing optional fields in an update statement are replaced with their default value. Updating a Schedule has unintended effects if any value that should remain unchanged is not provided in the payload.

Prefix attribute regex for the ListSchedules action does not match name attribute regex

Schedule name has the following regexp: ^[0-9a-zA-Z-_.]+$
Listing Schedules using a name prefix filter only accepts an argument following this regexp: ^[a-zA-Z][0-9a-zA-Z-_]*$

Long story short: you can only use the name prefix filter access pattern for Schedules whose name starts with an alphabetical character. I initially designed my one-time Schedule names to start with the ISO8601 representation of the scheduled date to circumvent the irrelevant Schedules issue. This proved to be a poor design choice, since all ISO8601 representations start with a number and therefore cannot be used as the prefix attribute in a ListSchedules operation.
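The mismatch is easy to reproduce locally with the two patterns quoted above (the hyphen is escaped here for readability, which should not change what they match). The schedule names are hypothetical, and colons are replaced with hyphens since the name pattern does not accept them. Putting a letter first is one possible workaround, not something the service prescribes.

```python
import re

NAME_PATTERN = re.compile(r"^[0-9a-zA-Z\-_.]+$")
PREFIX_PATTERN = re.compile(r"^[a-zA-Z][0-9a-zA-Z\-_]*$")

date_first_name = "2022-12-21T07-00-00_reminder"  # hypothetical naming scheme
letter_first_name = "reminder_2022-12-21T07-00-00"

# Both names are valid Schedule names...
print(bool(NAME_PATTERN.match(date_first_name)))
print(bool(NAME_PATTERN.match(letter_first_name)))

# ...but only a prefix starting with a letter can be used as a filter
print(bool(PREFIX_PATTERN.match("2022-12")))
print(bool(PREFIX_PATTERN.match("reminder_2022-12")))
```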

Should you use the AWS EventBridge Scheduler?

You're currently using CloudWatch Rules or EventBridge Rules

Rules using cron-based and rate-based schedules should be migrated to EventBridge Scheduler. You'll be able to reach more target types than with the existing EventBridge target catalog. You'll also be able to remove a few Lambda functions whose sole purpose was to use the SDK to reach a service not integrated with EventBridge. Finally, the Scheduler includes 14 million invocations in its free tier each month; your application may well stay under that, in which case it will remain free of charge after migrating to this new service.

You're currently using DynamoDB TTL

Using DynamoDB TTL to schedule one-time tasks can now be definitively deprecated. The pricing impact of removing DynamoDB and Lambda altogether from the infrastructure required to implement such a scheduling mechanism is worth it. Even if you're currently fine with the 48-hour window of DynamoDB TTL, you should rely on the Scheduler with the corresponding flexible time window parameter.

Key take-aways

Always use Universal Targets

Indeed, universal targets have quite a few advantages:

📚 Targets catalog All templated targets can be achieved with universal targets. Universal targets cover almost all AWS services and actions.
👨‍💻 Developer Experience You can safely rely on the targeted service action's documentation instead of hoping you're aiming for the correct field with the input shortcut provided by templated targets. Your schedule definitions will be consistent and self-explanatory since they won't rely on EventBridge Scheduler specific shorthand syntax.
⚙️ Configurability You can use all allowed configuration options for the action you want to trigger.
💶 Cost There are no additional charges for using universal targets. It costs the same while doing much more!
🏃 Schedule precision Contrary to what I initially assumed, universal targets are slightly closer to the requested scheduled time (considering p90 delta).

Provision at the right time

Schedules and Schedule Groups can be created at deploy time, using any IaC framework, or at runtime, using the SDK. A few recommendations regarding when to provision which:

Almost always, Schedule Groups should be provisioned at deploy time.
Schedule Groups used for tenancy segregation in multi-tenant applications are the only groups that should be provisioned at runtime, at the time of tenant creation.
Recurring Schedules without start and end dates are relevant throughout the entire lifespan of an application. They should be provisioned using IaC at deploy time.
One-time Schedules and recurring Schedules with a given timeframe should be created at runtime, as the result of a user action, within their respective previously provisioned Schedule Groups.

If you need to use UpdateSchedule actions, always use GetSchedule beforehand as starting point for your command payload

Indeed, update actions in EventBridge Scheduler use a replace all attributes strategy. If you omit a value that was previously given (at creation or previous update), the Schedule will use the default value for the corresponding missing attributes. This can lead to unexpected behavior.
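A possible way to apply this is sketched below: start from the GetSchedule response, keep only the fields an update accepts, then apply your changes. The list of updatable fields is my assumption and should be checked against the API reference; the sample response is made up.

```python
# Fields assumed to be accepted by UpdateSchedule (verify against the API reference)
UPDATABLE_FIELDS = (
    "Name", "GroupName", "ScheduleExpression", "ScheduleExpressionTimezone",
    "StartDate", "EndDate", "Description", "State", "FlexibleTimeWindow",
    "KmsKeyArn", "Target",
)


def build_update_payload(current: dict, changes: dict) -> dict:
    """Merge changes into the current Schedule so nothing falls back to a default."""
    payload = {key: current[key] for key in UPDATABLE_FIELDS if key in current}
    payload.update(changes)
    return payload


current_schedule = {  # trimmed, hypothetical GetSchedule response
    "Arn": "arn:aws:scheduler:eu-west-1:123456789012:schedule/default/reminder-42",
    "Name": "reminder-42",
    "GroupName": "default",
    "Description": "Reminder email for user 42",
    "ScheduleExpression": "at(2022-12-21T07:00:00)",
    "FlexibleTimeWindow": {"Mode": "OFF"},
    "State": "ENABLED",
    "Target": {"Arn": "arn:aws:lambda:eu-west-1:123456789012:function:reminder"},
}

payload = build_update_payload(current_schedule, {"State": "DISABLED"})
print(payload)
```

The read-only Arn attribute is dropped, while the description and target survive the update untouched.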

Prefer the use of UTC

Timezone management is a pain, always. The Scheduler tries to compensate with an optional timezone parameter and implements daylight savings time shift.

In most cases, if you want to avoid strange timezone behavior, rely on a date management library to convert the scheduled time of one-time Schedules to UTC before creating them.
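In Python, the standard library is enough for this conversion. The sketch below uses a fixed UTC+1 offset to stand for a local winter time; in real code the offset would come from your timezone data, and the helper name is mine.

```python
from datetime import datetime, timedelta, timezone


def to_utc_at_expression(local_time: datetime) -> str:
    """Convert a timezone-aware datetime to a UTC at() expression."""
    if local_time.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    utc_time = local_time.astimezone(timezone.utc)
    return f"at({utc_time.strftime('%Y-%m-%dT%H:%M:%S')})"


utc_plus_one = timezone(timedelta(hours=1))  # assumed local offset
local = datetime(2022, 12, 21, 8, 0, 0, tzinfo=utc_plus_one)
print(to_utc_at_expression(local))  # 8AM at UTC+1 is 7AM UTC
```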

Cron-based Schedules are the only relevant Schedules that might benefit from timezone sensitive settings.

Rate-based Schedules are unaffected by this setting.

Always use a DLQ on your Schedules

If it can fail, it will fail at some point. CloudWatch metrics available for the Scheduler cannot distinguish failed target invocations on a per-Schedule basis. Provisioning a dead-letter queue and referencing it as the destination for all your schedules is a must-have!
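Attaching the queue is a one-attribute change on the target. A minimal sketch, with hypothetical ARNs and a helper name of my own:

```python
def with_dead_letter_queue(target: dict, dlq_arn: str) -> dict:
    """Return a copy of the target that sends failed invocations to the DLQ."""
    return {**target, "DeadLetterConfig": {"Arn": dlq_arn}}


base_target = {
    "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:reminder",
    "RoleArn": "arn:aws:iam::123456789012:role/reminder-schedule-role",
}
target = with_dead_letter_queue(
    base_target,
    "arn:aws:sqs:eu-west-1:123456789012:schedules-dlq",  # hypothetical queue
)
print(target)
```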

Implement a regular cleanup process for irrelevant Schedules

Regular cleanup of now irrelevant Schedules should be implemented to keep the total number of Schedules under control and avoid reaching the service's 1 million quota. You can rely on a rate-based Schedule to regularly invoke a Lambda function dedicated to listing and deleting irrelevant Schedules. For debugging purposes, you can adjust how long a Schedule is kept around by changing your programmatic filtering parameters.
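The filtering part of such a cleanup function could look like the sketch below. It works on a dictionary shaped like a GetSchedule response, assumes one-time expressions are in UTC, and does not check whether the target was actually invoked successfully. The function name and retention value are mine.

```python
from datetime import datetime, timedelta, timezone


def is_irrelevant(schedule: dict, now: datetime, retention: timedelta) -> bool:
    """Tell whether a Schedule can be deleted, given a retention period."""
    if schedule.get("State") == "DISABLED":
        return True
    expression = schedule["ScheduleExpression"]
    if expression.startswith("at("):
        # One-time Schedule: at(yyyy-mm-ddThh:mm:ss), assumed to be UTC
        scheduled = datetime.strptime(expression[3:-1], "%Y-%m-%dT%H:%M:%S")
        scheduled = scheduled.replace(tzinfo=timezone.utc)
        return scheduled + retention < now
    end_date = schedule.get("EndDate")
    if end_date is not None:
        # Recurring Schedule whose timeframe is over
        return end_date + retention < now
    return False  # recurring Schedule without end date


now = datetime(2023, 1, 1, tzinfo=timezone.utc)
week = timedelta(days=7)
old_one_time = {"ScheduleExpression": "at(2022-12-21T07:00:00)", "State": "ENABLED"}
print(is_irrelevant(old_one_time, now, week))
```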

Conclusion

All things considered, the new AWS EventBridge Scheduler service feels like a blessing, especially for one-time Schedules where there were no robust alternatives on AWS. Google Cloud Platform has had Cloud Tasks since 2018 for this specific purpose. It's nice to see AWS matching the offer and providing, almost for free, a dedicated managed service with a precise scheduling mechanism.

At the time of publishing this discovery report, some questions remain unanswered. I'll dig into the following subjects and save the findings for a separate article:

why and when to use the flexible time window parameter
why and when to use the client token. How can you ensure idempotency when you interact with the Scheduler
what kind of L2/L3 CDK Construct can and should be implemented to ease up integration of this service