DEV Community

Yan Cui for AWS Heroes

Posted on • Originally published at theburningmonk.com on

The biggest problem with EventBridge Scheduler and how to fix it

The launch of EventBridge Scheduler was one of my highlights of re:Invent 2022. Finally, we have a scalable service that lets us schedule ad-hoc, one-off tasks in a serverless way!

If you are a longtime follower of my work, you might have read “Serverless Architectures on AWS, 2nd Edition”. In the book, I spent an entire chapter showing you five ways to implement a similar service and discussing the considerations for such a service:

  • Precision: how close to the scheduled time is the task executed?
  • Scalability (number of open tasks): can the service support millions of tasks that are scheduled but not yet executed?
  • Scalability (hotspots): can the service execute millions of tasks at the same time?
  • Cost

The chapter teaches you about architectural design and how to think about (and manipulate) trade-offs by walking you through five different implementations. While the lessons from this chapter are still relevant, the implementation ideas are largely superseded by EventBridge Scheduler. Unless you require millisecond-level precision, there is no good reason to build a custom solution anymore.

Having said that, EventBridge Scheduler still has a big problem.

At the time of writing, one-off schedules are not automatically deleted after they have been executed.

This is a problem because:

  1. It pollutes the control plane with lots of expired schedules that will never be executed again. It makes iterating through and finding relevant schedules more difficult.
  2. More importantly, there is an initial limit of 1,000,000 schedules per region per account. See the official quotas page for EventBridge Scheduler.

Even the official quotas page says “We recommend deleting your one-time schedules after they’ve completed…”. It’s a shame there is no support for automatic deletion at this point.

To me, this is the biggest problem with using EventBridge Scheduler for executing one-off tasks right now. It is exactly what I described as the “Scalability (number of open tasks)” criterion above.

The fix

Luckily, this is a problem that we can solve with relative ease.

I saw a blog post from Pubudu Jayawardana on how you can solve this problem using Step Functions.

It’s a clever idea and I like it. But a simpler and cheaper solution would be to use Lambda Destinations instead.

When EventBridge Scheduler invokes the target Lambda function, it does so via an asynchronous invocation. This means we can use Lambda Destinations (which doesn’t support synchronous invocations) to trigger the cleanup step and delete the schedule.
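As a rough sketch of the wiring, assuming you configure the destination through the AWS SDK rather than through your deployment framework: Lambda's `putFunctionEventInvokeConfig` API accepts a `DestinationConfig` with an `OnSuccess` destination. The function name and ARN below are hypothetical placeholders, and the demo repo may well set this up differently.

```javascript
// Builds the parameters for Lambda's putFunctionEventInvokeConfig call.
// Both arguments are hypothetical placeholders for your own resources.
function buildDestinationConfig (targetFunctionName, cleanupFunctionArn) {
  return {
    FunctionName: targetFunctionName,
    DestinationConfig: {
      // invoked only after an asynchronous invocation succeeds
      OnSuccess: { Destination: cleanupFunctionArn }
    }
  }
}

const params = buildDestinationConfig(
  'execute-task',
  'arn:aws:lambda:us-east-1:123456789012:function:cleanup-schedule'
)
console.log(JSON.stringify(params, null, 2))

// With the AWS SDK v2 this would be applied as:
//   const Lambda = require('aws-sdk/clients/lambda')
//   await new Lambda().putFunctionEventInvokeConfig(params).promise()
```

Leaving `OnFailure` unset means a failed task never triggers the cleanup step, so its schedule stays around. That may or may not be what you want.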

You can see an example of this in this demo repo.

For this to work, the onSuccess function needs to know the name of the schedule. It’s the only piece of information you need to delete a schedule, as you can see from the code snippet below.

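// AWS SDK for JavaScript v2 client for EventBridge Scheduler
const Scheduler = require('aws-sdk/clients/scheduler')
const SchedulerClient = new Scheduler()

module.exports.handler = async (event) => {
  // Lambda Destinations wraps the target function's original
  // invocation event in requestPayload
  const name = event.requestPayload.name

  // delete the schedule now that it has been executed
  await SchedulerClient.deleteSchedule({
    Name: name
  }).promise()
}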

To make this work, we just need to make sure the target Lambda function (for the schedule) receives the name of the schedule as part of its invocation event. When the Lambda service invokes the onSuccess function, it passes that original event along as requestPayload, which you can see in the trace collected in Lumigo.
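For reference, here is a trimmed sketch of the invocation record that Lambda sends to an onSuccess destination. The overall shape follows the Lambda documentation, but the IDs, ARN and schedule name are made-up placeholder values.

```javascript
// Trimmed example of a Lambda Destinations invocation record.
// All values are illustrative placeholders, not real output.
const record = {
  version: '1.0',
  timestamp: '2023-01-01T12:00:01.000Z',
  requestContext: {
    requestId: '00000000-0000-0000-0000-000000000000',
    functionArn: 'arn:aws:lambda:us-east-1:123456789012:function:execute-task:$LATEST',
    condition: 'Success',
    approximateInvokeCount: 1
  },
  // the event the target function was originally invoked with
  requestPayload: { name: 'my-schedule' },
  responseContext: { statusCode: 200, executedVersion: '$LATEST' },
  // whatever the target function returned
  responsePayload: null
}

// this is the lookup the onSuccess function performs
const scheduleName = record.requestPayload.name
console.log(scheduleName)
```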

You can see how these fit together in my demo repo. In the repo, this is the API Gateway function that creates the schedule:

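// AWS SDK for JavaScript v2 client for EventBridge Scheduler
const Scheduler = require('aws-sdk/clients/scheduler')
const SchedulerClient = new Scheduler()
const uuid = require('uuid')

// ARN of the target Lambda function and of the IAM role
// that EventBridge Scheduler assumes to invoke it
const { EXECUTE_ARN, ROLE_ARN } = process.env

/**
 * Creates a one-off schedule. The request body is expected to be
 * a timestamp in the yyyy-mm-ddThh:mm:ss format.
 *
 * @param {import('aws-lambda').APIGatewayEvent} event
 * @returns {Promise<import('aws-lambda').APIGatewayProxyResult>}
 */
module.exports.handler = async (event) => {
  // the schedule name doubles as the ID we need for the cleanup later
  const name = uuid.v4()
  const resp = await SchedulerClient.createSchedule({
    Name: name,
    // at(...) runs the schedule once at the given time
    ScheduleExpression: `at(${event.body})`,
    FlexibleTimeWindow: {
      Mode: 'OFF'
    },
    Target: {
      Arn: EXECUTE_ARN,
      RoleArn: ROLE_ARN,
      // pass the schedule name to the target function
      Input: JSON.stringify({
        name
      })
    }
  }).promise()

  return {
    statusCode: 200,
    body: resp.ScheduleArn
  }
}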
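The handler above trusts the caller to send a timestamp in the `yyyy-mm-ddThh:mm:ss` form that `at(...)` expects. If you would rather start from a JavaScript `Date`, a small helper along these lines could build the expression. `toAtExpression` is a hypothetical name, and the sketch assumes the schedule uses the default UTC timezone.

```javascript
// Hypothetical helper: formats a Date as a one-time schedule expression.
// Assumes the schedule is evaluated in UTC (the default when no
// ScheduleExpressionTimezone is set).
function toAtExpression (date) {
  if (!(date instanceof Date) || Number.isNaN(date.getTime())) {
    throw new TypeError('expected a valid Date')
  }
  // toISOString() gives e.g. 2023-01-01T12:34:56.789Z in UTC;
  // keep the first 19 characters to drop milliseconds and the trailing Z
  return `at(${date.toISOString().slice(0, 19)})`
}

console.log(toAtExpression(new Date('2023-01-01T12:34:56.789Z')))
// prints "at(2023-01-01T12:34:56)"
```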

Note the name of the schedule is passed along in Target.Input. This input then becomes the invocation event for the target Lambda function.

module.exports.handler = async (event) => {
  // the name of the schedule
  // is captured in event.name
}

So the name is passed along to the target Lambda function and then on to the onSuccess function, which uses it to delete the schedule from EventBridge Scheduler.
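One caveat, raised in the comments below: if the schedule lives in a non-default schedule group, `deleteSchedule` also needs the group name. Here is a minimal sketch of one way to handle that, assuming you add a `groupName` field (a hypothetical name, not part of the demo repo) to `Target.Input` next to `name`:

```javascript
// Builds deleteSchedule parameters from the Lambda Destinations record.
// `groupName` is a hypothetical extra field carried in Target.Input.
function buildDeleteParams (destinationEvent) {
  const { name, groupName } = destinationEvent.requestPayload
  const params = { Name: name }
  // GroupName is only needed for schedules outside the default group
  if (groupName) {
    params.GroupName = groupName
  }
  return params
}

console.log(buildDeleteParams({ requestPayload: { name: 'a' } }))
console.log(buildDeleteParams({ requestPayload: { name: 'b', groupName: 'orders' } }))

// In the onSuccess handler this would be used as:
//   await SchedulerClient.deleteSchedule(buildDeleteParams(event)).promise()
```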

Wrap up

I hope you have found this article useful and that it helps you make better use of EventBridge Scheduler. It’s one of the most exciting services that AWS has launched in recent years. If you are using it already or are thinking about using it, then please let me know via Twitter or LinkedIn.

I also want to thank Pubudu for sharing his idea of using Step Functions. It gave me the inspiration to write up my thoughts and share them with you.

If you want to learn more about building serverless architectures, then check out my upcoming workshop, where I will be covering topics such as testing, security, observability and much more.

Hope to see you there.

The post The biggest problem with EventBridge Scheduler and how to fix it appeared first on theburningmonk.com.

Top comments (5)

wliew99

Hey, did you try to set the RetryPolicy when creating a schedule, e.g.
RetryPolicy: {
MaximumRetryCount: 3,
}

It doesn't seem to have an impact on the schedule created, as it still defaults to 185 retries. Just thought you might know the fix for this....
wendy

Yan Cui

If you're talking about this RetryPolicy: docs.aws.amazon.com/AWSCloudFormat...

then the problem is you have a typo. The field should be called "MaximumRetryAttempts" instead of "MaximumRetryCount", which would explain why it's not working.

I wish CloudFormation would reject the template when it has typos like this though :-/

wendyliew99 • Edited

Thank you for this. So we just delete our schedule once it's done? Do we get a CloudWatch log showing that it finished successfully?

Yan Cui

Yes you can! But it depends, there's a limit on the no. of schedules you can have, so if you have a large no. of schedules all finished around the same time, and then a large no. of schedules all being created right after, then there's a time window where you are more likely to hit that limit and are unable to create new schedules until the old ones are cleaned up. The limit is sufficiently high (1M by default) so it shouldn't be an issue. But also keep in mind that, this cron approach would have the usual trappings of batch jobs - taking too long to run and timing out and no-one noticing, etc.

wendyliew99

I tried this out and apparently, if you create your schedule under a non-default group, you need to issue the delete schedule call with the group name as well as the schedule name.