DEV Community

ccarcaci

Delayed events pattern, no more crons

Once upon a time, there were cron schedulers: the fundamental building blocks used to implement the scheduled-task pattern.

The scheduled-task pattern was a useful way to schedule activities and guarantee their execution.

Scheduled activities were of two types:

  • activities performed periodically (e.g. recalculate loan interest every two months)
  • delayed activities performed a specified time after an event occurs (e.g. a week after onboarding, send the user a survey)

Using schedulers, it was possible to fulfill these requirements, but many architectural drawbacks and bugs surfaced.

Schedulers drawbacks

Just to list some drawbacks that I've encountered during my career:

  • error handling: if a cron execution fails, how do we recover from where the execution failed?
  • dependencies between jobs: if job X depends on job Y, how do we guarantee that job Y finishes before job X starts?
  • execution management: if we want to perform some operation on the data, we must wait until the next cron execution
  • time zone dependency: we must carefully check that each machine running a cron job is aligned with the desired time zone
  • monitoring: how do we check what will be processed in the next execution?
  • high load on resources: when the scheduled task starts, it might require a lot of resources at once (e.g. a heavy DB query)
  • delayed activities that should occur after a fixed interval need fine-grained cron schedules (e.g. every hour, minute, or second)

Possible solutions

Solutions shall guarantee:

  • strong error handling
  • transactional processing
  • processing dependencies made explicit in the code
  • controllable configuration (e.g. time zone), monitoring, and resource usage

Delayed events pattern

Yes, the proposed solution uses event queues.

This pattern is based on two queues, a feed queue and a tick queue, although the latter is not strictly necessary.

Actors

  • Event source: the system that generates and publishes the event
  • Feed queue: the queue that gathers all the events that describe periodic or delayed processing
  • Delayed trigger: the system that orchestrates the execution of the events
  • Tick source: the source of truth for time (including the time zone)
  • Processing queue: the queue where events land when their processing should occur

Sequence diagrams

How this pattern works is described in this sequence diagram:

[sequence diagram]

The feed queue is fed with the events that shall be processed at some point in the future.

The delayed trigger stores all these events in an internal in-memory list (the batch) but, for now, does not commit the read messages back to the feed queue.

When the delayed trigger receives the tick event, it extracts from its internal batch all the events with timestamp < tick time. It is important to note that, by using an external tick queue, we make time an external dependency that we can control: an invariant for our system. Imagine you want to debug a specific event that occurred under specific conditions in the past; now you have all the tools. With time passed in as an external argument, unit tests can also achieve more coverage.

For each event extracted from the internal batch, the delayed trigger publishes it into a processing queue, commits it back to the feed queue and removes the event from the in-memory batch.

The systems that process the delayed events can now read them and execute the operations.

For failure-resiliency reasons, we use the explicit event-commit configuration (as we will see later in the failure section).
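The steps above can be sketched as follows; the class, the callback, and the in-memory lists are illustrative stand-ins for real broker clients, not an actual library API:

```python
class DelayedTrigger:
    """In-memory sketch of the delayed trigger (illustrative only)."""

    def __init__(self, processing_queue, feed_commit):
        self.batch = []                      # events read from the feed queue, not yet committed
        self.processing_queue = processing_queue
        self.feed_commit = feed_commit       # callback that commits an event back on the feed queue

    def on_feed_event(self, event):
        # Store the event but do NOT commit it yet: if we crash,
        # the broker redelivers it and nothing is lost.
        self.batch.append(event)

    def on_tick(self, tick_time):
        # Extract every event whose scheduled time has passed.
        due = [e for e in self.batch if int(e["event-time"]) < tick_time]
        for event in due:
            self.processing_queue.append(event)  # publish for processing
            self.feed_commit(event)              # only now is it safe to commit
            self.batch.remove(event)


processing_queue, committed = [], []
trigger = DelayedTrigger(processing_queue, committed.append)
trigger.on_feed_event({"event-time": "100", "event-id": "a"})
trigger.on_feed_event({"event-time": "500", "event-id": "b"})
trigger.on_tick(tick_time=200)   # only event "a" is due
```

In production, the batch could be a priority queue ordered by event time, so each tick only inspects the earliest events.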

An example of delayed event payload is:

{
  "event-time": "1685913444",
  "event-id": "9e34a177-7ac9-49df-8843-cbf6291cf2c6",
  "payload": { ... },
  "operation": { ... }
}
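An event with this shape could be built as in the sketch below; `make_delayed_event` and its arguments are hypothetical helpers, not part of any library:

```python
import time
import uuid

def make_delayed_event(delay_seconds, payload, operation):
    # "event-time" is a Unix timestamp (as a string) at which
    # the delayed trigger should release the event.
    return {
        "event-time": str(int(time.time()) + delay_seconds),
        "event-id": str(uuid.uuid4()),
        "payload": payload,
        "operation": operation,
    }

ONE_WEEK = 7 * 24 * 3600
# e.g. the onboarding survey: deliver one week from now
event = make_delayed_event(ONE_WEEK, {"user": "42"}, {"type": "send-survey"})
```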

Periodic activities

It might seem that, with this pattern, it is not possible to execute operations on data at a fixed time. For example, processing all the data at midnight every day.

That's not the case, for two reasons.

First, such limitations are symptoms of bad constraints in the system. For example, when using cron, a recurring question is "when do we execute this?". This gives the impression that executing operations at a fixed time is mandatory; instead, it is a non-functional requirement. Every time I've faced this constraint, it was possible to improve the product by executing things immediately using a queue or a synchronous API. Scheduled tasks were just a common way to shift the high load to times when the system was more idle.

Second, it is always possible to specify the right timestamp in the delayed event (e.g. the next midnight).
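For instance, "the next midnight" can be computed when the event is created; a minimal sketch, assuming UTC as the reference time zone:

```python
from datetime import datetime, timedelta, timezone

def next_midnight(now):
    """Return the first midnight (UTC) strictly after `now`."""
    tomorrow = (now + timedelta(days=1)).date()
    return datetime.combine(tomorrow, datetime.min.time(), tzinfo=timezone.utc)

now = datetime(2023, 6, 4, 21, 57, 24, tzinfo=timezone.utc)
event_time = str(int(next_midnight(now).timestamp()))  # goes into "event-time"
```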

Delayed activities

By using this pattern, we don't need to run a cron job on a short schedule just to catch due work: the event will be processed at the right time.

Delayed trigger failure

In this architecture, we assume that the queues are fail-safe and that the processing systems are reliable.

The only component specific to this architecture is the delayed trigger. What happens if it fails and stops working?

No events are lost: the feed queue stores all the messages and, when the delayed trigger instance comes back, the uncommitted messages are read and processed at the next tick.

What will be experienced is some delay in the results.

At the same time, by managing partitions and consumers wisely, it is possible after the downtime to launch multiple instances of the delayed trigger to cope with the load accumulated in the feed queue: a countermeasure that was not possible (or at least difficult) with the scheduled-task architecture.

Scaling

Since the events are independent of each other, this architecture is horizontally scalable: there can be multiple instances of the delayed trigger system and several partitions (or clusters) of event queues.

The cron solution doesn't offer this possibility: processing happens in a single tier.

Technology

The best technical solution for the event queues is a message-broker technology like RabbitMQ.

There is, however, nothing preventing the use of a pure event-streaming technology like Kafka.

In any case, to guarantee failover, the explicit commit functionality should be enabled.
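With Kafka, for example, that means disabling auto-commit and committing only after an event has been safely handled; a sketch assuming the kafka-python client, with placeholder topic name and broker address:

```python
import json
from kafka import KafkaConsumer  # kafka-python client

consumer = KafkaConsumer(
    "feed-queue",                          # placeholder topic name
    bootstrap_servers="localhost:9092",    # placeholder broker address
    group_id="delayed-trigger",
    enable_auto_commit=False,              # nothing is committed automatically
    value_deserializer=lambda raw: json.loads(raw),
)

for message in consumer:
    event = message.value
    # ... add the event to the in-memory batch, publish due events ...
    consumer.commit()  # explicit commit, only after the event is safely handled
```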

Wrap up

This pattern decouples the moving parts of scheduled-task execution as much as possible.

The processing is managed by independent and bounded events that live on their own.

The delayed events pattern also makes time an invariant input: all the information is provided as arguments, which enables a high level of coverage with unit tests.

On top of this, it becomes possible to monitor the event flow and future schedules just by looking at the queue statuses. This is not feasible with cron or schedulers, since the execution runs all at once and it is difficult to follow the logs to understand what is going on.
