TL;DR
Be wary of threading, locking, and job duration when using Spring Scheduler in an application deployed on Kubernetes; otherwise you risk duplicate and delayed job runs. If you’re not too far down the Spring Scheduler route, consider an alternative such as a Kubernetes CronJob.
Background
My pair and I were tasked with implementing two daily jobs in our Spring Boot Application to email a set of users. Before jumping into solutions we outlined our problem constraints:
- Our Spring Boot application is deployed on Kubernetes
- The scheduled jobs:
  - Need to run at a specific time
  - Have variable runtimes, usually on the order of minutes
  - Share core application code, and splitting out those dependencies would introduce risk and complexity
We narrowed our options down to Spring Scheduler and a Kubernetes CronJob. We decided to start with Spring Scheduler because it was familiar, and after adding a couple of annotations we were up and running. Both jobs were scheduled to run at 11:00 and we added some logging so we could keep an eye on them:
@Scheduled(cron = "0 0 11 * * *")
public void job1() {
    System.out.println("Running job1.");
    sendEmails();
}

@Scheduled(cron = "0 0 11 * * *")
public void job2() {
    System.out.println("Running job2.");
    sendMoreEmails();
}
We pushed the changes and discussed our testing strategy while we waited for our new jobs to run. Our initial test would be to check the logs for entries related to the scheduled jobs. We needed to verify that each job:
- Ran at the configured time (11 AM)
- Ran exactly once
We’ll use a table to track our progress:
Job | Ran at the configured time | Ran exactly once |
---|---|---|
job1 | - | - |
job2 | - | - |
The Problem
When we checked the logs later that day we found that the jobs had run; however, there were two log entries for each job! We also noticed that the second job (“job2”) had started nearly eight minutes after 11:00 in both cases:
2023-01-15 11:00:00.000 [ scheduling-1] [pod-1] : Running job1.
2023-01-15 11:00:00.000 [ scheduling-1] [pod-2] : Running job1.
2023-01-15 11:07:42.915 [ scheduling-1] [pod-1] : Running job2.
2023-01-15 11:08:03.792 [ scheduling-1] [pod-2] : Running job2.
Our test results:
Job | Ran at the configured time | Ran exactly once |
---|---|---|
job1 | ✅ | ❌ |
job2 | ❌ | ❌ |
We learned two things from the logs:
- The duplicate entries were from different application instances (based on the pod name in the log context)
- The jobs on each instance were running on the same thread “scheduling-1”
Weird. Down the rabbit hole we go!
Duplicate Job Runs
Since we only had two application instances deployed in our development environment, we guessed that every instance was running the scheduled jobs. We verified this by increasing and decreasing the number of pods and checking our logs.
A quick search on Stack Overflow told us we weren’t the first ones to encounter the problem of running a scheduled job in an application deployed in Kubernetes. After looking at some solutions and chatting with colleagues we decided to check out a locking solution called ShedLock. Briefly, ShedLock maintains a database table that can be used as a lock amongst application instances. We’re going to gloss over ShedLock configuration because many folks have already covered it in depth. Once configured, the first application instance to run the job will create a lock to prevent any other instance from running the same job. When the job is done or the max lock time is reached, the lock is released.
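To make the locking semantics concrete, here is a minimal in-memory sketch of a single lock row. It is a simplified model for illustration only, not ShedLock's actual implementation: the class `LockRow` and its method names are hypothetical, and a real setup keeps this state in a shared database table rather than in memory.

```java
import java.time.Duration;
import java.time.Instant;

/** Simplified, hypothetical model of one lock row shared by all instances. */
class LockRow {
    private Instant lockedAt = Instant.EPOCH;
    private Instant lockUntil = Instant.EPOCH;
    private Duration lockAtLeastFor = Duration.ZERO;

    /** Tries to take the lock at time 'now'; fails while an earlier lock is still active. */
    synchronized boolean tryLock(Instant now, Duration lockAtLeastFor, Duration lockAtMostFor) {
        if (now.isBefore(lockUntil)) {
            return false; // another instance still holds the lock
        }
        this.lockedAt = now;
        this.lockAtLeastFor = lockAtLeastFor;
        this.lockUntil = now.plus(lockAtMostFor); // upper bound in case the holder never unlocks
        return true;
    }

    /** Releases the lock when the job ends, but never earlier than lockedAt + lockAtLeastFor. */
    synchronized void unlock(Instant now) {
        Instant earliestRelease = lockedAt.plus(lockAtLeastFor);
        lockUntil = now.isAfter(earliestRelease) ? now : earliestRelease;
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2023-01-16T11:00:00Z");
        Duration atLeast = Duration.ofMinutes(5);
        Duration atMost = Duration.ofMinutes(10);
        LockRow job1 = new LockRow();

        System.out.println(job1.tryLock(start, atLeast, atMost));                // first instance acquires the lock
        System.out.println(job1.tryLock(start.plusSeconds(1), atLeast, atMost)); // second instance is rejected
        job1.unlock(start.plus(Duration.ofMinutes(8)));                          // job finishes after 8 minutes
        System.out.println(job1.tryLock(start.plus(Duration.ofMinutes(9)), atLeast, atMost)); // lock is free again
    }
}
```

The last call in `main` succeeds because the job outlasted “lockAtLeastFor”, so the lock was released as soon as the job ended.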
Here’s what our jobs look like with the ShedLock annotation (“@SchedulerLock”), minimum lock time (“lockAtLeastFor”), and maximum lock time (“lockAtMostFor”):
@Scheduled(cron = "0 0 11 * * *")
@SchedulerLock(name = "job1", lockAtLeastFor = "PT5m", lockAtMostFor = "PT10m")
public void job1() {
    System.out.println("Running job1.");
    sendEmails();
}

@Scheduled(cron = "0 0 11 * * *")
@SchedulerLock(name = "job2", lockAtLeastFor = "PT5m", lockAtMostFor = "PT10m")
public void job2() {
    System.out.println("Running job2.");
    sendMoreEmails();
}
Note: We’ll come back to the “lockAtLeastFor” and “lockAtMostFor” values in a bit…
We deployed the changes and checked the logs again:
2023-01-16 11:00:00.000 [ scheduling-1] [pod-1] : Running job1.
2023-01-16 11:00:00.000 [ scheduling-1] [pod-2] : Running job2.
2023-01-16 11:08:03.206 [ scheduling-1] [pod-2] : Running job1.
Duplicate logs again! However, this time it was only for one of the jobs - progress!
Job | Ran at the configured time | Ran exactly once |
---|---|---|
job1 | ✅ | ❌ |
job2 | ✅ | ✅ |
Lock Duration
My pair and I were stuck, so we drew a diagram to plot the series of events.
Both jobs started at the same time, one on each instance, and each acquired its lock. When “job1” finished on the first instance, its lock was released. Meanwhile, “job2” was still running on the second instance with “job1” queued behind it. Once “job2” finished, “job1” was dequeued and was able to start because its lock had already been released.
The fix for a lock that is too short is to make it outlast the other job: the minimum lock time (“lockAtLeastFor”) for “job1” should be longer than “job2” takes, and vice versa. Since our jobs only run once a day we could be liberal with the lock times, so we locked each job for a minimum of 15 hours and a maximum of 20 hours:
@Scheduled(cron = "0 0 11 * * *")
@SchedulerLock(name = "job1", lockAtLeastFor = "PT15h", lockAtMostFor = "PT20h")
public void job1() {
    System.out.println("Running job1.");
    sendEmails();
}

@Scheduled(cron = "0 0 11 * * *")
@SchedulerLock(name = "job2", lockAtLeastFor = "PT15h", lockAtMostFor = "PT20h")
public void job2() {
    System.out.println("Running job2.");
    sendMoreEmails();
}
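The timing argument can be checked with a little date arithmetic. The sketch below is only a back-of-the-envelope model: the class `LockTimeline` and its methods are hypothetical, the 7 minute 42 second duration for “job1” is an assumption read off the first set of logs, and “lockAtMostFor” is ignored because the job finishes well within it.

```java
import java.time.Duration;
import java.time.LocalDateTime;

class LockTimeline {
    /** When the lock frees up: at job end, or at lockedAt + lockAtLeastFor if that is later. */
    static LocalDateTime releaseTime(LocalDateTime lockedAt, Duration jobDuration, Duration lockAtLeastFor) {
        LocalDateTime jobEnd = lockedAt.plus(jobDuration);
        LocalDateTime minimumHold = lockedAt.plus(lockAtLeastFor);
        return jobEnd.isAfter(minimumHold) ? jobEnd : minimumHold;
    }

    /** True if a queued attempt at 'attemptAt' finds the lock free and so runs the job a second time. */
    static boolean runsAgain(LocalDateTime lockedAt, Duration jobDuration,
                             Duration lockAtLeastFor, LocalDateTime attemptAt) {
        return !attemptAt.isBefore(releaseTime(lockedAt, jobDuration, lockAtLeastFor));
    }

    public static void main(String[] args) {
        LocalDateTime start = LocalDateTime.parse("2023-01-16T11:00:00");
        LocalDateTime dequeued = LocalDateTime.parse("2023-01-16T11:08:03"); // second instance reaches job1
        Duration job1Duration = Duration.ofMinutes(7).plusSeconds(42);       // assumed from the first logs

        // 5-minute minimum: the lock is released at 11:07:42, so the queued run goes ahead.
        System.out.println(runsAgain(start, job1Duration, Duration.parse("PT5M"), dequeued));  // true
        // 15-hour minimum: the lock is held until 02:00 the next day, so the queued run is skipped.
        System.out.println(runsAgain(start, job1Duration, Duration.parse("PT15H"), dequeued)); // false
    }
}
```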
Now the timeline looks different: when “job1” is dequeued on the second instance, its lock is still held, so the second run is skipped.
Locking for 15 hours is overkill for our eight-minute job, but my pair and I agreed that, in general, the minimum lock time should be close to the interval between runs to reduce the chance of a job running more than once. Our take: if your job runs every minute, lock it for 59 seconds; if it runs every hour, lock it for 59 minutes; and so on.
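One way to turn that rule of thumb into code is to subtract a small margin from the run interval. The sketch below uses a margin of 1/60 of the interval, which is purely an assumption chosen because it reproduces the two examples above; the class `LockDurations` and its method are hypothetical helpers, not part of ShedLock or Spring.

```java
import java.time.Duration;

class LockDurations {
    /** Hypothetical rule of thumb: the run interval minus a safety margin of 1/60 of that interval. */
    static Duration minimumLockFor(Duration runInterval) {
        return runInterval.minus(runInterval.dividedBy(60));
    }

    public static void main(String[] args) {
        System.out.println(minimumLockFor(Duration.ofMinutes(1))); // PT59S
        System.out.println(minimumLockFor(Duration.ofHours(1)));   // PT59M
        System.out.println(minimumLockFor(Duration.ofDays(1)));    // PT23H36M
    }
}
```

For a daily job this gives a little under 24 hours, which is more aggressive than the 15 hours we settled on; either works as long as the lock outlasts every other job that could be queued ahead of it.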
We deployed and checked the logs again:
2023-01-16 11:00:00.000 [ scheduling-1] [pod-1] : Running job1.
2023-01-16 11:00:00.000 [ scheduling-1] [pod-2] : Running job2.
Job | Ran at the configured time | Ran exactly once |
---|---|---|
job1 | ✅ | ✅ |
job2 | ✅ | ✅ |
Scheduled Jobs on Same Thread
We had solved the duplicate runs with locking, but my pair and I were still hung up on the jobs running on the same thread. Our solution wasn’t foolproof because adding another job would still cause one job to be queued behind another. This means that a third job would start only after another job had finished rather than at its scheduled time.
More generally, the queueing issue occurs whenever more jobs are scheduled for the same time than there are application instances. After reading some thrilling Spring documentation we learned that Spring Scheduler is configured to use a single thread by default. The fix was to increase the number of threads allocated to the scheduler:
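import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.SchedulingConfigurer;
import org.springframework.scheduling.concurrent.ThreadPoolTaskScheduler;
import org.springframework.scheduling.config.ScheduledTaskRegistrar;

@Configuration
public class SchedulingConfigurerConfiguration implements SchedulingConfigurer {
    @Override
    public void configureTasks(ScheduledTaskRegistrar taskRegistrar) {
        ThreadPoolTaskScheduler taskScheduler = new ThreadPoolTaskScheduler();
        // The default is a single thread; use 10 so jobs due at the same time don't queue
        taskScheduler.setPoolSize(10);
        taskScheduler.initialize();
        taskRegistrar.setTaskScheduler(taskScheduler);
    }
}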
Now that the scheduler has more threads, jobs no longer queue behind one another and each starts at its configured time. Once a job has started, the lock is in place, preventing any other application instance from starting the same job.
My pair and I agreed that the safest route was to configure locking and increase the number of threads. By using both solutions, we minimize the chance of any job being run more than once or not being run at the correct time.
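The effect of the pool size can be reproduced without Spring. The sketch below uses a plain `ScheduledExecutorService` from `java.util.concurrent` as a stand-in for the Spring scheduler; the class `PoolSizeDemo`, the 50 ms delay, and the 300 ms job length are arbitrary choices for illustration. It schedules two jobs for the same instant and measures how far apart they actually start.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class PoolSizeDemo {
    /** Schedules two jobs for the same instant and returns how many milliseconds apart they started. */
    static long startGapMillis(int poolSize, long jobMillis) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(poolSize);
        long[] startNanos = new long[2];
        CountDownLatch done = new CountDownLatch(2);
        for (int i = 0; i < 2; i++) {
            final int job = i;
            scheduler.schedule(() -> {
                startNanos[job] = System.nanoTime(); // record when this job really started
                try {
                    Thread.sleep(jobMillis);         // stand-in for sending emails
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }, 50, TimeUnit.MILLISECONDS);
        }
        try {
            done.await(); // wait for both jobs to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            scheduler.shutdown();
        }
        return Math.abs(startNanos[1] - startNanos[0]) / 1_000_000;
    }

    public static void main(String[] args) {
        // One thread: the second job waits for the first, so the gap is roughly the job length.
        System.out.println("pool size 1: " + startGapMillis(1, 300) + " ms apart");
        // Two threads: both jobs start at about the same time.
        System.out.println("pool size 2: " + startGapMillis(2, 300) + " ms apart");
    }
}
```

With a single thread the gap is about as long as the first job, which mirrors the delayed “job2” in our first set of logs.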
Conclusion
My pair and I learned a lot about the pitfalls of Spring Scheduler and how running jobs in Kubernetes introduces new complexity. Between threading, locking, job duration, and application instances, there are a lot of potential snags that you may hit along the way. Hopefully our discoveries help at least a couple of folks navigate their own implementation.
References
- https://stackoverflow.com/a/49533618/20514850
- https://levelup.gitconnected.com/solving-multiple-executions-of-scheduled-tasks-across-multiple-nodes-even-with-shedlock-spring-2b1d26db9356
- https://www.baeldung.com/shedlock-spring
- https://dhananjay4058.medium.com/lock-scheduled-tasks-with-shedlock-and-spring-boot-f67200dad675
- https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/