Abhishek Gupta for ITNEXT

Posted on Jan 9, 2020

Tutorial: Basics of Kubernetes Job and CronJob

#kubernetes #tutorial #showdev #beginners

Welcome to another installment of the "Kubernetes in a Nutshell" blog series 👋👋 So far we covered Kubernetes resources (objects) such as Deployments, Services, Volumes, etc.

In this blog, we will explore Job and CronJob. With the help of examples, you will learn about:

How to use these components
How to specify constraints such as time limits and concurrency
How to handle failures

The code (lots of YAML 😉) is available on GitHub

Job

You can use a Kubernetes Job to run batch processes, ETL jobs, ad-hoc operations, etc. It starts off a Pod and lets it run to completion. This is quite different from other Pod controllers such as a Deployment or ReplicaSet, which keep their Pods running indefinitely.

As always, we will learn by doing. So, let's dive in!

Hello Job!

Here is what a typical Job manifest looks like:

apiVersion: batch/v1
kind: Job
metadata:
  name: job1
spec:
  template:
    spec:
      containers:
        - name: job
          image: busybox
          args:
            - /bin/sh
            - -c
            - date; echo sleeping....; sleep 90s; echo exiting...; date
      restartPolicy: Never

This Job starts a busybox container which executes a handful of shell commands. Let's create this Job and investigate what's going on.

To keep things simple, the YAML file is being referenced directly from the GitHub repo, but you can also download the file to your local machine and use it in the same way.

kubectl apply -f https://raw.githubusercontent.com/abhirockzz/kubernetes-in-a-nutshell/master/jobs/job1.yaml

Check the Job and its associated Pod

kubectl get job/job1

NAME   COMPLETIONS   DURATION   AGE
job1   0/1           8s         8s

You should see a Pod in Running state, for example:

kubectl get pod -l=job-name=job1

job1-bptmd 1/1  Running

If you check the Pod logs, you should see something similar to this:

kubectl logs <pod_name>

Thu Jan  9 10:10:05 UTC 2020
sleeping....

Check the job again after ~90s

kubectl get job/job1

NAME   COMPLETIONS   DURATION   AGE
job1   1/1           95s        102s

The Job ran for a little over 90s and COMPLETIONS reflects that one Pod completed successfully. This is reflected in the Pod logs as well:

Thu Jan  9 10:10:05 UTC 2020
sleeping....
exiting...
Thu Jan  9 10:11:35 UTC 2020

Also, the Pod status should change to Completed

kubectl get pod -l=job-name=job1

job1-bptmd 0/1  Completed

If all the Job did was create a Pod to run a container, why can't we use a plain old Pod? That's because a Job is managed by a controller: if the Pod fails or is lost (for example, because its node goes down), Kubernetes can create a replacement - that cannot happen with an isolated Pod. In addition to this, the Job controller provides many other capabilities, which we will explore going forward.

To delete this Job, simply run kubectl delete job/job1

Enforcing a time limit

Suppose you are running a batch job and, for some reason, it takes too long to finish. This might be undesirable. You can limit the time for which a Job can continue to run by setting the activeDeadlineSeconds attribute in the spec.

Here is an example:

apiVersion: batch/v1
kind: Job
metadata:
  name: job2
spec:
  activeDeadlineSeconds: 5
  template:
    spec:
      containers:
        - name: job
          image: busybox
          args:
            - /bin/sh
            - -c
            - date; echo sleeping....; sleep 10s; echo exiting...; date
      restartPolicy: Never

Notice that the activeDeadlineSeconds has been set to 5 seconds while the container process has been designated to run for 10s.

Create the Job, wait for a few seconds (~10s) and check the Job

kubectl apply -f https://raw.githubusercontent.com/abhirockzz/kubernetes-in-a-nutshell/master/jobs/job2.yaml
kubectl get job/job2 -o yaml

Scroll down to check the status field and you will see that the Job is in a Failed state due to DeadlineExceeded

status:
  conditions:
  - lastProbeTime: "2020-01-09T10:57:13Z"
    lastTransitionTime: "2020-01-09T10:57:13Z"
    message: Job was active longer than specified deadline
    reason: DeadlineExceeded
    status: "True"
    type: Failed

To delete the job, simply run kubectl delete job/job2

Handling failures

What if there are issues due to container failure (process exited) or Pod failure? Let's try this out by simulating a failure.

In this Job, the container prints the date, sleeps for 5s and exits with a status 1 to simulate failure

apiVersion: batch/v1
kind: Job
metadata:
  name: job3
spec:
  backoffLimit: 2
  template:
    spec:
      containers:
        - name: job
          image: busybox
          args:
            - /bin/sh
            - -c
            - date; echo sleeping....; sleep 5s; exit 1;
      restartPolicy: OnFailure

Notice that the restartPolicy: OnFailure is different compared to the previous example where it was set to Never - we will come back to this in a moment

Create the Job and keep an eye on a specific Pod for this job.

kubectl apply -f https://raw.githubusercontent.com/abhirockzz/kubernetes-in-a-nutshell/master/jobs/job3.yaml
kubectl get pod -l=job-name=job3 -w

You should see something similar to below:

NAME                                     READY   STATUS              RESTARTS   AGE
job3-qgv4b                               0/1     ContainerCreating   0          4s
job3-qgv4b                               1/1     Running             0          6s
job3-qgv4b                               0/1     Error               0          12s
job3-qgv4b                               1/1     Running             1          17s
job3-qgv4b                               0/1     Error               1          22s
job3-qgv4b                               0/1     CrashLoopBackOff    1          34s
job3-qgv4b                               1/1     Running             2          40s
job3-qgv4b                               1/1     Terminating         2          40s
job3-qgv4b                               0/1     Terminating         2          45s
job3-qgv4b                               0/1     Terminating         2          51s

Notice how the Pod status transitions

it starts off by pulling and running the container
it transitions to Error state since it exits with status 1 (after sleeping for 5s)
it goes back to Running status again (notice that the RESTARTS count is now 1)
as expected, it goes into Error state again and is restarted once more - RESTARTS count is now 2
finally, it's terminated

Kubernetes restarted the container for us because we specified restartPolicy: OnFailure (the restart itself happens in place, performed by the kubelet on the node, and the Job controller counts it as a retry). Left unchecked, this could continue indefinitely, so we put a limit on it using backoffLimit: 2, which ensures that Kubernetes retries only twice before marking this Job as Failed.

Note that this was an example of the container being restarted. The Job controller can also create a new Pod in case of a Pod failure.
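To observe Pod-level retries rather than in-place container restarts, the same Job can be run with restartPolicy: Never. The manifest below is a minimal sketch and not a file from the GitHub repo; the name job3-never is made up for illustration. With this setting, a failed Pod is left in the Error state and the Job controller is expected to start a fresh Pod for each retry until backoffLimit is used up.

```yaml
# Hypothetical variant of job3 (the name job3-never is an assumption, not part of the repo)
apiVersion: batch/v1
kind: Job
metadata:
  name: job3-never
spec:
  backoffLimit: 2            # same retry budget as job3
  template:
    spec:
      containers:
        - name: job
          image: busybox
          args:
            - /bin/sh
            - -c
            - date; echo sleeping....; sleep 5s; exit 1;
      restartPolicy: Never   # a failed container is not restarted in place
```

If you watch the Pods with kubectl get pod -l=job-name=job3-never -w, you should see several Pods with different name suffixes, each with a RESTARTS count of 0, instead of a single Pod whose RESTARTS count keeps increasing.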

If you check the Job status...

kubectl get job/job3 -o yaml

... you will see that it has Failed due to BackoffLimitExceeded

status:
  conditions:
  - lastProbeTime: "2020-01-09T11:16:24Z"
    lastTransitionTime: "2020-01-09T11:16:24Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed

A restartPolicy of Never means that a failed container is not restarted in place: the Pod is left in a failed state and the Job controller creates a new Pod for each retry, up to backoffLimit. Also, the default value of backoffLimit is 6.

To delete this job, just run kubectl delete job/job3
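The two limits covered so far, activeDeadlineSeconds and backoffLimit, can be set on the same Job. The manifest below is a sketch under assumed values (the name job-limits and the numbers 30 and 4 are illustrative, not from the repo). Per the Kubernetes documentation, activeDeadlineSeconds takes precedence: once the deadline is reached, no further retries are started even if backoffLimit has not been exhausted.

```yaml
# Hypothetical Job combining both limits (name and values are assumptions)
apiVersion: batch/v1
kind: Job
metadata:
  name: job-limits
spec:
  activeDeadlineSeconds: 30  # hard cap on total run time, retries included
  backoffLimit: 4            # at most 4 retries, if the deadline allows them
  template:
    spec:
      containers:
        - name: job
          image: busybox
          args:
            - /bin/sh
            - -c
            - date; echo sleeping....; sleep 5s; exit 1;
      restartPolicy: Never
```

Whichever limit is hit first ends the Job, and the reason under status.conditions (DeadlineExceeded or BackoffLimitExceeded) tells you which one it was.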

More is better!

Sometimes you might want the Job to spin up more than one Pod to get things done. For example, consider a batch job that processes records from a database - having multiple Pods share the load can definitely help.

One way of doing this is to have the Pods run sequentially: each Pod records the number of rows it has processed in an external store (e.g. another DB table), and the next Pod picks up from there. This can be done by adding the completions property to the Job spec.

apiVersion: batch/v1
kind: Job
metadata:
  name: job4
spec:
  completions: 2
  template:
    spec:
      containers:
        - name: job
          image: busybox
          args:
            - /bin/sh
            - -c
            - date; echo sleeping....; sleep 10s; echo exiting...; date
      restartPolicy: Never

Create the Job and keep an eye on how it progresses

kubectl apply -f https://raw.githubusercontent.com/abhirockzz/kubernetes-in-a-nutshell/master/jobs/job4.yaml

kubectl get job/job4 -w

You should see something similar to this:

NAME   COMPLETIONS   DURATION   AGE
job4   0/2           3s         3s
job4   1/2           20s        20s
job4   2/2           37s        37s

Since we had set completions to two:

two Pods were instantiated one after the other (sequentially)
Job was marked Completed (successful) only after both Pods ran to completion. Else, the failure conditions would have applied (as discussed above)

Let's check the Pod logs as well

kubectl get pods -l=job-name=job4
kubectl logs <pod_name>

If you see the logs for both the Pods, you will be able to confirm that they started one after the other in a sequence (and each ran for ~10s)

Logs for Pod 1

Thu Jan  9 11:31:57 UTC 2020
sleeping....
exiting...
Thu Jan  9 11:32:07 UTC 2020

Logs for Pod 2

Thu Jan  9 11:32:15 UTC 2020
sleeping....
exiting...
Thu Jan  9 11:32:25 UTC 2020

How about running the batch processing in a parallel fashion where all the Pods are instantiated at once (instead of sequentially)? To handle this case, our processing logic needs to be tuned accordingly since there is co-ordination required amongst the parallel Pods in terms of which set of work items to pick and how to update their completion status. We will not dive into that, but I hope you get the idea in terms of the requirement.

Now, this can be achieved by using parallelism along with completions. Here is an example:

apiVersion: batch/v1
kind: Job
metadata:
  name: job5
spec:
  completions: 3
  parallelism: 3
  template:
    spec:
      containers:
        - name: job
          image: busybox
          args:
            - /bin/sh
            - -c
            - date; echo sleeping....; sleep 10s; echo exiting...; date
      restartPolicy: Never

The parallelism attribute puts a cap on the maximum number of Pods which can run at a time. In this case, since parallelism is set to three, it implies that:

three Pods will be instantiated all at once
Job will be marked Completed (successful) only if all three run to completion. Else, the failure conditions apply (as discussed above)
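The capping effect of parallelism is easier to see when it is smaller than completions. The following is a sketch with assumed values (the name job-batches and the numbers 6 and 2 are illustrative, not from the repo): six successful Pods are required in total, but no more than two should run at any given moment.

```yaml
# Hypothetical Job where parallelism is lower than completions (assumed name and values)
apiVersion: batch/v1
kind: Job
metadata:
  name: job-batches
spec:
  completions: 6   # the Job succeeds after six Pods have completed
  parallelism: 2   # at most two Pods run at the same time
  template:
    spec:
      containers:
        - name: job
          image: busybox
          args:
            - /bin/sh
            - -c
            - date; echo sleeping....; sleep 10s; echo exiting...; date
      restartPolicy: Never
```

Watching it with kubectl get job/job-batches -w should show COMPLETIONS climbing towards 6/6 roughly two at a time.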

Once you're done...

... you can use ttlSecondsAfterFinished to specify the number of seconds after which the Job can be automatically deleted once it is finished (either Completed or Failed). This also removes dependent entities such as Pods spawned by the Job.
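Here is a minimal sketch of how this could look; the name job-ttl and the value of 60 seconds are assumptions for illustration. Note that the TTL mechanism was an alpha feature (behind the TTLAfterFinished feature gate) in the Kubernetes releases current when this post was written and became stable in v1.23, so whether it works out of the box depends on your cluster version.

```yaml
# Hypothetical Job that is cleaned up automatically (assumed name and TTL value)
apiVersion: batch/v1
kind: Job
metadata:
  name: job-ttl
spec:
  ttlSecondsAfterFinished: 60  # delete the Job and its Pods ~60s after it finishes
  template:
    spec:
      containers:
        - name: job
          image: busybox
          args:
            - /bin/sh
            - -c
            - date; echo sleeping....; sleep 10s; echo exiting...; date
      restartPolicy: Never
```

About a minute after the Job finishes, kubectl get job/job-ttl should report that the Job is no longer found.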

CronJob

A CronJob object allows you to schedule Job execution rather than starting them manually. It uses the Cron format to run a job as scheduled. Basically, the CronJob is a higher-level abstraction that embeds within itself a Job template (as seen above) along with a schedule (cron format) and other attributes.

Let's create a simple CronJob that repeats every minute:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cronjob1
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cronjob
              image: busybox
              args:
                - /bin/sh
                - -c
                - date; echo sleeping....; sleep 5s; echo exiting...;
          restartPolicy: Never

The jobTemplate section is the same as that of a Job. It's simply embedded within this CronJob spec - it's the same container which we were using for the Job example.
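A note on API versions: CronJob graduated to batch/v1 in Kubernetes 1.21, and batch/v1beta1 is no longer served from 1.25 onwards. On a newer cluster, the manifest above would need its apiVersion changed, roughly as in the sketch below (everything else is unchanged from cronjob1).

```yaml
# Same CronJob as above, adjusted for clusters running Kubernetes 1.21 or later
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cronjob1
spec:
  schedule: "*/1 * * * *"   # cron fields: minute hour day-of-month month day-of-week
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cronjob
              image: busybox
              args:
                - /bin/sh
                - -c
                - date; echo sleeping....; sleep 5s; echo exiting...;
          restartPolicy: Never
```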

Create the CronJob and check it:

kubectl apply -f https://raw.githubusercontent.com/abhirockzz/kubernetes-in-a-nutshell/master/jobs/cronjob1.yaml

kubectl get cronjob/cronjob1

The output:

NAME       SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob1   */1 * * * *   False     0        <none>          4s

Keep track of the Jobs which this CronJob spawns

kubectl get job -w

NAME                  COMPLETIONS   DURATION   AGE
cronjob1-1578572340   0/1           2s         2s
cronjob1-1578572340   1/1           11s        11s
cronjob1-1578572400   0/1                      0s
cronjob1-1578572400   0/1           0s         0s
cronjob1-1578572400   1/1           10s        10s
cronjob1-1578572460   0/1                      0s
cronjob1-1578572460   0/1           0s         0s
cronjob1-1578572460   1/1           11s        11s

A new Job is created every minute and each one runs for ~10s, as expected. You can also check the logs of the individual Pod which each Job created (just like you did with the previous examples)

kubectl get pod -l=job-name=<job_name>
kubectl logs <pod_name>

There are other (optional) CronJob properties in addition to the schedule attribute. Let's look at one of these

concurrencyPolicy

It has three possible values - Forbid, Allow and Replace. Choose Forbid if you don't want concurrent executions of your Job: when it's time to trigger a Job as per the schedule and a Job instance is already running, the new run is skipped. If you choose Replace as the concurrency policy, the currently running Job will be stopped and a new Job will be spawned. Specifying Allow (the default) will let multiple Job instances run concurrently.

Here is an example:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cronjob2
spec:
  schedule: "*/1 * * * *"
  concurrencyPolicy: Allow
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cronjob
              image: busybox
              args:
                - /bin/sh
                - -c
                - date; echo sleeping....; sleep 90s; echo exiting...;
          restartPolicy: Never

You can create this CronJob and then track the individual Jobs to observe the behavior.

kubectl apply -f https://raw.githubusercontent.com/abhirockzz/kubernetes-in-a-nutshell/master/jobs/cronjob2.yaml

kubectl get job -w

Since the schedule is every minute and the container runs for 90 seconds, you will see multiple Jobs running at the same time. This overlap is possible since we have applied concurrencyPolicy: Allow

You might see something like this:

cronjob2-1578573480   0/1                      0s
cronjob2-1578573480   0/1           0s         0s
cronjob2-1578573540   0/1                      0s
cronjob2-1578573540   0/1           0s         0s
cronjob2-1578573480   1/1           95s        95s

Notice that job cronjob2-1578573540 was triggered before cronjob2-1578573480 could finish
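For contrast, here is a sketch of the same CronJob with concurrencyPolicy: Forbid. The name cronjob2-forbid is made up for illustration and this file is not in the repo. Since each run takes about 90 seconds and the schedule fires every minute, the run that falls due while the previous Job is still active should be skipped, so Jobs would be expected to start roughly every two minutes with no overlap.

```yaml
# Hypothetical variant of cronjob2 (assumed name); use apiVersion batch/v1 on Kubernetes 1.21+
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cronjob2-forbid
spec:
  schedule: "*/1 * * * *"
  concurrencyPolicy: Forbid   # skip a scheduled run if the previous Job is still running
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cronjob
              image: busybox
              args:
                - /bin/sh
                - -c
                - date; echo sleeping....; sleep 90s; echo exiting...;
          restartPolicy: Never
```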

The other properties of a CronJob are:

Job History: successfulJobsHistoryLimit and failedJobsHistoryLimit can be used to specify how much history you want to retain for failed and completed Jobs
Start deadline specified by startingDeadlineSeconds
Suspend specified by suspend
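The properties listed above could be combined as in the following sketch. The name cronjob3 and all of the values are assumptions chosen for illustration; the comments note the defaults that apply when a field is omitted.

```yaml
# Hypothetical CronJob using the optional properties (assumed name and values)
# Use apiVersion batch/v1 on Kubernetes 1.21+
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cronjob3
spec:
  schedule: "*/1 * * * *"
  successfulJobsHistoryLimit: 2  # keep the two most recent successful Jobs (default is 3)
  failedJobsHistoryLimit: 1      # keep the most recent failed Job (default is 1)
  startingDeadlineSeconds: 30    # a run that cannot start within 30s of its scheduled time is counted as missed
  suspend: false                 # set to true to pause future runs without deleting the CronJob
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cronjob
              image: busybox
              args:
                - /bin/sh
                - -c
                - date; echo sleeping....; sleep 5s; echo exiting...;
          restartPolicy: Never
```

Setting suspend to true only affects runs that have not started yet; Jobs that are already running are left alone.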

That's it for this part of the "Kubernetes in a Nutshell" series. Stay tuned for more 😀 I really hope you enjoyed and learned something from this article 🙌 Please like and follow if you did. Happy to get your feedback via Twitter or just drop a comment 🙏🏻
