Odin is a programmble, observable and distributed job orchestration system which allows for the scheduling, management and unattended background execution of individual user created tasks on Linux based systems.
The primary objective of such a system is to provide users/teams a shared platform for jobs that allows individual members to package their code for periodic execution, providing a set of metrics and variable insights which will in turn lend transparency and understanding into the execution of all system run jobs. Odin aims to do this by changing the way in which we approach scheduling and managing jobs.
Job schedulers by definition are supposed to eliminate toil, a kind of work tied to running a service which is manual, repetitive and most importantly - automatable. Classically, job schedulers are ad-hoc systems that treat it’s jobs as code to be executed, specifying the parameters of what is to be executed, and when it is to be executed. This presents a problem for those implementing the best practices of DevOps. DevOps is something to be practiced in the tools a team uses, and when traditional job schedulers fail they introduce a new level of toil in debugging what went wrong.
Odin treats it’s jobs as code to be managed before and after execution. While caring about what is to be executed and when it will be executed, Odin is equally concerned with the expected behavior of your job, which is to be described entirely by the user’s code. This observability can be achieved through a web facing user interface which displays job logs and metrics. All of this will be gathered through the use of Odin libraries (written in Go, Python and Node.js) and will help infer the internal state of jobs. For teams, this means Odin can directly help diagnose where the problems are and get to the root cause of any interruptions. Debugging, but faster!
If you'd like to check out a brief demo of the Odin system, I recorded a short five minute video as it was a criteria of submitting Odin as my final year Project to Dublin City University.
You can check out that demo here.
You can check out the source code for Odin out here on GitHub.
There is also a
final-year-project branch which you may view. This branch contains all materials I submitted alongside Odin as my final year project in Dublin City University.
Let's get down to brass tacks. The development cycle for the project was pretty standard, so in this section I'm going to touch on some of the research conducted going into the project, some interesting issues/challenges posed during development and finally I take a look at the ultimate design and implementation of the Odin Job Scheduler.
When you're aiming to build a distributed scheduling tool for teams practicing DevOps, it's only right you consult this chapter in Google SRE Handbook.
Obviously SRE and DevOps are not one and the same, but given the close association of the two, we deemed the consultation of this material entirely relevant. The chapter linked above specifically details two approaches to scheduling managers, those that store state and those that use remote data stores to hold information. The chapter details the advantages and disadvantages of both, with specific emphasis on the dangers of relying on a single source dependency such as a remote data store. From our reading of this chapter we set sights to ensure that Odin, which would utilize a remote data store, would be able to manage its own execution state in the event the data store went offline.
When considering the role of observability in the Odin ecosystem, I felt the following two resources to be integral to building an understanding:
The former of these writings, by observability thought leader Charity Majors, focused on the cultural aspect of observability within teams, and this writing really focused on how observability is a prerequisite for any major effort to have saner systems. Observability allows built-in transparency to the user in the instance of Odin, so these writings allowed us to broaden our scope in regards to what observability in a system could it achieve.
The latter of these writings really gives a direct insight into how observability systems work, with direct data collectors/loggers which run information through concurrent pipelines into the data stores. This confirmed that the correct approach in designing our system was for the language specific software development kits to extract live information about jobs, and then plugging this into a back-end of a visualization unit.
Finally we consulted the following two academic papers in relation to the algorithmic side of job scheduling, along with performance techniques to enhance the rate of execution:
Generally I consulted with much, much more material outside of the linked appendices. These are just the ones which I felt helped me the most during the research stage.
Every project has it's hurdles to overcome. Some big, some small. The following three scenarios are problems I confronted during the development stage.
Originally the jobs queue was a simple array of nodes, with each node representing a job. One attribute of each node was that node’s time until execution. When the time left was found to be zero, the job would be sent to the executor and the ticker would check the next job in the queue.
The issue with this design was that at any one time, there could be several nodes that had “zero” seconds until execution. This would mean that by the time one job had been sent to the executor, the next job had been rescheduled as it had strayed into a negative range.
The diagram below demonstrates the example of such a queue:
Job 4, Job 2 and Job 1 would all be set to execute at the same time, but only Job 1 would execute successfully, as it was the first job added of these three. The same issue will rise its head again when both Job 3 and Job 5 come to a Seconds Left value of 0 also.
The solution to this problem was simple - groups. Instead of one job per node, we instead tried one Seconds Left value per node. This means that nodes in the queue were not representative of just one job, but rather of all jobs which were set to execute at the same time. This approach meant that this example of an old Odin Priority Queue:
Would be transformed into the new queue design:
This now means that the problem of skipping over nodes which are ready to execute is an issue of the past, as execution is now run as a batch operation across the jobs array of the node at the head of the queue.
During design, the process of sorting of Priority Queue was still quite slow. This needs to be a fast operation because the queue can only be grouped once it has been sorted. The grouping/mapping of jobs per node in the queue is a more computationally complex operation by default, but we also aimed to improve the speed of execution for this operation too.
While using a custom implementation of quicksort, we knew we could leverage Go’s built-in Goroutines to speed up our sorting algorithm. Goroutines are thin lightweight threads in the Go programming language which allow for the concurrent running of functions.
Our solution to the sorting problem can be seen in the code below:
Scaling Odin out to be a distributed system was a significant challenge set out in the Odin Functional Spec. Distributing Odin would offer increased reliability to the system in case of a failure.
Transforming Odin into a distributed system required a significant amount of terraforming. We utilized the Hashicorp implementation of the Raft consensus protocol. The Raft protocol requires the election of a Leader node from the pool of candidates. A finite state machine was designed to allow all started nodes to be added as voters in this election and in turn cast their votes to reach consensus.
New API endpoints for
leave were created so expanding the cluster was handled and nodes failing was handled. This did not prove to be a significant challenge but was necessary in managing the Raft Cluster also.
Another significant hurdle was deciding the way in which jobs were distributed to worker nodes. In designing a method of distributing work, we also needed to take into account how the work is distributed in the event of a node failing. In an example of 12 jobs and 4 nodes, ideally each node should execute 3 jobs each. If a node fails, then each node should take on some extra responsibility. These 3 remaining nodes will then share the 12 jobs by executing 4 jobs each.
Thus we reached a consensus of our own in how work was shared between nodes:
- Each node is assigned an integer, in the example of 4 nodes above this would be an integer between 0 and 3.
- Each job is then assigned an integer, so in the case of the twelve jobs an integer between 0 and 11.
- We then get the modulus of the current job integer (0-11) and the number of nodes in the cluster (currently 4). We call this mod.
- We then compare the node integer (0-3) to mod and if they are the same then that node is selected to execute the current job.
In pseudocode, the operation looks like this:
The diagram below represents Odin as it currently operates, making reference to the various components of the system architecture. This diagram serves as a demonstration of how these components link up to facilitate the flow of data. This architecture diagram goes to great lengths to show the internals of each component in a more structured way, something which early design diagrams did not include.
Beginning at the Odin CLI, we can see how each of the three aforementioned categories will link to three different places, with the job commands category linking to the jobs endpoint of the API, and the Join Cluster category linking to the join endpoint of the API. Finally we see how the recover and log commands in the /etc/odin/ category link to their respective sections on the filesystem.
We also see how the jobs endpoint itself is integral to the Odin Engine component. It interacts directly with the MongoDB instance, the Ticker, and the execute and links endpoints. The latter of these, the links endpoint, is something in the current design which was not present in the initial system design. The concept of linking jobs together was discussed in our Function Specification document but never truly explored. We can see this concept has been manifested through the presence in the Odin Engine through the links endpoint.
The join endpoint simply points back to the Odin Engine component itself as a way to signify the replication of the service in a Raft cluster. The Odin CLI directly links to this endpoint to facilitate the ease of expanding this cluster.
The user code which is actually executed is always stored in /etc/odin/jobs under a directory of the job’s ID value. This was a design detail which was never explored in the original design - how orchestration would be achieved. Making a copy of the files used to run a job ensures redundancy, as users may accidentally remove files scattered across their home directory.
We can see how the Odin SDK (of any language) is used to log values and report them to the MongoDB instance. These values can be extracted by the Odin Dashboard back-end server sent to be displayed on an Angular front-end.
In all, Odin solves the problem of periodic job scheduling in a way not currently available to the open source community. This is in specific reference to the Odin Software Development Kits, which centralize all job information/metrics. We feel that the observability features are a significant achievement as it reduces any associated toil with debugging a job which has broken, helping infer the internal state of jobs and any given time.