
Ruikai Li


Learning Workflow Schedulers (Oozie)

Workflow schedulers are widely used in data engineering projects, and the right choice of scheduler differs from case to case. This article gives a brief introduction to one of the most popular workflow schedulers, Oozie, and shows some basic examples of how to use it.

Oozie

Apache Oozie is a scheduler system which is designed to manage Apache Hadoop jobs, including Hive, Pig and Sqoop. It can also manage other types of jobs like Spark, Java and shell.
Apache Oozie Structure

The Oozie client provides the oozie CLI and a Java API for manipulating workflows. The Oozie web app runs in a servlet container and gives users a visual interface for monitoring and managing workflows. The actual jobs are launched and run in the Hadoop cluster.

There are three concepts in Apache Oozie: Workflow, Coordinator and Bundle. The relationship between them is shown below.
Bundle, Coordinator, Workflow

Workflow

A Workflow in Oozie is a sequence of actions arranged in a control-dependency DAG (Directed Acyclic Graph). An Oozie workflow is defined in workflow.xml and can be parameterized from job.properties. Please note that workflow.xml must be located in HDFS, while the properties file stays on the local file system of the edge node (not in HDFS).

job.properties

nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
oozie.wf.application.path=hdfs://namenodepath/workflow.xml

workflow.xml

<workflow-app xmlns = "uri:oozie:workflow:0.4" name = "my-first-workflow">
   <start to = "job-1" />

   <!-- Step 1 -->

   <action name = "job-1">
      <hive xmlns = "uri:oozie:hive-action:0.4">
         <job-tracker>${jobTracker}</job-tracker>
         <name-node>${nameNode}</name-node>
         <script>script_path/script1.hive</script>
      </hive>

      <ok to = "job-2" />
      <error to = "kill_job" />
   </action>

   <!-- Step 2 -->

   <action name = "job-2">
      <hive xmlns = "uri:oozie:hive-action:0.4">
         <job-tracker>${jobTracker}</job-tracker>
         <name-node>${nameNode}</name-node>
         <script>script_path/script3.hive</script>
         <param>database</param>
      </hive>

      <ok to = "end" />
      <error to = "kill_job" />
   </action>

   <kill name = "kill_job">
      <message>Job failed</message>
   </kill>

   <end name = "end" />

</workflow-app>

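The workflow can be submitted to Oozie with the oozie command-line tool. After executing this command, the jobs in the workflow above run step by step. The command assumes that the Oozie server URL is available through the OOZIE_URL environment variable; otherwise pass it with the -oozie option.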

oozie job -run -config example/job.properties

Oozie workflows also provide fork and join control nodes for running multiple jobs in parallel. Here is an example.

<workflow-app xmlns = "uri:oozie:workflow:0.4" name = "fork-workflow">
   <start to = "fork_node" />

   <fork name = "fork_node">
      <path start = "job-1"/>
      <path start = "job-2"/>
   </fork>

   <action name = "job-1">
      <hive xmlns = "uri:oozie:hive-action:0.4">
         <job-tracker>${jobTracker}</job-tracker>
         <name-node>${nameNode}</name-node>
         <script>script_path/script1.hive</script>
      </hive>

      <ok to = "join_node" />
      <error to = "kill_job" />
   </action>

   <action name = "job-2">
      <hive xmlns = "uri:oozie:hive-action:0.4">
         <job-tracker>${jobTracker}</job-tracker>
         <name-node>${nameNode}</name-node>
         <script>script_path/script2.hive</script>
      </hive>

      <ok to = "join_node" />
      <error to = "kill_job" />
   </action>

   <join name = "join_node" to = "job-3"/>

   <action name = "job-3">
      <hive xmlns = "uri:oozie:hive-action:0.4">
         <job-tracker>${jobTracker}</job-tracker>
         <name-node>${nameNode}</name-node>
         <script>script_path/script3.hive</script>
         <param>database</param>
      </hive>

      <ok to = "end" />
      <error to = "kill_job" />
   </action>

   <kill name = "kill_job">
      <message>Job failed</message>
   </kill>

   <end name = "end" />

</workflow-app>
Coordinator

The Coordinator in Oozie allows the user to schedule workflows. Workflows can be scheduled according to time, data or event predicates. For example, a Coordinator can set a start datetime and an end datetime, which determine when runs of the workflow begin and stop being triggered.
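
As a rough illustration of a data predicate, the sketch below shows a coordinator that waits for a daily dataset instance to appear in HDFS before it starts the workflow. The coordinator and dataset names, the HDFS directory layout and the dates are placeholders assumed for this example, and the frequency is written with the coord:days EL function rather than a cron expression.

```xml
<coordinator-app xmlns="uri:oozie:coordinator:0.2" name="data-triggered-coordinator"
   frequency="${coord:days(1)}" start="2022-01-01T02:00Z" end="2023-12-31T00:00Z"
   timezone="Australia/Sydney">

   <!-- Hypothetical dataset: one directory per day, complete once _SUCCESS exists -->
   <datasets>
      <dataset name="daily_input" frequency="${coord:days(1)}"
         initial-instance="2022-01-01T02:00Z" timezone="Australia/Sydney">
         <uri-template>${nameNode}/data/input/${YEAR}/${MONTH}/${DAY}</uri-template>
         <done-flag>_SUCCESS</done-flag>
      </dataset>
   </datasets>

   <!-- The action waits until the current dataset instance is available -->
   <input-events>
      <data-in name="input" dataset="daily_input">
         <instance>${coord:current(0)}</instance>
      </data-in>
   </input-events>

   <action>
      <workflow>
         <app-path>workflow_path/workflow.xml</app-path>
      </workflow>
   </action>

</coordinator-app>
```

Here ${nameNode} is expected to come from job.properties, in the same way as in the workflow examples.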

Here are the definitions of some core attributes that should be set in a Coordinator:

  • start: The start datetime of the coordination

  • end: The end datetime of the coordination

  • timezone: The timezone of the coordinator application

  • frequency: A five-field cron expression that gives how often the workflow is executed during the coordination

  • timeout: The maximum time, in minutes, that a materialized action waits for its additional conditions to be satisfied before being discarded. A value of '0' means that all conditions must be satisfied at the time of materialization, otherwise the action is discarded. A value of '-1' means that there is no timeout, and the materialized action waits indefinitely until the conditions are satisfied.

  • concurrency: The maximum number of actions that can run at the same time. This allows multiple instances of the coordinator application to be submitted and run at the same time.

  • throttle: The maximum number of materialized actions that can be waiting at the same time during the coordination.

  • execution: The order in which actions are executed when several of them have satisfied their execution conditions: FIFO (oldest first, the default), LIFO (newest first) or LAST_ONLY (only the latest one runs).
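
To make the list above more concrete: start, end, timezone and frequency are attributes of the coordinator-app element, while the remaining four settings go into its controls element. The fragment below is a minimal sketch of such a controls element; the values are arbitrary and chosen only for illustration.

```xml
<controls>
   <!-- Wait at most 60 minutes for the conditions to be satisfied -->
   <timeout>60</timeout>
   <!-- Run at most 2 actions at the same time -->
   <concurrency>2</concurrency>
   <!-- Run the oldest ready action first -->
   <execution>FIFO</execution>
   <!-- Keep at most 5 actions in the waiting state -->
   <throttle>5</throttle>
</controls>
```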

Here is a simple Coordinator for a Workflow:

workflow.xml

<workflow-app xmlns = "uri:oozie:workflow:0.4" name = "workflow-coordinated">
   <start to = "job-1" />

   <action name = "job-1">
      <hive xmlns = "uri:oozie:hive-action:0.4">
         <job-tracker>${jobTracker}</job-tracker>
         <name-node>${nameNode}</name-node>
         <script>${script_name}</script>
         <param>${database}</param>
      </hive>
      <ok to = "end" />
      <error to = "kill_job" />
   </action>

   <kill name = "kill_job">
      <message>Job failed</message>
   </kill>
   <end name = "end" />

</workflow-app>

coordinator.xml

<coordinator-app xmlns = "uri:oozie:coordinator:0.2" name =
   "coordinator_for_workflow" frequency = "1 * * * *" start =
   "2022-01-01T02:00Z" end = "2023-12-31T00:00Z" timezone = "Australia/Sydney">

   <controls>
      <timeout>-1</timeout>
      <concurrency>1</concurrency>
      <execution>FIFO</execution>
   </controls>

   <action>
      <workflow>
         <app-path>workflow_path/workflow.xml</app-path>
      </workflow>
   </action>

</coordinator-app>
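
The workflow.xml above refers to ${script_name} and ${database}, so these values have to be supplied from somewhere. One option, sketched below, is to pass them from the coordinator through a configuration element inside the workflow element. The script path and the example_db value are hypothetical, and the NAME=VALUE form of the parameter is an assumption about how the Hive script reads it.

```xml
<action>
   <workflow>
      <app-path>workflow_path/workflow.xml</app-path>
      <configuration>
         <!-- Resolves ${script_name} in workflow.xml; the path is a placeholder -->
         <property>
            <name>script_name</name>
            <value>script_path/script1.hive</value>
         </property>
         <!-- Resolves ${database} in workflow.xml; example_db is a hypothetical value -->
         <property>
            <name>database</name>
            <value>database=example_db</value>
         </property>
      </configuration>
   </workflow>
</action>
```

The same properties could instead be defined in job.properties; keeping them in the coordinator makes the schedule and its parameters visible in one file.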
Bundle

The Bundle in Oozie sits one level above the Coordinator. It allows the user to define and execute several coordinators together. The kick-off-time in a bundle is the time at which the bundle should submit its coordinator apps. Here is a Bundle example.

bundle.xml

<bundle-app xmlns = 'uri:oozie:bundle:0.1' 
   name = 'bundle_of_coordinators'>

   <controls>
      <kick-off-time>${kickOffTime}</kick-off-time>
   </controls>

   <coordinator name = 'coordinator_for_workflow' >
      <app-path>coordinator_path/coordinator.xml</app-path>
      <configuration>
         <property>
            <name>start_time</name>
            <value>time</value>
         </property>
      </configuration>
   </coordinator>

</bundle-app>
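
Since the point of a bundle is to group several coordinators, a bundle with more than one entry might look like the sketch below. The bundle name, the coordinator names and the two paths are hypothetical, and the optional configuration element is left out for brevity.

```xml
<bundle-app xmlns='uri:oozie:bundle:0.1' name='bundle_of_two_coordinators'>

   <controls>
      <kick-off-time>${kickOffTime}</kick-off-time>
   </controls>

   <!-- Each coordinator entry points to its own coordinator.xml -->
   <coordinator name='coordinator_a'>
      <app-path>coordinator_path_a/coordinator.xml</app-path>
   </coordinator>

   <coordinator name='coordinator_b'>
      <app-path>coordinator_path_b/coordinator.xml</app-path>
   </coordinator>

</bundle-app>
```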
Use Oozie on Cloudera Data Platform (CDP)

Cloudera Data Platform is a highly integrated data platform for data management and data analytics. Oozie is also available through Cloudera Hue, which is integrated in CDP as well. We can upload our job scripts to an HDFS path and attach them to actions. The whole process is codeless: there is no need to write an XML file to arrange the workflow, and the only thing we need to do is drag the actions together and link them as a DAG. After that, we run the jobs on CDP and their status is shown on the Oozie Dashboard.


The Little Tail

This is just my first learning footprint with workflow schedulers. Hope it helps you a little too XD. Cheers!
