WanjohiChristopher

Automating Talend Jobs Using Apache Airflow

Table of Contents

  1. Introduction to Talend and Apache Airflow
  2. Prerequisites for setting up Talend and creating an ETL pipeline
  3. Exporting a Talend job as a bash script for automation in Airflow
  4. Apache Airflow introduction: DAGs, tasks, and operators
  5. Scheduling ETL jobs
  6. Conclusion

Introduction

Welcome to our new article on task automation.
Talend is an open-source data integration platform that connects data from virtually any source and runs ETL or ELT pipelines into destinations such as data warehouses. Talend provides a wide range of capabilities, including data integration, data quality, data management, and data governance.

Apache Airflow, on the other hand, is an open-source platform developed by Airbnb for authoring, scheduling, and monitoring data workflows. Airflow lets us automate tasks easily by defining complex workflows as DAGs (Directed Acyclic Graphs) with a defined schedule, so we do not have to trigger them manually as we do in Talend.

Prerequisites for Setting up Talend and Creating an ETL Pipeline:

To follow along easily and see how we migrate data from databases to a data warehouse, it is important to check out Part 1 on setting up Talend Open Studio here. It covers the full ETL we built and sets you up with everything needed for this project.

Without getting into the nitty-gritty of the job itself, we will start by showing you how to export your ETL workflow as a standalone executable bash script.

Steps for Exporting Jobs in Talend:

  • In Talend Studio, right-click on the job and navigate to “Export Items”.
  • Select “Archive file”, then check the “Export dependencies” box.
  • Click “Finish” as shown below, then navigate to the Talend workspace to locate the file.

(Screenshot: exporting the job from Talend Studio)
We now locate the exported files in the folder; we will only use the bash file, the one with the “.sh” extension. See the snapshot below.

(Screenshot: the exported Talend job files, including the “.sh” script)

Our Talend part is now complete. Next, we move the bash file into the Airflow folder.

Airflow Introduction

Without getting into the nitty-gritty of Airflow itself, let's get our hands dirty.
To set up Apache Airflow, understand it in more depth, and work on big data projects, use this link, which walks you through the whole setup.

A DAG is a workflow representation where tasks are represented as nodes and directed edges define the dependencies between tasks. Each node corresponds to a specific task and may be associated with operators that dictate how the task is executed. The DAG helps visualize the task relationships and their execution order during workflow design. During execution, the workflow management system interprets the DAG and automatically orchestrates the tasks based on their dependencies, ensuring they run in the correct order. DAGs are widely used for workflow automation and task scheduling in Python.
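As a generic illustration (not taken from this article's pipeline, and assuming Airflow 2.x import paths), here is how two task nodes and the directed edge between them are expressed in a DAG file:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A toy DAG with two task nodes and one directed edge:
# "extract" must finish successfully before "load" starts.
with DAG(
    dag_id="toy_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # only runs when triggered manually
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> load  # the directed edge: load depends on extract
```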

Now let's dive into the scripts that automate our tasks. In our case, we will be using the BashOperator, since our exported file is a “.sh” bash script.

In your dags folder, create a Python file named ‘datamigrate.py’.
Import the following libraries, as we’ll use them throughout.

(Screenshot: the import statements in datamigrate.py)
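As a hedged sketch of what those imports typically look like (assuming Airflow 2.x, where BashOperator lives in airflow.operators.bash):

```python
# datamigrate.py -- imports (a sketch; the original screenshot may differ slightly)
from datetime import datetime

from airflow import DAG                          # the DAG object itself
from airflow.operators.bash import BashOperator  # runs a bash command or script
```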

(Screenshot: the DAG definition)

dag_id: the name describing what the DAG does.
schedule_interval: in this case, we schedule the migration to run every Wednesday at 8 PM.
start_date: set to 26th July 2023.
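Putting those three arguments together, a hedged sketch of the DAG definition looks like this; the dag_id and the catchup flag are illustrative choices, not taken from the original screenshot:

```python
# Continuing datamigrate.py: the DAG definition (dag_id is illustrative).
dag = DAG(
    dag_id="talend_data_migration",
    schedule_interval="0 20 * * 3",    # cron expression: every Wednesday at 8 PM
    start_date=datetime(2023, 7, 26),
    catchup=False,                     # assumption: skip backfilling runs before today
)
```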

We now need to define a task to execute our script.

(Screenshot: the BashOperator task)

BashOperator: the Airflow operator we use because our job is a bash file.
task_id: since a DAG can define multiple tasks, each needs a unique ID so that we can refer to it during execution.
bash_command: the absolute path to the bash file.
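A hedged sketch of the task definition follows; the task_id and the script path are placeholders, so point bash_command at wherever you copied the exported “.sh” file:

```python
# Continuing datamigrate.py: the task that runs the exported Talend script.
run_talend_job = BashOperator(
    task_id="run_talend_job",
    # Placeholder path -- replace with the absolute path to your exported script.
    # The trailing space stops Airflow from treating the ".sh" path as a Jinja template file.
    bash_command="/opt/airflow/scripts/datamigrate_job.sh ",
    dag=dag,
)
```

With a single task there are no dependencies to wire up; if you later add more tasks, chain them with the >> operator as shown earlier.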

Once we run our DAG, the data is ingested into the target Postgres data warehouse. For running Airflow itself, kindly use this article, as it explains the process in detail.

We have successfully automated our tasks! You can also apply this approach to automate work in other domains.

Conclusion

In conclusion, we have gone through how to use Talend as our ETL tool to extract data from Microsoft SQL Server and load it into a Postgres data warehouse. We then automated the pipeline with Airflow so that we don't have to trigger it manually, while still keeping auditing and logging in mind.
In a nutshell, you are now able to automate manual tasks with Python and Airflow.

Happy learning!
