The story behind Apache SeaTunnel's evolution from a data integration component to an enterprise-level service

At the Apache SeaTunnel (Incubating) Meetup in July, we shared how Apache SeaTunnel (Incubating) is evolving from a data integration component into an enterprise-level service and data integration platform. We hope you will get a lot out of this talk.

Summary of the talk:

  • The original intention and value of servitization
  • The overall architecture of the service
  • The current progress of the Community
  • Roadmap

Why do we need servitization?

A web service for SeaTunnel has been long expected by the community since 2019, and the discussion continued up to the weekly meeting in May this year. Some community members said they would like to contribute their work, but for various reasons no one followed up on it.

At a previous Meetup, I saw some developers share a visual data integration service based on SeaTunnel. As an open-source enthusiast who has long worked on data middleware, I think servitization is an essential part of SeaTunnel, so I decided to make the community's wish come true and started working on it.

What are the core goals?


Script management
One of the core goals of this work is to let users configure task information as parameters through the WebUI: input/output data sources, the configuration of the various transforms, env parameters, and so on. In short, we want users to express their business needs through configuration rather than scripting as much as possible.

The company I previously worked for had no data platform, so we used DataX as the data collection component and Azkaban as the scheduling component, with Git for code management. A skilled person could configure a data integration task in about 7 steps: edit, commit, push, pack, upload, page operation, and data verification. It took at least 30-60 minutes to develop a task, and you had to make sure you were not interrupted and that nothing went wrong along the way. Later, after we built our own data platform, it took only a minute to configure a data exchange task.
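To make "script" concrete, here is a sketch of what such a task looks like in SeaTunnel's config-file format. The connector names and option keys below are illustrative and vary by connector and version, so treat this as the shape of a script, not a reference:

```hocon
env {
  # engine-level settings for the whole task
  execution.parallelism = 2
}

source {
  # illustrative JDBC-style input; exact option keys depend on the connector
  Jdbc {
    url = "jdbc:mysql://localhost:3306/shop"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "root"
    password = "***"
    query = "select id, name, amount from orders"
    result_table_name = "orders"
  }
}

transform {
  # optional transforms (field mapping, filtering, ...) go here
}

sink {
  # write the rows somewhere; Console is handy for testing
  Console {
    source_table_name = "orders"
  }
}
```

Servitization turns each of these blocks into WebUI form fields, so a user fills in parameters instead of hand-editing and shipping a file like this.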

Job and instance management

Here is a fundamental concept: during development, a SeaTunnel task is generally called a script; after testing and release, it is called a job; and once the job is triggered, each specific run is called an execution instance.

Control of jobs: manual triggering (including data replenishment and single triggering), suspending scheduling, viewing upstream and downstream dependencies, viewing job content, etc.
Control of instances: rerunning, killing, viewing logs, etc.
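To make the script/job/instance relationship concrete, here is a tiny model of the lifecycle described above. The class and state names are hypothetical, not SeaTunnel's actual code:

```java
// Hypothetical model: a script becomes a job once released, and every
// trigger of a job produces a new execution instance.
enum InstanceState { WAITING, RUNNING, SUCCESS, FAILED, KILLED }

record Script(long id, String config) {}        // what you edit during development
record Job(long id, Script releasedScript) {}   // a script after test + release

final class Instance {
    final Job job;
    InstanceState state = InstanceState.WAITING;

    Instance(Job job) { this.job = job; }

    void kill()      { state = InstanceState.KILLED; }  // instance-level O&M action
    Instance rerun() { return new Instance(job); }      // a rerun is a fresh instance
}
```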

Usually, this capability is taken up by the scheduling systems in the big data ecosystem, such as DolphinScheduler or Azkaban. So why do we need to do these things in SeaTunnel? I'll leave you in suspense for now and first explain my thoughts on the overall architecture.


Once these two capabilities are in place, SeaTunnel can be considered a complete data integration solution. Any person or enterprise can quickly define data collection tasks after downloading it and doing some simple configuration, and by publishing those tasks to a scheduling system or to the built-in scheduling engine, they get periodic scheduling: business data, application logs, and other data can be synchronized quickly, accurately, and regularly to a big data storage platform or OLAP engine for fast analysis, accelerating the generation of data value. These are the two core objectives of this project.

The above is the value to users. But what can servitization bring to SeaTunnel itself?

Before SeaTunnel, or Waterdrop (SeaTunnel's former name), existed, its developers used Spark for data integration and found that common operations could be encapsulated in reusable code. Over time, this laid the foundation of early Waterdrop: a typical case of a basic data development component such as Spark evolving into a data integration tool.

Later, SeaTunnel started to build its own operation and maintenance control platform, unifying the management of its scripts, development tools, and development processes into a systematic platform for integrated development, operation, and maintenance. This platformization brings control, development, and O&M capabilities, which attracts more users and developers to the SeaTunnel community, and that is undoubtedly a great benefit to the community's growth.

The overall architecture of the service


Overall, the architecture of the SeaTunnel service is currently divided into three main parts:

  1. Control: management of data sources, users, permissions, scripts, jobs, instances, and everything displayed on the Web UI.
  2. Scheduling: responsible for dispatching tasks to different scheduling systems for scheduling and execution according to the configuration; the upper-layer control of jobs and instances also depends on the specific scheduling system.
  3. Execution: handles the specific execution of tasks. You can see the task-wrapper I made here; I will explain its role in detail below.

Brief description of management and control

Management capabilities


For data sources, this covers insert, delete, update, select, and connectivity testing; later we will support data source mapping, data probing, and other capabilities.

For users, in addition to insert, delete, update, and select, management also involves registration, login, and logout; however, if SeaTunnel aspires to be a first-class, self-contained, multi-tenant data integration service, user management will become considerably more complex.

Everything displayed on the page, such as menus, buttons, and data, should be incorporated into access control. The management module could also contain more content, such as resource management, custom connector and transform management, and project spaces, but these are not on our main path and there is no real user demand for them yet, so we will leave them for later discussion.

Development capabilities
This is basically about inserting, deleting, updating, and querying scripts, and especially about editing them: save, execute, stop, test, publish, basic parameter display, scheduling parameter configuration, alarm parameter adjustment, script content, data sources, transforms, concurrency, and so on. There is actually a lot to discuss here. Take testing: it usually means replacing the output source with the console and manually judging whether the script configuration is correct. If you want it to be more intelligent, you can add unit tests and allow a release only when every script passes them. Unit testing works like the usual Java unit tests we write: we mock the data and verify the process.
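For instance, here is a minimal sketch of such a test under JUnit 5. The Transform interface below is a hypothetical stand-in for whatever the service exposes, not SeaTunnel's actual API:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.List;
import java.util.Map;
import org.junit.jupiter.api.Test;

class UppercaseTransformTest {

    /** Hypothetical stand-in for a script's transform step. */
    interface Transform {
        List<Map<String, Object>> apply(List<Map<String, Object>> rows);
    }

    // Toy transform under test: uppercase the "name" field of every row.
    private final Transform uppercase = rows -> rows.stream()
            .map(r -> Map.<String, Object>of("name", r.get("name").toString().toUpperCase()))
            .toList();

    @Test
    void namesAreUppercased() {
        // Mock the input data ...
        List<Map<String, Object>> mocked = List.of(Map.of("name", "seatunnel"));
        // ... and verify the process.
        assertEquals("SEATUNNEL", uppercase.apply(mocked).get(0).get("name"));
    }
}
```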

Release is an essential step in the evolution of a script into a job; only a successfully released script is truly synchronized to the scheduling system. There is a lot we can control in the release process: for example, only a script that has passed testing can be released, or submitting a release application generates an OA approval flow or workflow that must be approved before the release goes through, and so on.
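As an illustration of such a gate, here is a small sketch; the types and checks below are assumptions, not the actual implementation:

```java
// Minimal release gate sketch: a script can only be published (and thus
// become a job in the scheduling system) after its tests pass and, where an
// approval flow is configured, after approval is granted.
final class ReleaseGate {
    interface Scheduler { void createJob(long scriptId); }   // hypothetical

    private final Scheduler scheduler;
    ReleaseGate(Scheduler scheduler) { this.scheduler = scheduler; }

    boolean testsPassed(long scriptId) { return true; }  // stub: query test results
    boolean approved(long scriptId)    { return true; }  // stub: query OA/approval flow

    void publish(long scriptId) {
        if (!testsPassed(scriptId)) throw new IllegalStateException("tests not passed");
        if (!approved(scriptId))    throw new IllegalStateException("not approved");
        scheduler.createJob(scriptId);  // only now is the script synchronized out
    }
}
```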

Of course, SeaTunnel itself is positioned as a weak-T (light transformation) data integration component, which means it will not carry much ETL logic. We will focus only on SeaTunnel tasks and are not much interested in folder management or directory-tree capabilities, which I think are better handled by a separate component, such as an open-source Web IDE.

Operation and maintenance capabilities

As I mentioned before, job O&M generally covers manual triggering (including data replenishment and single triggering), suspending scheduling, viewing job content, etc., while instance O&M generally covers rerunning, killing, viewing logs, etc. It is worth noting that we have both real-time and offline jobs, and they differ in O&M: real-time tasks have no scheduling cycle and no task dependencies, so their operation and maintenance will be different.

Scheduling Introduction

As a critical part of O&M, scheduling is divided into two parts.

Scheduling agent

You may be wondering why we do scheduling-related work in SeaTunnel at all, such as the job and instance O&M mentioned above, instead of controlling and operating tasks directly in the scheduling system once the script has been pushed there.

First of all, we want SeaTunnel to be a self-contained system that can deliver certain capabilities on its own rather than relying on other components. Secondly, there are many scheduling systems, and their APIs and capabilities behave inconsistently, so we need one integration per scheduling system; without an abstract API layer, the code would become confusing and hard to maintain.
Finally, it would be far less work to integrate with just one scheduling system, but we would inevitably lose the users of all the others; of course, they could still fall back to shell scripts as they did in the past, but that would greatly weaken the value of servitization.
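As a sketch of what that abstract layer could look like, here is one neutral interface with one adapter per scheduling system. All interface and method names below are hypothetical, not the actual SeaTunnel API:

```java
// One scheduler-neutral API; upper layers (job/instance O&M) depend only on this.
interface SchedulerClient {
    long submitJob(String name, String config, String cronExpression);
    void pause(long jobId);
    void trigger(long jobId);                 // manual run / backfill entry point
    void kill(long instanceId);
    String fetchLog(long instanceId);
}

// One adapter per scheduling system hides its concrete REST API.
final class DolphinSchedulerClient implements SchedulerClient {
    @Override public long submitJob(String name, String config, String cron) {
        // call DolphinScheduler's process-definition API here
        throw new UnsupportedOperationException("illustrative stub");
    }
    @Override public void pause(long jobId) { /* ... */ }
    @Override public void trigger(long jobId) { /* ... */ }
    @Override public void kill(long instanceId) { /* ... */ }
    @Override public String fetchLog(long instanceId) { return ""; }
}
```

Swapping Azkaban or another scheduler in then means writing one more adapter, not touching the control layer.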

What is the role of crontab-local?

Small and micro users may have no big data platform or data warehouse professionals behind them; they used to do data analysis with MySQL or Python scripts. As data volumes grow, running analytical tasks on MySQL or in Python becomes too slow and resource-intensive, so they may adopt an OLAP engine; and when they want to analyze data from their business databases, they are bound to use SeaTunnel to synchronize that data into the OLAP engine.
Such users have no access to data professionals, let alone a big data platform or scheduling system. This is where crontab-local comes into its own: SeaTunnel is self-contained and provides simple scheduling capabilities by itself, so a user only needs a few simple configuration changes to get started and complete the configuration and release of timed data integration tasks.
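A minimal sketch of the idea behind such a built-in scheduler follows; the script path is an assumption, and a fixed-rate trigger stands in for real cron parsing:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// JVM-local "crontab": periodically launch a SeaTunnel job from one process,
// with no external scheduling system involved.
public class CrontabLocal {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                new ProcessBuilder("bin/start-seatunnel-spark.sh",
                        "--config", "config/my-sync-task.conf")   // assumed paths
                        .inheritIO()
                        .start()
                        .waitFor();
            } catch (Exception e) {
                e.printStackTrace();   // a real scheduler would record a failed instance
            }
        }, 0, 1, TimeUnit.HOURS);      // e.g. run the sync task every hour
    }
}
```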

Execution layer: task-wrapper


When you open it in the IDE, you will see the directory structure shown in the figure.
It provides pre-task and post-task capabilities. We designed them because SeaTunnel's native capabilities alone are not sufficient for some scenarios.

For example: schema evolution, synchronization pre-processing for sharded databases and tables, dynamic partitioning, data quality, and so on. Users can of course implement these capabilities themselves through scheduling dependencies, but a post-task often needs to be merged with the SeaTunnel execution script to guarantee transactional consistency, which cannot be guaranteed if it is split into two tasks. So we let pre-tasks and post-tasks be assembled with SeaTunnel's execution engine into the exact content to execute, and we also let users implement these two task types themselves to realize their own business processes.

Besides pre-tasks and post-tasks, the other key point is actually executing the script: the wizard mode and canvas mode must be translated into the script that really runs, packaged together with the pre-task and post-task, and then handed over to the actual scheduling system.
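A rough sketch of the wrapping idea, with hypothetical names: the pre-task, the translated SeaTunnel script, and the post-task execute as a single unit rather than as separate scheduled tasks:

```java
// Hypothetical task-wrapper: everything runs inside one scheduled task, so the
// post-task is tied to the engine run instead of being a second, separate job.
interface Task { void run() throws Exception; }

final class TaskWrapper {
    private final Task preTask;    // e.g. create dynamic partitions, adjust schema
    private final Task engineTask; // the translated SeaTunnel script itself
    private final Task postTask;   // e.g. data quality checks, partition commit

    TaskWrapper(Task pre, Task engine, Task post) {
        this.preTask = pre; this.engineTask = engine; this.postTask = post;
    }

    void execute() throws Exception {
        preTask.run();
        engineTask.run();
        postTask.run();   // runs in the same unit as the engine run
    }
}
```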

The current progress of the community

Development progress tracking

You can track the progress through issue #1947 on GitHub. Three links are shown in the picture: the first is the general design and overall progress tracking; the second is the PR for script management and user management, which has been merged; and the third is the design of the scheduling agent layer and the integration with DolphinScheduler, which has been developed, with the PR expected to be merged within a month.

Script editing mode

At present, only script mode is supported in SeaTunnel. We can edit and develop scripts on this page, and the preview on the right makes it easier for users to locate their code; however, the entrances for basic information, parameter configuration, and scheduling configuration are still missing here. This issue is under discussion with the community's product developers and will be solved later.


After saving the script, you come to the page shown above, which is also the entry point for script creation. Start/edit mean starting and editing; the "update" here will be renamed "publish": a script cannot be started, stopped, or otherwise operated until it is published. The state shown is the state of the task's last execution; if it has never run, the record shows "unstart". The rest is easy to understand, so I won't repeat it.


Clicking on any task name brings you to this page, where you can see the task's execution metrics: the number of input and output records, data size, time consumed, the log of the current run, the history of execution records, and so on.
More product prototypes can be found in issue #2100, where many pictures are shown.

When will SeaTunnel-Server be available?

Rome was not built in a day; we will first deliver a stable, usable MVP, which contains:


  1. User management: the ability to add, delete, update, and query user accounts, plus login and logout.
  2. Script development: the addition, deletion, update, and query of scripts, supporting script mode only. As I mentioned in the issue, there are three script development modes: wizard mode, i.e., the configuration approach, in which you select the input and output sources and configure field mappings and so on; script mode, in which a SeaTunnel script is pasted in directly; and canvas mode, commonly known as drag-and-drop.
  3. Task operation and maintenance: executing and stopping released scripts, and viewing execution records and logs.

Roadmap

We have completed the design and development of the MVP version, a milestone for the project, covering user management, scripts, and task O&M. Each iteration round takes about two months, so after the 1.0 release we expect to complete two more minor versions by the end of the year.

[Version 1.1] Data source management, allowing users to focus more on business development.
[Version 1.2] Development, control, and O&M of real-time tasks. Why separate real-time and offline tasks? Because real-time tasks usually run in 24/7 mode, and their running state is inconsistent with the instance state shown in the scheduling system.
[Version 1.3] Control of user permissions. The order and scope of this work are my own tentative decisions; if more people join in, the versions can be richer and iterated faster.


As for the wizard mode everybody cares about, it will be designed and developed next year, because it depends heavily on both front-end and back-end work.
In version 2.0, we will do our best to reach the goals below:

  1. Launch the wizard mode from zero and optimize it continuously;
  2. Full coverage of task operation and maintenance: that is, improving the O&M module; more scheduling systems may be integrated, such as Airflow, Azkaban, etc.;
  3. Empowering business capabilities through pre-tasks and post-tasks.


Moreover, as SeaTunnel's own engine develops, I believe it will bring more capabilities and convenience to development, operations, and maintenance.

For example, dirty data collection, flow control, and other parameters will be configurable in scripts, and on the O&M side we will get more professional and clearer data integration metrics from SeaTunnel, which can be integrated and displayed directly on the SeaTunnel Web UI.

We expect to spend 6 to 10 months tackling these issues together with the community contributors working on the engine.

As for version 3.0, there is still a long journey ahead. I think it should fully cover canvas mode, resource management, and stream-batch unification. Finally, you are welcome to contribute and join our Apache SeaTunnel family. Thank you all!

About SeaTunnel
SeaTunnel (formerly Waterdrop) is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can synchronize hundreds of billions of records per day stably and efficiently.

Why do we need SeaTunnel?
SeaTunnel does everything it can to solve the problems you may encounter in synchronizing massive amounts of data.

  • Data loss and duplication
  • Task buildup and latency
  • Low throughput
  • Long application-to-production cycle time
  • Lack of application status monitoring

SeaTunnel Usage Scenarios

  • Massive data synchronization
  • Massive data integration
  • ETL of large volumes of data
  • Massive data aggregation
  • Multi-source data processing

Features of SeaTunnel

  • Rich components
  • High scalability
  • Easy to use
  • Mature and stable

How to get started with SeaTunnel quickly?
Want to experience SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.

https://seatunnel.apache.org/docs/2.1.0/developement/setup

How can I contribute?

We invite all partners who are interested in making local open source global to join the SeaTunnel contributor family and foster open source together!

Submit an issue:
https://github.com/apache/incubator-seatunnel/issues
Contribute code to:
https://github.com/apache/incubator-seatunnel/pulls
Subscribe to the community development mailing list:
dev-subscribe@seatunnel.apache.org
Development mailing list:
dev@seatunnel.apache.org

Join Slack:
https://join.slack.com/t/apacheseatunnel/shared_invite/zt-10u1eujlc-g4E~ppbinD0oKpGeoo_dAw
Follow Twitter:
https://twitter.com/ASFSeaTunnel
Come and join us!
