Introduction: To meet the needs of rapid growth, Orange Connex officially introduced Apache Doris in 2022 and built a new data warehouse architecture around it. Along the way, the team optimized service stability, query stability, data synchronization, and other aspects, and established a data platform based on Apache Doris. A great deal of usage and optimization experience was accumulated in the process, and it is shared in this article.
Author | Fu Shuai | Head Developer of Orange Connex Big Data Dept.
Orange Connex (NEEQ: 870366) is a technology company serving global e-commerce. It is committed to providing customers with logistics, finance, big data, and other services through market analysis, system R&D, and resource integration, delivering high-quality, all-round solutions. As a partner of eBay, eBay fulfillment by Orange Connex offers an exceptional delivery experience, including next-day delivery, same-day handling, and late cut-off times.
As the company's business developed, the early data warehouse architecture based on MySQL could no longer cope with the rapid growth of company data, while business and decision-making placed strong demands on data timeliness and the real-time capabilities of the data warehouse. To meet these needs, Orange Connex officially introduced Apache Doris in 2022 and built a new data warehouse architecture with it, optimizing service stability, query stability, data synchronization, and other aspects along the way. Meanwhile, a data platform with Apache Doris at its core has been established, and a great deal of usage and optimization experience has been accumulated, which I will share below.
Data Architecture Evolution
| Early data warehouse architecture
When the company was just getting started, the business was relatively small and the data team had only a few members. The demand for data was limited to a small number of customized T+1 reports. Therefore, the early data warehouse architecture was quite simple: as shown in the figure below, MySQL was used directly to build the DM (Data Mart) layer and develop the T+1 reports requested by the business side. Over time, this architecture exposed several problems:
- Using MySQL for data analysis was increasingly unable to meet the requirements brought by the company's expansion, the explosion of data volume, and rising data timeliness demands.
- There was no layered division of the data warehouse. The chimney-style development model offered poor data reusability and high development costs, and could not respond quickly to business needs.
- Lack of control over data quality and metadata management.
| New Data Warehouse Infrastructure
To solve the increasingly prominent problems of the old architecture and adapt to rapidly growing data and business needs, we officially introduced Apache Doris in 2022 to build a new data warehouse infrastructure.
Why We Chose Apache Doris:
Ease of Use - Introducing a new technology into an existing system brings a large number of migration problems, so usability had to be considered. Apache Doris is quite friendly to new users, with low migration and maintenance costs:
- It adopts the MySQL protocol and syntax, supports standard SQL, can be accessed through various client tools, and connects seamlessly with BI tools.
- It supports multi-table JOIN queries and provides a variety of JOIN optimizations for different scenarios.
- Its ecosystem is rich: it can efficiently batch-import offline data as well as ingest online streaming data in real time.
- Compared with other popular OLAP databases, Apache Doris's architecture is simpler: there are only two processes, FE (Frontend) and BE (Backend), and it does not depend on any third-party systems or tools.
- It supports elastic scaling and is very friendly to deployment, operation and maintenance.
Performance - Our workloads involve many multi-table JOIN operations and place extremely high requirements on the query performance of multi-table JOINs and real-time queries. Apache Doris is built on an MPP architecture and comes with an efficient columnar storage engine, which supports:
- Data pre-aggregation and automatic update of pre-aggregation results
- Real-time update of data
- High-concurrency queries

Based on the above reasons, we finally chose to build the new data warehouse with Apache Doris.
Introduction of the Architecture
The Apache Doris data warehouse architecture is quite simple: it does not depend on Hadoop components and has low construction and maintenance costs.
As shown in the architecture diagram above, we have four types of data sources: business data in MySQL, CSV files, event tracking data, and third-party system APIs. Different import methods are used for different needs: Doris Stream Load for file imports, DataX doriswriter for data initialization, and Flink Doris Connector for real-time data synchronization. Doris serves as both the data storage and computing layer. For layering inside Doris, we adopt ODS (Operational Data Store, also known as the source layer), the detail layer DWD, the middle layer DWM, the service layer DWS, and the application layer ADS as our layered design. Above ODS, layered data is updated incrementally and in full by scheduling Doris SQL through DolphinScheduler. Finally, the upper-layer data applications are served by a one-stop data service platform that connects seamlessly with Apache Doris and provides services such as self-service analytics reports, self-service data retrieval, data dashboards, and user behavior analysis.
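To make the file import path concrete: Doris Stream Load is an HTTP PUT interface exposed by the FE. The sketch below assembles such a request in Python; the host `fe_host`, port `8030`, and the load label are placeholder assumptions, and the actual PUT (with HTTP basic auth) is left to an HTTP client such as `requests`.

```python
import csv
import io

def build_stream_load_request(db, table, rows, columns):
    """Prepare the URL, headers, and CSV body for a Doris Stream Load call.

    Stream Load is an HTTP PUT against the FE; the `label` header makes the
    load idempotent, so retrying with the same label is not applied twice.
    """
    url = f"http://fe_host:8030/api/{db}/{table}/_stream_load"  # fe_host is a placeholder
    headers = {
        "label": f"{db}_{table}_20220101_001",  # hypothetical unique load label
        "column_separator": ",",
        "columns": ",".join(columns),
        "Expect": "100-continue",               # required by the Stream Load protocol
    }
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)             # serialize rows as CSV
    return url, headers, buf.getvalue()

url, headers, body = build_stream_load_request(
    "ods", "orders", [[1, "2022-01-01", "delivered"]], ["id", "dt", "status"])
# e.g. requests.put(url, data=body, headers=headers, auth=("user", "passwd"))
```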
The Apache Doris-based data warehouse architecture supports both offline and real-time application scenarios, and the real-time Doris data warehouse covers more than 80% of our business scenarios. This architecture greatly reduces R&D costs and improves development efficiency.
Of course, there are also some problems and challenges in the process of architecture construction, and we have optimized the problems accordingly.
Apache Doris Metadata Management and Data Lineage Implementation Scheme
Before we had metadata management and data lineage, we often ran into problems. For example, when we wanted to locate an indicator but did not know which table it lived in, we could only ask the relevant developers to confirm, and sometimes the developers themselves had forgotten the indicator's location and logic. The answer could then only be found by screening layer by layer, which was very time-consuming.
Previously, we kept information such as the layer a table belonged to, its indicators, and its owner in Excel sheets. This maintenance method made it difficult to guarantee completeness, and the sheets were hard to maintain. When the data warehouse needed to be optimized, it was impossible to confirm which tables could be reused and which could be merged. When a table structure needed to change, or the logic of an indicator needed to be modified, it was impossible to determine whether the change would affect downstream tables.
As a result, we often received complaints from users. Below, we introduce how we solved these problems with a metadata management and data lineage analysis solution.
Metadata management and data lineage revolve around Apache Doris, while also integrating DolphinScheduler's metadata. In the figure above, the right side lists the data services that technical metadata, business metadata, metadata indicators, and data lineage analysis can provide.
We divide metadata into two categories: technical metadata and business metadata:
· Technical metadata maintains table attribute information and scheduling information
· Business metadata maintains the calculation caliber and normative information agreed upon during the use of data
Data lineage implements table-level lineage and field-level lineage:
· Table-level lineage supports coarse-grained table relationship and cross-layer reference analysis
· Field-level lineage supports fine-grained impact analysis
Next, we will introduce the architecture and working principles of metadata management and data lineage.
Metadata Management and Data Lineage Implementation Solution
· Data collection: use the audit log plugin (Doris Audit Plugin) provided by Apache Doris to collect data
· Data storage: the audit log plugin was customized so that Doris audit log data is stored in Kafka
· Lineage parsing: Doris SQL is parsed with Druid (the Alibaba Druid SQL parser)
· Lineage storage: Nebula Graph stores the lineage data
· Business metadata: because business metadata undergoes frequent CRUD operations, MySQL stores the business metadata information
· Data search: Elasticsearch stores the data lineage query indexes and the table search index data
Next, we introduce the four components of the architecture: the audit log collection and cleaning service, the data lineage analysis service, the metadata information integration service, and the application interface service.
Collection/Cleaning Service for Apache Doris Audit Logs
If the data cleaning logic were placed inside the audit log plugin, changes to that logic could cause data omissions, which would affect lineage analysis and metadata management. We therefore decoupled audit log collection from data cleaning. After this transformation, the audit log plugin formats the audit log data and sends it to Kafka. The data cleaning service first applies reordering logic, re-sorting the data sent by multiple audit log plugins to solve out-of-order arrival. Second, it converts non-standard SQL into standard SQL: although Apache Doris supports the MySQL protocol and standard SQL syntax, some of its table creation statements and query syntax differ from standard SQL, so we normalize them into MySQL-standard statements. Finally, the cleaned data is sent to Elasticsearch and Kafka.
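A minimal sketch of the reordering step, assuming each audit record carries a timestamp and a unique query ID (the field names `ts` and `query_id` are hypothetical): records are buffered in a min-heap and released only once they fall behind a watermark, so late arrivals from other plugin instances can still be slotted back into order.

```python
import heapq

def reorder(records, watermark_delay=5):
    """Emit audit records in timestamp order despite out-of-order arrival.

    A record is released once it is at least `watermark_delay` time units
    older than the newest timestamp seen so far; the rest stay buffered.
    """
    heap, out, max_ts = [], [], float("-inf")
    for rec in records:
        max_ts = max(max_ts, rec["ts"])
        # query_id breaks ties so dicts are never compared directly
        heapq.heappush(heap, (rec["ts"], rec["query_id"], rec))
        while heap and heap[0][0] <= max_ts - watermark_delay:
            out.append(heapq.heappop(heap)[2])
    while heap:  # flush the remaining buffer at end of stream
        out.append(heapq.heappop(heap)[2])
    return out
```

In production this logic would run continuously against the Kafka topic rather than over a finite list, but the watermark idea is the same.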
Data Lineage Analysis Service
The data lineage analysis service uses Druid to parse Doris SQL and recursively walks the Druid abstract syntax tree layer by layer to obtain the lineage between tables and between fields. The lineage is then encapsulated and sent to the graph database, and the lineage query indexes are sent to Elasticsearch. During parsing, technical metadata and business metadata are also sent to their corresponding storage.
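In production this parsing works on the Druid (Alibaba Druid) abstract syntax tree in Java; the toy Python sketch below only illustrates the table-level idea with regular expressions and would not survive real-world SQL (subqueries, CTEs, quoting, and so on).

```python
import re

def table_lineage(sql):
    """Extract a crude table-level lineage edge: target <- sources.

    This is a deliberately simplified illustration; a real implementation
    walks a proper SQL parser's AST instead of matching keywords.
    """
    target = re.search(r"insert\s+into\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.I)
    return {
        "target": target.group(1) if target else None,
        "sources": sorted(set(sources)),  # de-duplicate repeated references
    }
```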
Metadata Information Integration Service
The metadata information integration service draws on the architecture implementation of Metacat.
· Connector Manager is responsible for creating metadata links between Apache Doris and DolphinScheduler, and supports subsequent extensions of other types of data source access.
· Meta Service is responsible for the concrete implementation of metadata retrieval. Apache Doris metadata is obtained mainly from the information_schema database, the Restful API, and the results of SHOW statements. DolphinScheduler's workflow metadata and scheduling records are obtained from the DolphinScheduler metadata database.
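For instance, table-level technical metadata can be pulled over the MySQL protocol from `information_schema`. The sketch below only assembles the queries such a Meta Service might run (the database name is a parameter; execution is left to any MySQL-compatible client), so treat the exact column list as an assumption to check against your Doris version.

```python
def table_metadata_queries(db):
    """Build the information_schema queries used to collect technical
    metadata for one database over the MySQL protocol."""
    return {
        "tables": (
            "SELECT table_name, table_comment, create_time "
            "FROM information_schema.tables "
            f"WHERE table_schema = '{db}'"
        ),
        "columns": (
            "SELECT table_name, column_name, data_type, column_comment "
            "FROM information_schema.columns "
            f"WHERE table_schema = '{db}'"
        ),
    }
```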
We provide three types of APIs: the data lineage API, the metadata API, and the data behavior analysis API.
· The data lineage API provides query services for tables, fields, data lineage, and impact analysis.
· Metadata API provides metadata query and field search services.
· The data behavior analysis API provides query services for table structure change records, data read and write records, and output information.
This concludes the overview of the metadata management and data lineage analysis architecture.
This year, we completed the construction of a real-time data warehouse with Apache Doris. After half a year of use and optimization, Apache Doris has become stable and can meet our production requirements.
The new real-time data warehouse greatly improves data computing efficiency and data timeliness. Taking the On-Time Delivery business scenario as an example: calculating the timeliness changes across 10 million waybill tracking nodes used to take more than 2 hours and consumed so many resources that it could only be run off-peak during idle periods; with Doris, the calculation completes in only 3 minutes. The full-link logistics timeliness table, which used to be updated once a week, can now be refreshed with the latest data every 10 minutes, achieving real-time data timeliness.
Thanks to Apache Doris's standard SQL, the barrier to entry is low, the learning cost is small, and everyone on the team could participate in migrating tables. Reports were originally developed in Power BI, which required a very deep understanding of the tool; the learning cost was high and the development cycle exceptionally long. Moreover, Power BI does not use standard SQL, so code readability was poor. Now, based on Doris SQL and our self-developed drag-and-drop tooling, the cost of developing reports has plummeted, and the development cycle for most requirements has dropped from weeks to days.
In the future, we will continue to build out the data platform based on Apache Doris and keep improving metadata management and the parse success rate of data lineage. Since data lineage is a capability many teams want, we will consider contributing it to the community once it is refined.
At the same time, we are starting to build a user behavior analysis platform and are considering Apache Doris as its core storage and computing engine. At present, the functions Apache Doris supports in some analysis scenarios are not yet rich enough. For example, in ordered window funnel analysis, although Apache Doris provides the window_funnel function, the Array-related calculation functions needed to compute each layer of funnel conversion are not yet supported. Fortunately, the upcoming Apache Doris 1.2 release will include the Array type and related functions, and we believe Apache Doris will land in more and more analysis scenarios in the future.
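To make the ordered window funnel concrete, here is a simplified Python re-implementation of the idea behind Doris's window_funnel (not the actual Doris function, whose signature and modes differ): it returns the deepest funnel step a user reaches, counting only events that occur in order within `window` time units of the first step.

```python
def window_funnel(window, events, steps):
    """Return the maximum funnel depth reached.

    events: list of (timestamp, event_name) for one user
    steps:  the ordered funnel, e.g. ["visit", "cart", "pay"]
    Later steps must occur after earlier ones and within `window`
    time units of the step-1 event that opened the funnel.
    """
    events = sorted(events)
    best = 0
    for i, (t0, name) in enumerate(events):
        if name != steps[0]:
            continue  # a funnel can only open at step 1
        depth, next_step = 1, 1
        for t, n in events[i + 1:]:
            if t - t0 > window:
                break  # outside the time window of this funnel instance
            if next_step < len(steps) and n == steps[next_step]:
                depth += 1
                next_step += 1
        best = max(best, depth)
    return best
```

Each layer's conversion rate then falls out of comparing how many users reach depth k versus depth k+1, which is where the Array functions mentioned above come in on the SQL side.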
About the author: Fu Shuai, Big Data R&D manager, digital team of Orange Connex (China) Co., Ltd., responsible for the digital team's data platform and the application of the OLAP engine.