As discussed in the previous post, OpenMLDB eliminates the cumbersome offline-online consistency verification procedure: data scientists implement feature scripts in an SQL-like language during development, then deploy the very same scripts online straight away, where they serve features with millisecond-level latency.
Concepts
Workflow
- Offline Data Import: Importing offline data for offline feature engineering development and debugging.
- Offline Feature Development: Developing feature engineering scripts and debugging them until satisfactory results are achieved. This step involves joint debugging of machine learning models. This article primarily focuses on the feature engineering part.
- Feature Scheme Deployment: Once satisfactory feature scripts are obtained, they are deployed for production.
- Cold Start Online Data Import: Before going live, it is necessary to import the data covering the required time window into the online storage engine. For example, if the feature scheme aggregates features over the past three months of data, the cold start must import the previous three months of data (see the window sketch following this list).
- Real-Time Data Stream Integration: After the system is live, over time, the latest data needs to be integrated to maintain the window calculation logic.
- Online Data Preview (Optional): A preview check of online data can be performed using supported SQL commands. This step is not mandatory.
- Real-Time Feature Computation: Once the feature scheme is deployed, and data is correctly integrated, you will have a real-time feature calculation service that can respond to online requests.
The above steps together make up the full development workflow for feature engineering with OpenMLDB.
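To make the windowing in the cold start step concrete, the sketch below shows how a "past three months" aggregation could be expressed as a time-based window in OpenMLDB SQL. The table t_trades and its columns user_id, amount, and trade_time are hypothetical, and the exact ROWS_RANGE syntax should be double-checked against the OpenMLDB documentation for your version.
-- Hypothetical sketch: sum of each user's trade amounts over the last 90 days.
-- A cold start for this feature would need to import at least the last 90 days of t_trades.
SELECT user_id, sum(amount) OVER w90d AS amount_90d_sum
FROM t_trades
WINDOW w90d AS (PARTITION BY user_id ORDER BY trade_time ROWS_RANGE BETWEEN 90d PRECEDING AND CURRENT ROW);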
Execution Mode
As depicted in Figure 1, these seven steps run in three different execution modes: offline mode, online preview mode, and online request mode. We will briefly review what each mode is, along with its general usage and behavior.
Offline Mode
The default mode of the OpenMLDB CLI after startup is offline mode. Offline data import (1) and offline feature development (2) are performed in offline mode. The purpose of offline mode is to manage and compute on offline data. The computation nodes are backed by the OpenMLDB Spark distribution, which is optimized for feature engineering, and the storage nodes support common storage systems such as HDFS.
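As a sketch, an offline import from a distributed file system might look like the following. The table name my_table and the hdfs:// path are placeholders, and the example assumes the cluster's TaskManager has been configured with access to your HDFS.
-- OpenMLDB CLI (offline mode); table name and HDFS path are hypothetical
SET @@execute_mode='offline';
LOAD DATA INFILE 'hdfs:///path/to/data.parquet' INTO TABLE my_table OPTIONS(format='parquet', mode='append');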
Online Preview Mode
Online deployment of feature schemes (3), cold start online data import (4), real-time data stream integration (5), and online data preview (6) are performed in online preview mode. The purpose of online preview mode is to manage and preview online data. The storage and computation of online data are supported by the tablet component.
Online Request Mode
After the feature scripts are deployed and the online data is integrated, the real-time feature computation (7) service is ready, and you can perform real-time feature extraction using the online request mode. Both REST APIs and SDKs are supported. Online request mode is unique to OpenMLDB: it performs online real-time computation and is quite different from typical SQL queries in common databases.
Quickstart
Now let’s dive into how to quickly start your feature engineering journey with OpenMLDB. We recommend using Docker for a stable and consistent environment; the minimum supported Docker version is 18.03, and we recommend running on Linux or Windows.
Preparation
- Pull the Docker image and start a container
docker run -it 4pdosc/openmldb:0.8.3 bash
- Start the OpenMLDB server (all subsequent actions are performed inside the Docker container)
/work/init.sh
- Start OpenMLDB CLI
/work/openmldb/bin/openmldb --zk_cluster=127.0.0.1:2181 --zk_root_path=/openmldb --role=sql_client
Successful startup will show the interactive OpenMLDB CLI prompt.
Usage
As a quick demonstration, we will simplify the steps as follows: offline data import (1), offline feature development (2), feature scheme deployment (3), online data import (4 and 5 simplified), and real-time feature computation (7).
Offline Data Import
First, you can create a database demo_db and a table demo_table1:
-- OpenMLDB CLI
CREATE DATABASE demo_db;
USE demo_db;
CREATE TABLE demo_table1(c1 string, c2 int, c3 bigint, c4 float, c5 double, c6 timestamp, c7 date);
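If you want to confirm that the table was created as expected, you can typically inspect its schema from the CLI; the exact output format may vary across OpenMLDB versions.
-- OpenMLDB CLI
DESC demo_table1;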
Then, set the execution mode to offline, and import offline data for offline feature computation:
-- OpenMLDB CLI
USE demo_db;
SET @@execute_mode='offline';
LOAD DATA INFILE 'file:///work/taxi-trip/data/data.parquet' INTO TABLE demo_table1 options(format='parquet', mode='append');
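Offline imports are executed as jobs on the Spark engine. If the command returns asynchronously (that is, @@sync_job is false), you can check the job status before moving on. The following is a sketch; the job id assigned to your run will differ.
-- OpenMLDB CLI
SHOW JOBS;
-- You can also inspect a single job by its id, for example: SHOW JOB 1;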
Offline Feature Development
Here you can develop your own feature scripts and compute offline features, for example:
-- OpenMLDB CLI
USE demo_db;
SET @@execute_mode='offline';
SET @@sync_job=false;
SELECT c1, c2, sum(c3) OVER w1 AS w1_c3_sum FROM demo_table1 WINDOW w1 AS (PARTITION BY demo_table1.c1 ORDER BY demo_table1.c6 ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) INTO OUTFILE '/tmp/feature_data' OPTIONS(mode='overwrite');
Feature Scheme Deployment
Now that you have your feature script developed and tested, you can deploy it online for serving! You can name your service, for example, demo_data_service.
-- OpenMLDB CLI
SET @@execute_mode='online';
USE demo_db;
DEPLOY demo_data_service SELECT c1, c2, sum(c3) OVER w1 AS w1_c3_sum FROM demo_table1 WINDOW w1 AS (PARTITION BY demo_table1.c1 ORDER BY demo_table1.c6 ROWS BETWEEN 2 PRECEDING AND CURRENT ROW);
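To double-check that the deployment has been registered, you can usually list the deployments and inspect the deployed SQL; the output layout may differ between versions.
-- OpenMLDB CLI
SHOW DEPLOYMENTS;
SHOW DEPLOYMENT demo_data_service;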
Online Data Import
In online preview mode, import online data for online feature computation.
-- OpenMLDB CLI
USE demo_db;
SET @@execute_mode='online';
LOAD DATA INFILE 'file:///work/taxi-trip/data/data.parquet' INTO TABLE demo_table1 options(format='parquet', header=true, mode='append');
You can preview the data.
-- OpenMLDB CLI
USE demo_db;
SET @@execute_mode='online';
SELECT * FROM demo_table1 LIMIT 10;
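Besides previewing rows, you can also verify that the online import populated the table by checking the table statistics; SHOW TABLE STATUS is expected to report the online row count and storage usage, although the exact columns may vary by version.
-- OpenMLDB CLI
SHOW TABLE STATUS;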
Real-Time Feature Computation
Now that you have finished most of the development and deployment, let’s test your online feature service! The real-time service is available at:
http://127.0.0.1:9080/dbs/demo_db/deployments/demo_data_service
where 127.0.0.1:9080 is the API server address, demo_db is the database name, and demo_data_service is the deployment name.
To test, first exit OpenMLDB CLI.
-- OpenMLDB CLI
quit;
You can now send a request by putting a data row, in the table's column order, in the input field:
curl http://127.0.0.1:9080/dbs/demo_db/deployments/demo_data_service -X POST -d'{"input": [["aaa", 11, 22, 1.2, 1.3, 1635247427000, "2021-05-20"]]}'
Expected query result:
{"code":0,"msg":"ok","data":{"data":[["aaa",11,22]]}}
Voila! That’s it! If you want more details, visit the official Quickstart documentation.
For more information on OpenMLDB:
- Website: https://openmldb.ai/
- GitHub: https://github.com/4paradigm/OpenMLDB
- Documentation: https://openmldb.ai/docs/en/
- Join us on Slack!
This post is a re-post from OpenMLDB Blogs.