Hana Wang

Quickstart with OpenMLDB

As discussed in the previous post, OpenMLDB lets you remove the cumbersome offline-online consistency verification step from the development cycle. Data scientists implement feature scripts in an SQL-like language and deploy the same scripts online directly, where they serve features with millisecond-level latency.

Concepts

Workflow

Figure 1: Typical Workflow of OpenMLDB in Feature Development and Serving and Respective Execution Modes

  1. Offline Data Import: Importing offline data for offline feature engineering development and debugging.
  2. Offline Feature Development: Developing feature engineering scripts and debugging them until satisfactory results are achieved. In practice this step is debugged jointly with the machine learning model, but this article focuses on the feature engineering part.
  3. Feature Scheme Deployment: Once satisfactory feature scripts are obtained, they are deployed for production.
  4. Cold Start Online Data Import: Before going live, it’s necessary to import data within the required time window into the online storage engine. For example, if the feature scheme involves aggregating features for the past three months of data, cold start requires importing data from the previous three months.
  5. Real-Time Data Stream Integration: Once the system is live, the latest data must be ingested continuously so that window computations stay up to date.
  6. Online Data Preview (Optional): A preview check of online data can be performed using supported SQL commands. This step is not mandatory.
  7. Real-Time Feature Computation: Once the feature scheme is deployed, and data is correctly integrated, you will have a real-time feature calculation service that can respond to online requests.

The rest of this post follows these steps to illustrate a full feature engineering development process with OpenMLDB.
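
The cold start in step 4 can be pictured as a simple time filter over historical data. The sketch below is only an illustration of that idea: the `cold_start_rows` helper, the sample events, and the approximation of "three months" as 90 days are all assumptions made for this example, not part of OpenMLDB.

```python
from datetime import datetime, timedelta, timezone


def cold_start_rows(rows, now, window):
    """Keep only rows whose event time falls inside [now - window, now]."""
    cutoff = now - window
    return [row for row in rows if cutoff <= row["ts"] <= now]


# Hypothetical history: one event older than the window, two inside it.
now = datetime(2024, 4, 1, tzinfo=timezone.utc)
history = [
    {"id": "a", "ts": datetime(2023, 12, 1, tzinfo=timezone.utc)},
    {"id": "b", "ts": datetime(2024, 2, 15, tzinfo=timezone.utc)},
    {"id": "c", "ts": datetime(2024, 3, 30, tzinfo=timezone.utc)},
]

# "Three months" is approximated as 90 days for this sketch.
to_import = cold_start_rows(history, now, timedelta(days=90))
print([row["id"] for row in to_import])
```

Only the rows inside the window need to reach the online storage engine before going live; older rows cannot influence a three-month aggregation.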

Execution Mode

As depicted in Figure 1, the seven steps run in three execution modes: offline mode, online preview mode, and online request mode. The following briefly describes what each mode is for and how it behaves.

  • Offline Mode
    The default mode for OpenMLDB CLI after startup is the offline mode. Offline data import (1) and offline feature development (2) are performed in offline mode. The purpose of the offline mode is to manage and compute on offline data. Computation runs on the OpenMLDB Spark distribution, which is optimized for feature engineering, and storage can use common storage systems such as HDFS.

  • Online Preview Mode
    Online deployment of feature schemes (3), cold start online data import (4), real-time data stream integration (5), and online data preview (6) are performed in online preview mode. The purpose of online preview mode is to manage and preview online data. The storage and computation of online data are supported by the tablet component.

  • Online Request Mode
    After deploying feature scripts and integrating with online data, the real-time feature computation (7) service is ready, and you can perform real-time feature extraction using the online request mode. REST APIs and SDKs are supported for online request mode. The online request mode is a unique mode in OpenMLDB that supports online real-time computation and is quite different from typical SQL queries in common databases.
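
To make the contrast with an ordinary query concrete, here is a toy model of a request-mode computation for a sliding-window sum. It assumes the request row is treated as the newest row of its partition and combined with the rows already stored online; the function name, the sample rows, and the tie-breaking rule are hypothetical, and this is a mental model rather than OpenMLDB's implementation.

```python
def request_mode_sum(stored_rows, request_row, preceding=2):
    """Sum c3 over the request row plus up to `preceding` earlier rows sharing its c1."""
    # Rows of the same partition that are not newer than the request row.
    same_key = [
        r for r in stored_rows
        if r["c1"] == request_row["c1"] and r["c6"] <= request_row["c6"]
    ]
    same_key.sort(key=lambda r: r["c6"])
    # Guard preceding == 0, because same_key[-0:] would return the whole list.
    history = same_key[-preceding:] if preceding > 0 else []
    return sum(r["c3"] for r in history + [request_row])


# Hypothetical rows already imported into online storage.
stored = [
    {"c1": "aaa", "c3": 10, "c6": 1000},
    {"c1": "aaa", "c3": 20, "c6": 2000},
    {"c1": "aaa", "c3": 30, "c6": 3000},
    {"c1": "bbb", "c3": 99, "c6": 2500},
]

print(request_mode_sum(stored, {"c1": "aaa", "c3": 5, "c6": 4000}))
print(request_mode_sum(stored, {"c1": "zzz", "c3": 7, "c6": 4000}))
```

One request row goes in and one feature row comes out. A key with no stored history simply aggregates over the request row itself.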

Quickstart

Now let’s dive into how to quickly start your feature engineering journey with OpenMLDB. We recommend using Docker for a stable and consistent environment; the minimum required Docker version is 18.03. We also recommend running on Linux or Windows.
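
If you want to check the version requirement in a script, a small sketch like the following could work. It assumes the usual `docker --version` output of the form `Docker version X.Y.Z, build ...`; the `docker_version_ok` helper is hypothetical and only compares the major and minor numbers.

```python
import re

MIN_DOCKER = (18, 3)  # minimum version stated above: 18.03


def docker_version_ok(version_output, minimum=MIN_DOCKER):
    """Return True if the first 'major.minor' pair in the text is >= minimum."""
    match = re.search(r"(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


print(docker_version_ok("Docker version 24.0.7, build afdd53b"))
print(docker_version_ok("Docker version 17.12.1-ce, build 7390fc6"))
```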

Preparation

  1. Pull the Docker image and start a container
docker run -it 4pdosc/openmldb:0.8.3 bash
  2. Start the OpenMLDB server (all remaining commands run inside the Docker container)
/work/init.sh
  3. Start the OpenMLDB CLI
/work/openmldb/bin/openmldb --zk_cluster=127.0.0.1:2181 --zk_root_path=/openmldb --role=sql_client

Successful startup will show something like this:

Figure 2: Successful Startup of OpenMLDB CLI

Usage

As a quick demonstration, we will simplify the steps as follows: offline data import (1), offline feature development (2), feature scheme deployment (3), online data import (4 and 5 simplified), and real-time feature computation (7).

Offline Data Import

First, you can create a database demo_db, and a table demo_table1:

-- OpenMLDB CLI
CREATE DATABASE demo_db;
USE demo_db;
CREATE TABLE demo_table1(c1 string, c2 int, c3 bigint, c4 float, c5 double, c6 timestamp, c7 date);

Then, set the execution mode to offline, and import offline data for offline feature computation:

-- OpenMLDB CLI
USE demo_db;
SET @@execute_mode='offline';
LOAD DATA INFILE 'file:///work/taxi-trip/data/data.parquet' INTO TABLE demo_table1 options(format='parquet', mode='append');

Offline Feature Development

Here you can develop your own feature scripts and compute offline features, as in the following example:

-- OpenMLDB CLI
USE demo_db;
SET @@execute_mode='offline';
SET @@sync_job=false;
SELECT c1, c2, sum(c3) OVER w1 AS w1_c3_sum FROM demo_table1 WINDOW w1 AS (PARTITION BY demo_table1.c1 ORDER BY demo_table1.c6 ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) INTO OUTFILE '/tmp/feature_data' OPTIONS(mode='overwrite');
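
To see what the window clause in this query computes, the following sketch mimics `sum(c3)` over `PARTITION BY c1 ORDER BY c6 ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` in plain Python. The `rolling_sum_by_key` helper and the sample rows are made up for illustration, and the output ordering (by key, then by `c6`) is a choice of this sketch rather than a guarantee of the SQL engine.

```python
from collections import defaultdict


def rolling_sum_by_key(rows, preceding=2):
    """Return (c1, c2, w1_c3_sum) per row, summing c3 over the current row
    and up to `preceding` earlier rows of the same c1, ordered by c6."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["c1"]].append(row)

    out = []
    for key in sorted(partitions):
        ordered = sorted(partitions[key], key=lambda r: r["c6"])
        for i, row in enumerate(ordered):
            window = ordered[max(0, i - preceding): i + 1]
            out.append((row["c1"], row["c2"], sum(r["c3"] for r in window)))
    return out


# Hypothetical input rows; only the columns used by the query are included.
rows = [
    {"c1": "aaa", "c2": 1, "c3": 10, "c6": 1000},
    {"c1": "aaa", "c2": 2, "c3": 20, "c6": 2000},
    {"c1": "bbb", "c2": 3, "c3": 100, "c6": 1500},
    {"c1": "aaa", "c2": 4, "c3": 30, "c6": 3000},
    {"c1": "aaa", "c2": 5, "c3": 40, "c6": 4000},
]

for feature_row in rolling_sum_by_key(rows):
    print(feature_row)
```

The fourth `aaa` row sums 20 + 30 + 40 = 90: the window slides forward and drops the oldest row once more than two predecessors exist.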

Feature Scheme Deployment

Now that you have your feature script developed and tested, you can deploy it online for serving! You can name your service, for example, demo_data_service.

-- OpenMLDB CLI
SET @@execute_mode='online';
USE demo_db;
DEPLOY demo_data_service SELECT c1, c2, sum(c3) OVER w1 AS w1_c3_sum FROM demo_table1 WINDOW w1 AS (PARTITION BY demo_table1.c1 ORDER BY demo_table1.c6 ROWS BETWEEN 2 PRECEDING AND CURRENT ROW);

Online Data Import

In online preview mode, import online data for online feature computation.

-- OpenMLDB CLI
USE demo_db;
SET @@execute_mode='online';
LOAD DATA INFILE 'file:///work/taxi-trip/data/data.parquet' INTO TABLE demo_table1 options(format='parquet', header=true, mode='append');

You can preview the data.

-- OpenMLDB CLI
USE demo_db;
SET @@execute_mode='online';
SELECT * FROM demo_table1 LIMIT 10;

Real-Time Feature Computation

Now that you have finished most of the development and deployment, let’s test your online feature service! The real-time online service is live at:

http://127.0.0.1:9080/dbs/demo_db/deployments/demo_data_service
        \___________/     \____/              \_____________/
              |              |                       |
        APIServerAddress  Database Name         Deployment Name
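
The three parts of this URL can be assembled programmatically. The sketch below uses a hypothetical `deployment_url` helper and percent-encodes the database and deployment names as a precaution, which is an assumption of this example rather than a documented requirement.

```python
from urllib.parse import quote


def deployment_url(api_server, database, deployment):
    """Build the REST endpoint for a deployment from its three parts."""
    return (
        f"http://{api_server}"
        f"/dbs/{quote(database, safe='')}"
        f"/deployments/{quote(deployment, safe='')}"
    )


print(deployment_url("127.0.0.1:9080", "demo_db", "demo_data_service"))
```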

To test, first exit OpenMLDB CLI.

-- OpenMLDB CLI
quit;

You can now query the service by sending a request row in the input field of the POST body:

curl http://127.0.0.1:9080/dbs/demo_db/deployments/demo_data_service -X POST -d'{"input": [["aaa", 11, 22, 1.2, 1.3, 1635247427000, "2021-05-20"]]}'
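
The same request can be prepared from Python with only the standard library. This is a sketch under a few assumptions: the `build_request` and `send` helpers are hypothetical, the row values follow the column order of the `CREATE TABLE` statement above, and no extra headers are set so that the request stays close to the curl call.

```python
import json
import urllib.request

# Column order of demo_table1, as declared in the CREATE TABLE statement.
COLUMNS = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]

URL = "http://127.0.0.1:9080/dbs/demo_db/deployments/demo_data_service"


def build_request(url, record):
    """Turn a column-name -> value mapping into a POST request with one input row."""
    row = [record[name] for name in COLUMNS]
    body = json.dumps({"input": [row]}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def send(request):
    """Send the request; this needs a running OpenMLDB API server."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


record = {
    "c1": "aaa", "c2": 11, "c3": 22, "c4": 1.2,
    "c5": 1.3, "c6": 1635247427000, "c7": "2021-05-20",
}
request = build_request(URL, record)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `send(request)` inside the container would perform the actual query; it is left uncalled here so the snippet runs without a server.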


Expected query result:
{"code":0,"msg":"ok","data":{"data":[["aaa",11,22]]}}
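
A client will usually want the returned values labeled with the feature names. The sketch below parses the response shown above and pairs each value with the output columns of the deployed query (`c1`, `c2`, `w1_c3_sum`). The `parse_features` helper is hypothetical, and treating any non-zero `code` as a failure is an assumption based only on the successful response shown here.

```python
import json

# Output columns of the deployed SELECT, in order.
OUTPUT_COLUMNS = ["c1", "c2", "w1_c3_sum"]


def parse_features(response_text):
    """Parse the JSON response and label each returned row with column names."""
    payload = json.loads(response_text)
    if payload["code"] != 0:  # assumption: non-zero code signals an error
        raise RuntimeError(f"request failed: {payload['msg']}")
    return [dict(zip(OUTPUT_COLUMNS, row)) for row in payload["data"]["data"]]


features = parse_features('{"code":0,"msg":"ok","data":{"data":[["aaa",11,22]]}}')
print(features)
```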

Voila! That’s it! If you want more details, visit the official Quickstart documentation.


This post is a re-post from OpenMLDB Blogs.
