Jenna Ritten for IBM Developer

Think Lab 2124 : Build & Deploy AI/ML Models w Multiple Datasets w AutoAI - Tutorial A

AutoAI Overview

AutoAI in Cloud Pak for Data automates the ETL (Extract, Transform, and Load) and feature engineering process for relational data, saving data scientists months of manual data preparation time and achieving results comparable to those of top-performing data scientists.

The AutoAI graphical tool in Watson Studio automatically analyzes your data and generates candidate model pipelines customized for your predictive modeling problem. These model pipelines are created iteratively as AutoAI analyzes your dataset and discovers data transformations, algorithms, and parameter settings that work best for your problem setting. Results are displayed on a leaderboard, showing the automatically generated model pipelines ranked according to your problem optimization objective.

Collect your input data in a CSV file or files. Where possible, AutoAI will transform the data and impute missing values.
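To make the imputation step concrete, here is a minimal sketch of the kind of fill-in AutoAI performs automatically; the column names and the median/mode strategy are illustrative assumptions, not AutoAI's actual internals.

```python
import pandas as pd

def impute(df):
    """Fill numeric gaps with the column median, categorical gaps with the mode.
    (Illustrative stand-in for AutoAI's automatic imputation.)"""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Hypothetical rows with missing values, as they might appear in a sales CSV
sales = pd.DataFrame({"Quantity": [10, None, 30],
                      "Order method": ["Web", None, "Web"]})
clean = impute(sales)
```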

Notes:

  • Your data source must contain a minimum of 100 records (rows).
  • You can use the IBM Watson Studio Data Refinery tool to prepare and shape your data.
  • Data can be a file added as connected data from a networked file system (NFS). Follow the instructions for adding a data connection of the type Mounted Volume. Choose the CSV file to add to the project so you can select it for training data.

AutoAI Process

Using AutoAI, you can build and deploy a machine learning model with sophisticated training features and no coding. The tool does most of the work for you.

AutoAI automatically runs the following tasks to build and evaluate candidate model pipelines:

  • Data pre-processing
  • Automated model selection
  • Automated feature engineering
  • Hyperparameter optimization
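
The four stages above can be hand-rolled with scikit-learn to see what AutoAI is saving you; this is a rough sketch of the equivalent manual workflow, not a description of AutoAI's internals, and every name in it is illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for training data
X, y = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)  # 10% holdout, as AutoAI does by default

pipe = Pipeline([
    ("scale", StandardScaler()),         # data pre-processing
    ("features", PolynomialFeatures()),  # feature engineering
    ("model", Ridge()),                  # one candidate from model selection
])
# Hyperparameter optimization over a tiny illustrative grid
search = GridSearchCV(pipe, {"features__degree": [1, 2],
                             "model__alpha": [0.1, 1.0]}, cv=3)
search.fit(X_train, y_train)
holdout_score = search.score(X_test, y_test)  # R^2 on the holdout
```

AutoAI runs this kind of search over many algorithms and far larger parameter spaces, and ranks the resulting pipelines on the leaderboard.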

In this Think Lab, you will see how to join several data sources and then build an AutoAI experiment from the joined data. The scenario we’ll explore in Part A of the Lab is for an outdoor company that wants to project sales for each product in multiple retail stores. You will learn how to join several data sources related to a fictional outdoor store named Go, then build an experiment that uses the joined data to train a machine learning model. You will then deploy the resulting model and use it to predict daily sales for each product Go sells.

Project Requirements

IBM Cloud (Free) Lite Tier Account

Project Setup Steps

  1. Create an IBM Cloud Lite Tier Account
  2. Create a Watson Studio Instance
  3. Provision Watson Machine Learning & Cloud Object Storage Instances
  4. Create a New Project
  5. Download the Go Sample Dataset from the Gallery
  6. Unzip the Go Sample Dataset's .zip File
  7. Add the Go Sample Datasets to the Project

Project Setup

1. Create an IBM Cloud Lite Tier Account

2. Create a Watson Studio Instance


3. Provision Watson Machine Learning & Cloud Object Storage Instances


4. Create a New Project


5. Download the Go Sample Dataset from the Gallery


6. Unzip the Go Sample Dataset's .zip File


7. Add the Go Sample Datasets to the Project


In Tutorial A of this Think Lab, you will learn how to join several data sources related to a fictional outdoor store named Go, then build an experiment that uses the data to train a machine learning model. You will then deploy the resulting model and use it to predict daily sales for each product Go sells.

Joining data also allows for a specialized set of feature transformations and advanced data aggregators. After building the pipelines, you can explore the factors that produced each pipeline.
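One such aggregator is summarizing the many-rows side of a join into per-key statistics that become new candidate features. A minimal pandas sketch, with assumed column names, of what such an aggregation looks like:

```python
import pandas as pd

# Illustrative slice of a daily-sales table (column names are assumptions)
daily_sales = pd.DataFrame({
    "Product number": [1, 1, 2, 2],
    "Quantity": [5, 7, 3, 5],
})

# Aggregate per product: the mean and max sale quantities become
# engineered features of the kind AutoAI derives from joined tables.
agg = (daily_sales.groupby("Product number")["Quantity"]
       .agg(mean_quantity="mean", max_quantity="max")
       .reset_index())
```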

About the Data


The data you will join contains the following information:

  • Daily_sale: the GO company has many retailers selling its outdoor products. The daily sale table is a time series of sale records, where the DATE and QUANTITY columns indicate the sale date and the sale quantity for each product in a retail store.
  • Products: this table keeps product information such as product types and product names.
  • Retailers: this table keeps retailer information such as retailer names and addresses.
  • Methods: this table keeps order methods such as Via Telephone, Online, or Email.
  • Go: the GO company is interested in using this data to predict its daily sales for every product in its retail stores. The prediction target column is QUANTITY in the go table, and the DATE column indicates the cutoff time when the prediction should be made.

Steps Overview

This tutorial presents the basic steps for joining data sets, then training a machine learning model using AutoAI:

  • Add and join the data
  • Train the experiment
  • Deploy the trained model
  • Test the deployed model

Think Lab - Tutorial A Steps

  1. Create a New AutoAI Experiment
  2. Build the Data Join Schema
  3. Update the AutoAI Experiment Settings
  4. Run the AutoAI Experiment
  5. Explore the Holdout & Training Data Insights
  6. Deploy the Trained Model
  7. Score the Model
  8. View the Prediction Results

Think Lab - Tutorial A: Build & Deploy a Data Join Experiment

1. Create a New AutoAI Experiment

Add a New AutoAI Experiment to the Project


Associate a Machine Learning Service Instance


Select the Go Sample Datasets


2. Build the Data Join Schema

The main source contains the prediction target for the experiment. Select go_1k.csv as the main source, then click Configure join.

In the data join canvas you will create a left join that connects all of the data sources to the main source.
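A left join keeps every row of the main source and pulls in matching attributes from each secondary table. A minimal pandas equivalent of what the canvas builds, using illustrative stand-ins for the Go tables:

```python
import pandas as pd

# Toy stand-ins for go_1k.csv (main source) and go_products.csv;
# row values are invented for illustration.
go = pd.DataFrame({"Product number": [101, 102], "Quantity": [5, 8]})
products = pd.DataFrame({"Product number": [101, 102, 103],
                         "Product type": ["Tent", "Lantern", "Stove"]})

# how="left" keeps all rows of the main source, like the canvas join
joined = go.merge(products, on="Product number", how="left")
```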

Use the Data Join Table to Build the Schema


Drag from the node on one end of the go_1k.csv box to the node on the end of the go_products.csv box.

In the panel for configuring the join, click (+) to add the suggested key product_number as the join key.

Repeat the data join process until you have joined all the data tables.

The Completed Data Join Schema Should Look Like This:


Choose Quantity as the column to predict.
AutoAI analyzes your data and determines that the Quantity column contains a wide range of numeric information, making this data suitable for a regression model. The default metric for a regression model is Root Mean Squared Error (RMSE).

Note:

  • Based on analyzing a subset of the data set, AutoAI chooses a default model type: binary classification, multiclass classification, or regression. Binary is selected if the target column has two possible values, multiclass if it has a discrete set of 3 or more values, and regression if the target column is a continuous numeric variable. You can override this selection.
  • AutoAI chooses a default metric for optimizing. For example, the default metric for a binary classification model is Accuracy.
  • By default, ten percent of the training data is held out to test the performance of the model.
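
The selection heuristic described in the notes can be sketched as a small function; the threshold separating a "discrete set" from a continuous variable is an assumption here, since AutoAI does not publish its exact rule.

```python
def default_model_type(target, discrete_threshold=10):
    """Pick a default problem type per the heuristic above (illustrative).
    discrete_threshold is an assumed cutoff, not AutoAI's actual value."""
    distinct = set(target)
    if len(distinct) == 2:
        return "binary classification"
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                  for v in distinct)
    if numeric and len(distinct) > discrete_threshold:
        return "regression"
    return "multiclass classification"
```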

3. Update the AutoAI Experiment Settings

Click Experiment settings


Click the Join Tab on the Data sources Page


Enable the Timestamp Threshold

In the main data table, go_1k.csv, choose Date as the Cutoff time column and enter dd/MM/yyyy as the date format. No data after the date in the cutoff column will be considered for training the pipelines. Note: the date format must exactly match the format in the data or an error results.

In the data table go_daily_sales.csv, choose Date as a timestamp column so that AutoAI can enhance the set of features with timeseries related features. Enter dd/MM/yyyy as the date format. Note: The date format must exactly match the format in the data source or you will get an error running the experiment.
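
In pandas terms, the dd/MM/yyyy setting corresponds to the `%d/%m/%Y` format string, and the cutoff behaves like a date filter; a small sketch with invented rows:

```python
import pandas as pd

# Invented rows in the dd/MM/yyyy format the experiment expects
df = pd.DataFrame({"Date": ["05/01/2018", "20/01/2018", "03/02/2018"],
                   "Quantity": [4, 6, 9]})

# A string that does not match the format (e.g. "2018-01-05") raises an
# error here, just as a mismatched format fails the AutoAI experiment.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Rows after the cutoff date are excluded from training
cutoff = pd.Timestamp("2018-01-31")
training_rows = df[df["Date"] <= cutoff]
```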

Specify the Runtime Settings

After defining the experiment, you can allocate the resources for training the pipelines. Click Runtime to switch to the Runtime tab. Increase the number of executors to 10. Click Save settings to save the configuration changes.

4. Run the AutoAI Experiment


5. Explore the Holdout & Training Data Insights


6. Deploy the Trained Model

Click Save as and Select Model


Click Create


Create a New Deployment Space


Add the Go Sample Datasets to the Deployment Space


Promote the Trained Model to the Deployment Space


Associate a Machine Learning Service Instance with the Deployment Space


Deploy the Trained Model


Create a New Batch Deployment


7. Score the Model

To score the model, you create a batch job that will pass new data to the model for processing, then output the predictions to a file. Note: For this tutorial, you will submit the training files as the scoring files as a way to demonstrate the process and view results.
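Conceptually, a batch job loads scoring rows, runs them through the deployed model, and writes the predictions to an output file. A local stand-in for that flow, with an invented toy model and column names (the real job runs against the Watson Machine Learning deployment, not this code):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the deployed model, trained on invented data
train = pd.DataFrame({"Unit price": [10, 20, 30], "Quantity": [100, 80, 60]})
model = LinearRegression().fit(train[["Unit price"]], train["Quantity"])

# "Batch job": score a file of new rows, write predictions to the output file
scoring = pd.DataFrame({"Unit price": [15, 25]})
scoring["prediction"] = model.predict(scoring[["Unit price"]])
scoring.to_csv("go-sales-predictions.csv", index=False)
```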

Create a New Batch Job


Add the Scoring Files


You will see the training files listed. For each training file, click the Edit icon and choose the corresponding scoring file.

A Schema mismatch warning appears: "The column types in this data asset do not match the column types in the Model Schema." Click Continue to select the file anyway.

Add go-sales-predictions.csv as the Output file name.

Run the Batch Job

When the uploads are complete, click Create to run the job.

View the Batch Job


Wait for the Batch Job to Complete

Starting...

Running...

Completed

8. View the Prediction Results

Download go-sales-predictions.csv to view the prediction results.

View the go-sales-predictions.csv prediction results in Excel.

Tune in Next Week for Tutorial B: AutoAI Data Join Multi-Classification

In Tutorial B of this Think Lab, you will use IBM AutoAI to automate data analysis for a dataset collected from a fictional call center. The objective of the analysis is to gain more insight into factors that impact customer experience so that the company can improve customer service. The data consists of historical information about customer interaction with call agents, call type, customer wireless plans, and call type resolution.
