
Suman Debnath for AWS


Machine Learning in SQL Style (Part-1)

imgh1

Machine learning (ML) is everywhere; look around and you will see that one application or another is either built using ML or powered by it. And with the advent of technology, especially the cloud, ML is becoming more and more accessible to developers every day, irrespective of their background. We at Amazon Web Services (AWS) are committed to putting machine learning in the hands of every developer, data scientist, and expert practitioner. Now, what if you could create, train, and deploy a machine learning model using simple SQL commands?

During re:Invent 2020 we announced Amazon Redshift ML, which makes it easy for SQL users to create, train, and deploy ML models using familiar SQL commands. Amazon Redshift ML allows you to use your data in Amazon Redshift with Amazon SageMaker (a fully managed ML service), without requiring you to become an expert in ML.

Now, before we dive deep into what it is, how it works, and so on, here is what we will try to cover in this first part of the tutorial:

  • What is Amazon Redshift
  • Introduction to Redshift ML
  • How to get started and the prerequisites
  • I am a Database Administrator - What's in it for me?

And in Part 2, we will take that learning further and cover the following:

  • I am a Data Analyst - What about me?
  • I am a Data Scientist - How can I make use of this?

Overall, we will try to solve different problems which will help us understand Amazon Redshift ML from the perspectives of a database administrator, a data analyst, and an advanced machine learning expert.

Before we get started, let's set the stage by reviewing what Amazon Redshift is.

Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehousing service on AWS. It is a low-cost and highly scalable service, which allows you to get started on your data warehouse use cases at a minimal cost and scale as the demand for your data grows. It uses a variety of innovations to obtain very high query performance on datasets ranging in size from a hundred gigabytes to a petabyte or more. It uses massively parallel processing (MPP), columnar storage, and data compression encoding schemes to reduce the amount of I/O needed to perform queries, which allows it to distribute SQL operations and take advantage of all the available resources underneath.

Let's quickly go over a few core components of an Amazon Redshift cluster:

img1

Client Application

Amazon Redshift integrates with various data loading and ETL (extract, transform, and load) tools as well as business intelligence (BI) reporting, data mining, and analytics tools. As Amazon Redshift is based on industry-standard PostgreSQL, most commonly used SQL client applications should work. We are going to use JetBrains DataGrip to connect to our Redshift cluster (via a JDBC connection) later when we jump into the hands-on section. Having said that, you may prefer any other SQL client tool, like SQL Workbench/J, the psql tool, etc.

Cluster

The core infrastructure component of an Amazon Redshift data warehouse is a cluster. A cluster is composed of one or more compute nodes. As shown in the image above, Redshift has two major node types: the leader node and the compute node.

Leader Node

If we create a cluster with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication. We don't have to define a leader node; it is automatically provisioned with every Redshift cluster. Once the cluster is created, the client application interacts directly only with the leader node. In other words, the leader node acts as the gateway (the SQL endpoint) of your cluster for all clients. A few of the major tasks of the leader node are to store the metadata, coordinate with all the compute nodes for parallel SQL processing, and generate the most optimized and efficient query plan.

Compute Nodes

The compute nodes are the main workhorses of the Redshift cluster, and they sit behind the leader node. The leader node compiles code for individual elements of the execution plan and assigns the code to individual compute node(s). The compute node(s) then execute the respective compiled code and send intermediate results back to the leader node for final aggregation. Each compute node has its own dedicated CPU, memory, and attached storage, which are determined by the node type.

The node type determines the CPU, RAM, storage capacity, and storage drive type for each node. Amazon Redshift offers different node types to accommodate different types of workloads, so you can select whichever suits you best, but it is recommended to use RA3. The new RA3 nodes let you determine how much compute capacity you need to support your workload and then scale the amount of storage based on your needs.

Ok, now that we understand a bit about the Redshift cluster, let's go back to the main topic, Redshift ML :)
And don't worry if things still feel a bit dry; as soon as we jump into the demo and create a cluster from scratch, things will fall into place.

Introduction to Redshift ML

We have been integrating ML functionality with many other services for a long time; for example, at re:Invent 2019 we announced Amazon Aurora Machine Learning, which enables you to add ML-based predictions to your applications via the familiar SQL programming language. Integration with ML is very important in the world we live in today. It helps any developer build, train, and deploy ML models efficiently and at scale.

Following that tradition, during re:Invent 2020 we announced this new capability called Redshift ML, which enables any SQL user to train, build, and deploy ML models using familiar SQL commands, without knowing much about machine learning. Having said that, if you are an intermediate machine learning practitioner or an expert data scientist, you still get the flexibility to choose specific algorithms such as XGBoost and to specify hyperparameters and preprocessors.

The way it works is pretty simple: you provide the data that you want to train the model on, along with metadata associated with the data inputs, to Amazon Redshift, and Amazon Redshift ML creates a model that captures the patterns in the input data. Once the model is trained, you can use it to generate predictions for new input data without incurring additional costs.

As of now, Amazon Redshift supports supervised learning, which includes the following problem types:

  • regression: problem of predicting continuous values, such as the total spending of customers

  • binary classification: problem of predicting one of two outcomes, such as predicting whether a customer churns or not

  • multi-class classification: problem of predicting one of many outcomes, such as predicting the item a customer might be interested in

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value.

The inputs used for the ML model are often referred to as features (or, in ML terms, independent variables), and the outcomes or results are called labels (or dependent variables). Your training dataset is a table or a query whose columns comprise the features and the target, extracted from your data warehouse. The following diagram illustrates this architecture.

img2

Now, we understand that data analysts and database developers are very familiar with SQL, as they use it day in and day out. But to build, train, and deploy any ML model in Amazon SageMaker, one needs to learn a programming language (like Python), study different types of machine learning algorithms, and build an understanding of which algorithm to use for a particular problem. Or else you have to rely on an ML expert to do the job on your behalf.

Not just that: even if someone helped you build, train, and deploy your ML model, when you actually need to use the model to make predictions on your new data, you have to repeatedly move the data back and forth between Amazon Redshift and Amazon SageMaker through a series of manual and complicated steps:

  1. Export training data to Amazon Simple Storage Service (Amazon S3).
  2. Train the model in Amazon SageMaker.
  3. Export prediction input data to Amazon S3.
  4. Run predictions in Amazon SageMaker.
  5. Import predicted columns back into the database.

img3

All this is daunting, isn't it?

But now, with Amazon Redshift ML, we don't have to do any of this: you can train a model with a single SQL CREATE MODEL command. So you don't need expertise in machine learning tools, languages, algorithms, or APIs.

Once you run the SQL command to create the model, Amazon Redshift ML securely exports the specified data from Amazon Redshift to Amazon S3 and calls SageMaker Autopilot to automatically prepare the data, select the appropriate pre-built algorithm, and apply the algorithm for model training. Amazon Redshift ML handles all the interactions between Amazon Redshift, Amazon S3, and SageMaker, abstracting the steps involved in training and compilation.

And once the model is trained, Amazon Redshift ML makes the model available as a SQL function in your Amazon Redshift data warehouse.

img4
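To give a feel for the shape of this workflow before we build the real thing, here is a minimal sketch. The table, column, and function names, the IAM role ARN, and the S3 bucket below are purely hypothetical placeholders; the actual commands we run appear in the hands-on section later.

```sql
-- Train: Redshift ML exports the query result to S3, invokes SageMaker Autopilot,
-- and registers the trained model as a SQL function (all names are hypothetical).
CREATE MODEL demo_schema.customer_churn_model
FROM (SELECT age, plan_type, monthly_usage, churned FROM demo_schema.customer_activity)
TARGET churned
FUNCTION fn_predict_churn
IAM_ROLE 'arn:aws:iam::<account-id>:role/RedshiftMLRole'
SETTINGS (S3_BUCKET 'my-redshift-ml-bucket');

-- Predict: once training finishes, the model is just a SQL function you can call.
SELECT customer_id, fn_predict_churn(age, plan_type, monthly_usage) AS will_churn
FROM demo_schema.new_customers;
```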

Ok, let's see it all in action now...

Create Redshift Cluster

Let's create a Redshift cluster now. First we need to log in to the AWS Console, search for Redshift, and click on Create cluster.

img5

Next, in the Cluster configuration section, we need to provide a cluster identifier, say redshift-cluster-1, and select the appropriate node type and the number of nodes we would like to have in the cluster. As mentioned before, we recommend choosing the RA3 node types, like ra3.xlplus, ra3.4xlarge, and ra3.16xlarge, which offer best-in-class performance with scalable managed storage. For our demo we will select the ra3.4xlarge node type and create the cluster with 2 such nodes.

img6

After that, under Database configuration, we need to provide our database name, port number (where the database will accept inbound connections), master user name, and password.

img7

Next, we need to expand the Cluster permissions section and attach an IAM role. Since our cluster will use Amazon S3 and Amazon SageMaker later on, we need to provide adequate permissions so that our Redshift cluster can access data saved in Amazon S3, and Redshift ML can access Amazon SageMaker to build and train the model. We have already created an IAM role named RedshiftMLRole. We can just select the right IAM role from the dropdown and click on Associate IAM role.

img8

If you want to create an IAM role with a more restrictive policy, you can use a policy like the following. You can also modify this policy to meet your needs.

img9

Also, if you would like to connect to this cluster from instances/devices outside the VPC via the cluster endpoint, you would need to enable the Publicly accessible option as below. However, enabling public access is not recommended; in our demo we are going to use an Amazon EC2 instance to connect to the cluster via SSH tunneling:

img11

Just review all the configurations and click on Create cluster.

img12

Connecting to Redshift Cluster

Next, we can use any tool of our choice to connect to our cluster; we are going to use JetBrains DataGrip.

Now, if you had created the cluster with Publicly accessible enabled, you could connect to it directly, but since we created the cluster without public access, we are going to use an Amazon EC2 instance to connect to the cluster via SSH tunneling, as mentioned above. For that, we have already created an Amazon EC2 instance in the same region as our Redshift cluster, and we are going to use that instance to access the cluster via an SSH tunnel.

But before we connect, we first need to know the JDBC URL endpoint of our cluster. For that, we can click on our cluster and copy the JDBC URL to our clipboard.

img13

img14

Now, we can open DataGrip (or any tool of your choice) and connect to the cluster using the JDBC URL, user name, and password which we used while creating the cluster, and test the connection.

img15

Then go to the SSH/SSL option to add the tunnel; this is where we need to enter the details of the Amazon EC2 instance which we created earlier. Once that is done, we can click on Test Connection to check that everything is working fine.

img15

Ok, we are now all set to see Redshift ML in action :)

Dataset

We are going to see 3 demos next, showing different aspects and functionalities of Redshift ML, which will hopefully help you understand different use cases and learn how you can make use of Redshift ML irrespective of your background. Whether you are a database engineer/administrator, a data analyst, or an advanced machine learning practitioner, we will cover a demo from the perspective of each of these personas.

First we need to make sure we upload the datasets to S3 (we have uploaded all the datasets to our Amazon S3 bucket, redshift-downloads-2021). All the datasets can be found inside this GitHub repo.

Exercise 1 (Database Engineer's perspective)

Dataset

In this problem, we are going to use the Bank Marketing Data Set from the UCI Machine Learning Repository. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').

The objective is to predict whether the client will subscribe (yes/no) to a bank term deposit (variable y).

The dataset consists of a total of 20 features/input variables and one class label/output variable.

Since our dataset is located in Amazon S3, first we need to load the data into a table. We can open DataGrip (or whatever SQL connector you are using) and create the schema and the table. Once that is done, we can use the COPY command to load the training data from Amazon S3 (bank-additional-full.csv) into the Redshift cluster, into the table client_details.

We need to make sure that the column names of the table match the feature set in the CSV training dataset file.

img16
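The screenshot above shows the schema creation and the COPY; since the image isn't reproduced here, below is a rough sketch of what it might look like. The schema name, the column names and types (taken from the UCI dataset description, with dots replaced by underscores), the S3 key, the delimiter, and the IAM role ARN are all assumptions, so adjust them to match your own setup.

```sql
CREATE SCHEMA IF NOT EXISTS demo_ml;

-- Column names/types are assumptions based on the UCI Bank Marketing dataset.
CREATE TABLE demo_ml.client_details (
    age            INTEGER,
    job            VARCHAR,
    marital        VARCHAR,
    education      VARCHAR,
    default_status VARCHAR,   -- "default" in the UCI file; renamed here to avoid the reserved word
    housing        VARCHAR,
    loan           VARCHAR,
    contact        VARCHAR,
    month          VARCHAR,
    day_of_week    VARCHAR,
    duration       INTEGER,
    campaign       INTEGER,
    pdays          INTEGER,
    previous       INTEGER,
    poutcome       VARCHAR,
    emp_var_rate   FLOAT,
    cons_price_idx FLOAT,
    cons_conf_idx  FLOAT,
    euribor3m      FLOAT,
    nr_employed    FLOAT,
    y              VARCHAR    -- the class label: did the client subscribe (yes/no)?
);

-- Load the training data from S3 (the object key and delimiter are assumptions;
-- the raw UCI file is semicolon-delimited, so drop DELIMITER if yours uses commas).
COPY demo_ml.client_details
FROM 's3://redshift-downloads-2021/bank-additional-full.csv'
IAM_ROLE 'arn:aws:iam::<account-id>:role/RedshiftMLRole'
CSV DELIMITER ';' IGNOREHEADER 1;

-- client_details_inference is created and loaded the same way from bank-additional-inference.csv.
```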

Similarly, we can load the dataset for testing (bank-additional-inference.csv) into a separate table, client_details_inference.

img17

Users and Groups

Before we try to create the model, we need to make sure that the user has the right permissions. Just like how Amazon Redshift manages other database objects, such as tables, views, or functions, Amazon Redshift binds model creation and use to its access control mechanisms. There are separate privileges for creating a model and for running prediction functions.

Here we are going to create 2 user groups, dbdev_group (users who will use the model for prediction) and datascience_group (users who will create the model), and within these groups we will have one user each, dbdev_user and datascience_user respectively.

img18
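The screenshot above covers the group and user creation; a sketch of the equivalent SQL (the passwords are placeholders) might look like this:

```sql
-- Group for users who will create/train models
CREATE GROUP datascience_group;
-- Group for users who will only run predictions
CREATE GROUP dbdev_group;

-- One user in each group (placeholder passwords; use your own)
CREATE USER datascience_user PASSWORD 'Str0ngPassw0rd1' IN GROUP datascience_group;
CREATE USER dbdev_user       PASSWORD 'Str0ngPassw0rd2' IN GROUP dbdev_group;
```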

Next, we can grant the appropriate access/permission to the respective group and authorize the user datascience_user as the current user.

img19
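And the grants plus the switch of the current user (as in the screenshot above) could look roughly like the sketch below; the schema name demo_ml is the same assumption used earlier, and the extra schema/table grants are only there so the example is self-contained.

```sql
-- Allow the data science group to create models and read the training data
GRANT CREATE MODEL TO GROUP datascience_group;
GRANT USAGE, CREATE ON SCHEMA demo_ml TO GROUP datascience_group;
GRANT SELECT ON ALL TABLES IN SCHEMA demo_ml TO GROUP datascience_group;

-- Run the following statements as the model-building user
SET SESSION AUTHORIZATION 'datascience_user';
SELECT current_user;   -- sanity check
```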

Training (Model Creation)

Finally, we are all set to create the model using a simple CREATE MODEL command. Under the hood, it will export the training data, train a model, import the model, and prepare an Amazon Redshift prediction function.

img20
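The exact statement is in the screenshot above; a sketch of it, re-using the assumed schema and column names from earlier (the staging S3 bucket and the IAM role ARN are also assumptions), would be along these lines:

```sql
CREATE MODEL demo_ml.model_bank_marketing
FROM (
    SELECT age, job, marital, education, default_status, housing, loan, contact,
           month, day_of_week, duration, campaign, pdays, previous, poutcome,
           emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed,
           y                                   -- the label; identified via TARGET below
    FROM demo_ml.client_details
)
TARGET y
FUNCTION func_model_bank_marketing2
IAM_ROLE 'arn:aws:iam::<account-id>:role/RedshiftMLRole'
SETTINGS (
    S3_BUCKET 'redshift-downloads-2021'        -- bucket used to stage the exported training data
);
```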

Two things to note here:

  • The SELECT query above provides the training data; every column except y is treated as an input feature
  • The TARGET clause tells CREATE MODEL which column in that query is the class label it should learn to predict, i.e. the y column

Behind the scenes, Amazon Redshift will use Amazon SageMaker Autopilot for training. At this point, Amazon Redshift will immediately start using Amazon SageMaker to train and tune the best model for this binary classification problem (as the output or class label can be either yes or no).

The CREATE MODEL command operates in an asynchronous mode and returns as soon as the training data has been exported to Amazon S3. The remaining steps of model training and compilation can take longer, so they continue to run in the background.

But we can always check the status of the training using the STV_ML_MODEL_INFO system table.

img21
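Since STV_ML_MODEL_INFO is a system table, a quick status check is just an ordinary query against it, as in this sketch:

```sql
-- Shows the current state of each model while CREATE MODEL runs in the background
SELECT * FROM stv_ml_model_info;
```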

Once the training is done, we can use the SHOW MODEL ALL command to see all the models which we have access to:

img22
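For reference, that command is simply:

```sql
SHOW MODEL ALL;
```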

We can also see some more details about the model, e.g. model performance (accuracy, F1 score, MSE, etc., depending on the problem type), model type, problem type, and so on.

img23
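Those details come from running SHOW MODEL against a specific model, for example (using the model name assumed in the earlier sketch):

```sql
-- Prints model type, problem type, validation metrics, prediction function name, etc.
SHOW MODEL demo_ml.model_bank_marketing;
```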

Accuracy of the Model and Prediction/Inference

Now that we have the new SQL function, func_model_bank_marketing2, we can use it for prediction. But before we do so, let's first grant the appropriate access/permission to the dbdev_group so that the dbdev_user can use the function for prediction. Once that is done, we can change the authorization to dbdev_user, as we expect the prediction operation to be executed by the database engineer or the data analyst and not necessarily only by the data scientists in the organization.

img24
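A sketch of that grant and the switch to dbdev_user (model, schema, and table names as assumed earlier; the extra grants just make the example self-contained):

```sql
-- Let the prediction-only group call the model's SQL function and read the test data
GRANT EXECUTE ON MODEL demo_ml.model_bank_marketing TO GROUP dbdev_group;
GRANT USAGE ON SCHEMA demo_ml TO GROUP dbdev_group;
GRANT SELECT ON demo_ml.client_details_inference TO GROUP dbdev_group;

-- Continue as the database engineer / analyst user
SET SESSION AUTHORIZATION 'dbdev_user';
```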

First, let's see what the accuracy of our model is, using the test data which we have in the client_details_inference table.

img25
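A sketch of such an accuracy check is below. The feature column names are the same assumptions as before, and the prediction function's arguments must be passed in the same order as the feature columns in the training query.

```sql
SELECT
    ROUND(100.0 * SUM(CASE WHEN predicted = actual THEN 1 ELSE 0 END) / COUNT(*), 2)
        AS accuracy_pct
FROM (
    SELECT y AS actual,
           func_model_bank_marketing2(age, job, marital, education, default_status,
                                      housing, loan, contact, month, day_of_week,
                                      duration, campaign, pdays, previous, poutcome,
                                      emp_var_rate, cons_price_idx, cons_conf_idx,
                                      euribor3m, nr_employed) AS predicted
    FROM demo_ml.client_details_inference
) t;
```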

As we can see, the accuracy is around 94%, which is not bad at all considering the small dataset we used for this problem. More importantly, we can see how easily we can use simple SQL queries to create, train, and deploy our ML models using Redshift ML.

And finally, let's try to make some predictions using this same model function.

img26
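A final prediction query could look like this sketch, calling the model function on new rows and returning the predicted label alongside a few identifying columns (column names are assumptions, as above):

```sql
SELECT age, job, marital,
       func_model_bank_marketing2(age, job, marital, education, default_status,
                                  housing, loan, contact, month, day_of_week,
                                  duration, campaign, pdays, previous, poutcome,
                                  emp_var_rate, cons_price_idx, cons_conf_idx,
                                  euribor3m, nr_employed) AS predicted_subscription
FROM demo_ml.client_details_inference
LIMIT 10;
```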

In Part 2 of this tutorial series we will cover a few more advanced functionalities from the viewpoint of a data analyst or an expert data scientist, wherein you can define many advanced options, like the model type, hyperparameters, objective function, pre-processors, and so on.

But before we move on to Part 2, let's spend some time understanding the cost considerations of using Redshift ML and how you can control them.

Cost and Redshift ML

As Amazon Redshift ML uses the existing cluster resources for prediction, there are no additional Amazon Redshift charges. That means there is no additional Amazon Redshift charge for creating or using a model,
and as prediction happens locally in your Amazon Redshift cluster, you don't have to pay extra for it.

But, as we learnt, Amazon Redshift ML uses Amazon SageMaker for training our model, and that does have an additional associated cost.

The CREATE MODEL statement uses Amazon SageMaker, as we have seen before, and that incurs an additional cost. The cost increases with the number of cells in your training data. The number of cells is the number of records (in the training query or table) multiplied by the number of columns. For example, when the SELECT query of the CREATE MODEL statement produces 100,000 records with 50 columns, the number of cells it creates is 5,000,000.

One way to control the cost is by using the two options MAX_CELLS and MAX_RUNTIME in the CREATE MODEL statement. MAX_RUNTIME specifies the maximum amount of time the training can take in SageMaker when the AUTO ON/OFF option is used. Training jobs can complete sooner than MAX_RUNTIME, depending on the size of the dataset, but there is additional work which Amazon Redshift performs after the model is trained, like compiling it and installing it in your cluster, so the CREATE MODEL command can take a little longer than MAX_RUNTIME to complete. This option can be used to limit the cost, as it controls the time Amazon SageMaker is allowed to spend training your model.

Under the hood, when you run CREATE MODEL with AUTO ON, Amazon Redshift ML uses SageMaker Autopilot, which automatically explores different models (or candidates) to find the best one. MAX_RUNTIME limits the amount of time and computation spent, and if MAX_RUNTIME is set too low, there might not be enough time to explore even a single candidate. In that case you would get an error saying "Autopilot candidate has no models", and you would need to re-run CREATE MODEL with a larger MAX_RUNTIME value.

Another way to control the cost of training (which may not always be a good idea, as it can affect model accuracy) is to specify a smaller MAX_CELLS value when you run the CREATE MODEL command. MAX_CELLS limits the number of cells, and thus the number of training examples used to train your model.

By default, MAX_CELLS is set to 1 million cells. Reducing MAX_CELLS reduces the number of rows from the result of the SELECT query in CREATE MODEL that Amazon Redshift exports and sends to SageMaker to train a model. Reducing MAX_CELLS thus reduces the size of the dataset used to train models both with AUTO ON and AUTO OFF. This approach helps reduce the costs and time to train models.
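Both knobs are passed in the SETTINGS clause of CREATE MODEL. As a hedged sketch (the values are arbitrary examples, and the other names re-use the assumptions from earlier):

```sql
CREATE MODEL demo_ml.model_bank_marketing_budget
FROM (SELECT * FROM demo_ml.client_details)
TARGET y
FUNCTION func_model_bank_marketing_budget
IAM_ROLE 'arn:aws:iam::<account-id>:role/RedshiftMLRole'
SETTINGS (
    S3_BUCKET   'redshift-downloads-2021',
    MAX_RUNTIME 1800,      -- cap SageMaker training time at 30 minutes
    MAX_CELLS   500000     -- cap rows x columns exported for training
);
```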

In summary, by increasing MAX_RUNTIME and MAX_CELLS we can often improve model quality, as it allows Amazon SageMaker to explore more candidates and gives it more training data to train better models.

What next...

So, in this tutorial we learnt a bit about what Amazon Redshift is and how you can create, train, and deploy an ML model using familiar SQL queries from a database engineer's/administrator's perspective. In the next part we will explore a little bit more of how you can make use of Amazon Redshift ML if you are an advanced data analyst or a data scientist, and we shall explore some of the advanced options it has to offer.

