Let's go through a detailed example where we build an end-to-end pipeline, right from loading the data into Cloud Storage, to creating a BigQuery dataset over it, training the model using BigQuery ML, and testing it. In this use case, we will use a logistic regression model to find the lead conversion probability.
The leads data contains various attributes about prospective customers. BigQuery ML has built-in functionality that lets us train a model directly over any dataset stored in BigQuery: we can predict the output variable and obtain the conversion probability. BigQuery provides a SQL interface to train and evaluate the machine learning model, and the trained model can be deployed on the platform for consumption.
We have two datasets: leads training data and test data, where the training data is 80% of the actual overall data and the test data is 20%. Once the model is trained using the training data, we will evaluate the model on the test data and find the leads conversion probability for each prospect in the following categories:
- Junk lead
The following chart represents the end-to-end process of loading data into Cloud Storage and BigQuery, and training a model and testing it using the leads data. You can choose a dataset of your choice:
From the preceding diagram, we can see the following:
We have loaded the training and test dataset for leads into Cloud Storage buckets.
After loading data into Cloud Storage, we will create the leads dataset in BigQuery with two tables, namely, leads_training and leads_test.
Once the dataset is created, we will use the leads_training table to train our model and the leads_test table to test the model.
Let's discuss the step-by-step process to load data into Cloud Storage:
- You should have the training and test data ready.
- Create a training bucket and a test bucket in Cloud Storage.
- From the GCP Console, click on the navigation menu in the top left, and in the storage section click on Storage (Cloud Storage).
- Click on Create a bucket at the top. You will see the following screen:
- Give a globally unique name to the bucket.
- Choose a regional bucket for this use case.
- Select the Location where you want to create the bucket.
- Click on Create.
- Upload the training and test data to their respective buckets by clicking on the bucket, and then either using the upload files option or dragging and dropping the files into the bucket window.
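The console steps above can also be scripted with the `gsutil` and `bq` command-line tools from the Cloud SDK. This is a minimal sketch; the bucket names, region, and local file names are hypothetical, and the table names are assumed to match those used in the queries later in this section:

```shell
# Create regional buckets for the training and test data
# (bucket names must be globally unique -- these are placeholders)
gsutil mb -l us-central1 gs://my-leads-training-bucket/
gsutil mb -l us-central1 gs://my-leads-test-bucket/

# Upload the local CSV files to their respective buckets
gsutil cp leads_training.csv gs://my-leads-training-bucket/
gsutil cp leads_test.csv gs://my-leads-test-bucket/

# Create the BigQuery dataset and load both tables from Cloud Storage
bq mk Leads
bq load --autodetect --source_format=CSV \
  Leads.Leads_Training_Data gs://my-leads-training-bucket/leads_training.csv
bq load --autodetect --source_format=CSV \
  Leads.Leads_Test_Data gs://my-leads-test-bucket/leads_test.csv
```

Scripting the load this way makes the pipeline repeatable, which is useful when the leads data is refreshed periodically.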
The following BigQuery code snippet will be used to train the leads model using logistic regression over the Leads.Leads_Training_Data table:
CREATE MODEL `Leads.lead_model_optimum`
OPTIONS (model_type = 'logistic_reg') AS
SELECT
  Lead_Stage AS label,
  lead_origin,
  lead_source,
  ...,
  receive_more_updates_about_our_courses,
  update_me_on_supply_chain_content,
  Get_updates_on_PGDMHBSCM,
  city_new,
  ...,
  Asymmetrique_Activity_Score,
  Asymmetrique_Profile_Score,
  Last_Notable_Activity
FROM
  `Leads.Leads_Training_Data`;
In BigQuery, you can use the ml.evaluate() function to evaluate any model; for a logistic regression model it returns evaluation metrics such as precision, recall, accuracy, and ROC AUC. The following code block contains the BigQuery code for model evaluation. Let's have a look at it:
SELECT *
FROM ml.evaluate(MODEL `Leads.lead_model_optimum`,
  (SELECT Lead_Stage AS label, * FROM `Leads.Leads_Training_Data`));
In the preceding code, we have evaluated lead_model_optimum to find its details. Let's have a look at the following results, after executing the preceding query:
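Beyond these aggregate metrics, BigQuery ML can also break the results down by predicted versus actual class with the ML.CONFUSION_MATRIX function. This is a sketch that assumes the same model and evaluation data as in the preceding query:

```sql
-- Confusion matrix for the lead model: rows are actual Lead_Stage values,
-- columns are the counts predicted for each class
SELECT *
FROM ML.CONFUSION_MATRIX(MODEL `Leads.lead_model_optimum`,
  (SELECT Lead_Stage AS label, * FROM `Leads.Leads_Training_Data`));
```

A confusion matrix is particularly useful for a leads use case, because it shows whether the model confuses any particular pair of lead categories more than the others.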
In BigQuery, the ml.predict() function is used to predict outcomes using the model. Execute the following BigQuery code to test your model:
SELECT prospect_id, predicted_label
FROM ml.predict(MODEL `Leads.lead_model_optimum`,
  (SELECT * FROM `Leads.Leads_Test_Data`));
In the preceding code, the model predicts the Lead_Stage for each prospect_id in the test data.
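Since this use case is about the lead conversion probability, not just the predicted class, it is worth noting that ml.predict() for a logistic regression model also returns a predicted_label_probs column: an array with one probability per class. A sketch of how to unnest it, assuming the same model and test table as above:

```sql
-- Per-class probabilities for each prospect, highest probability first
SELECT
  prospect_id,
  predicted_label,
  probs.label AS candidate_label,
  probs.prob AS probability
FROM ml.predict(MODEL `Leads.lead_model_optimum`,
       (SELECT * FROM `Leads.Leads_Test_Data`)),
  UNNEST(predicted_label_probs) AS probs
ORDER BY prospect_id, probs.prob DESC;
```

The probability attached to the predicted label can then be used to rank prospects for follow-up.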
You can see the resulting screenshot. Compare the model's predictions with the Lead_Stage column of the test data, matching rows on prospect_id, to gauge the accuracy of the model.
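This comparison can also be computed directly in SQL rather than by eye. The following is a minimal sketch that assumes the test table still contains the actual Lead_Stage column; since ml.predict() passes the input columns through to its output, the predicted and actual labels can be compared in a single query:

```sql
-- Fraction of test rows where the predicted label matches the actual Lead_Stage
SELECT
  COUNTIF(predicted_label = Lead_Stage) / COUNT(*) AS accuracy
FROM ml.predict(MODEL `Leads.lead_model_optimum`,
  (SELECT * FROM `Leads.Leads_Test_Data`));
```

This gives a single accuracy figure for the held-out 20% of the data, complementing the metrics returned by ml.evaluate().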
Book: Hands-on AI on GCP