We all know that Python can do a lot with machine learning, but did you know you can write Python code to take advantage of Microsoft Azure's cloud-based compute, storage, and automation capabilities?
Let's take a look at how the Azure ML Python SDK lets you register datasets, train models using automated machine learning, evaluate the performance of those models, register them for future use, and even deploy them as REST endpoints for others to use - all in around 100 lines of code.
This content is also available in video form on YouTube.
Automated machine learning starts with you selecting a dataset and a high-level goal, such as predicting car prices or determining whether a mole is likely to be cancerous. Automated ML then selects a machine learning algorithm and its hyperparameters for you by trying as many candidate algorithms as it can within a window of time and narrowing in on the best-performing ones.
This helps new data scientists significantly reduce the learning curve by allowing them to focus on the core task they want to perform instead of memorizing the available algorithms and their hyperparameters. It can also help more experienced data scientists find better-performing algorithms they may not have considered.
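To make that search loop concrete, here is a toy sketch in plain Python. It is not how Azure implements automated ML; the candidate models, the data, and the `toy_automl` helper are all made up for illustration. It simply tries each candidate within a time budget and keeps the one with the lowest validation error.

```python
import math
import time

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def fit_mean(xs, ys):
    """Baseline: always predict the mean of the training labels."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Least-squares straight line through the training points."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)

def make_knn(k):
    """Return a trainer for k-nearest-neighbours regression (k is the hyperparameter)."""
    def fit(xs, ys):
        points = list(zip(xs, ys))
        def predict(x):
            nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
            return sum(y for _, y in nearest) / len(nearest)
        return predict
    return fit

def toy_automl(train, valid, candidates, budget_seconds=1.0):
    """Try candidates until the time budget runs out; return (name, score) of the best."""
    deadline = time.monotonic() + budget_seconds
    best = None
    for name, fit in candidates:
        if time.monotonic() > deadline:
            break  # out of time: keep the best model found so far
        model = fit(*train)
        score = rmse(valid[1], [model(x) for x in valid[0]])
        if best is None or score < best[1]:
            best = (name, score)
    return best

# Hypothetical data that follows y = 2x + 1
train = ([0, 1, 2, 3, 4, 5], [1, 3, 5, 7, 9, 11])
valid = ([6, 7], [13, 15])
candidates = [("mean", fit_mean), ("line", fit_line),
              ("knn-1", make_knn(1)), ("knn-3", make_knn(3))]

best_name, best_score = toy_automl(train, valid, candidates)
print(best_name, round(best_score, 4))
```

The real service searches a much larger space and runs candidates in parallel, but the shape of the loop is the same: fit, score against held-out data, keep the winner.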
Automated ML is available on the web in Azure Machine Learning Studio, but running your experiments directly from Python code using the Azure ML Python SDK is faster and makes them easier to share with others.
Let's take a look at how a typical experiment works.
The first thing we need to do is connect to an Azure Machine Learning Workspace. We do this by downloading a `config.json` file to our local directory and then calling `Workspace.from_config()` to connect to that workspace:
```python
# Load the workspace information from config.json using the Azure ML SDK
from azureml.core import Workspace

ws = Workspace.from_config()
ws.name
```
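If the connection fails, a malformed `config.json` is a common culprit. The file downloaded from the Azure portal holds a subscription ID, a resource group, and a workspace name. Below is a small sketch that checks for those three keys before you hand the file to the SDK; the helper names `find_missing_keys` and `load_workspace_config` are my own and not part of the SDK.

```python
import json
from pathlib import Path

# The three keys the downloaded config.json is expected to contain
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")

def find_missing_keys(config):
    """Return the required keys that are absent or empty in a parsed config dict."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]

def load_workspace_config(path="config.json"):
    """Parse config.json and raise ValueError if a required key is missing."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = find_missing_keys(config)
    if missing:
        raise ValueError("config.json is missing: " + ", ".join(missing))
    return config
```

Calling `load_workspace_config()` before `Workspace.from_config()` gives you a clearer error message when the file is incomplete.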
Next, we need to either get a pre-provisioned compute cluster we've set up or create a new one:
```python
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Now let's make sure we have a compute resource created
cluster_name = "My-Cluster"      # The name of the cluster
vm_size = 'Standard_D2DS_V4'     # There are many different specs available for CPU or GPU tasks.
min_nodes = 0                    # This is important to prevent billing while idle
max_nodes = 4                    # Azure does limit you to a certain quota, but you can get that extended

# Fetch or create the compute resource
try:
    # This will throw a ComputeTargetException if this doesn't exist
    cpu_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Using existing compute: ' + cluster_name)
except ComputeTargetException:
    # Create the cluster
    print('Provisioning cluster...')
    compute_config = AmlCompute.provisioning_configuration(vm_size=vm_size,
                                                           min_nodes=min_nodes,
                                                           max_nodes=max_nodes)
    cpu_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

# Ensure the cluster is ready to go
cpu_cluster.wait_for_completion(show_output=True)
```
This allows us to use an Azure-based compute cluster for our machine learning tasks and only pay for the time we use.
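To see why `min_nodes = 0` matters, here is a back-of-the-envelope sketch. The hourly rate is a made-up number, not a real Azure price, and the model deliberately ignores the short delay before an idle cluster scales down.

```python
def estimate_cluster_cost(busy_node_hours, idle_hours, min_nodes, hourly_rate):
    """Rough cost: node-hours spent working plus min_nodes kept running while idle."""
    if min(busy_node_hours, idle_hours, min_nodes, hourly_rate) < 0:
        raise ValueError("all inputs must be non-negative")
    return (busy_node_hours + idle_hours * min_nodes) * hourly_rate

HYPOTHETICAL_RATE = 0.10  # made-up price per node-hour, not a real Azure price

# 8 node-hours of training, then 700 hours with nothing running
scale_to_zero = estimate_cluster_cost(8, 700, min_nodes=0, hourly_rate=HYPOTHETICAL_RATE)
always_on = estimate_cluster_cost(8, 700, min_nodes=1, hourly_rate=HYPOTHETICAL_RATE)
print(f"min_nodes=0: {scale_to_zero:.2f}, min_nodes=1: {always_on:.2f}")
```

Under these assumptions, keeping even a single node alive while idle dominates the bill, which is why scaling to zero is the safer default for occasional experiments.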
More Details: See my detailed article on managing Azure compute resources with the Python SDK for more on retrieving and creating compute resources.
Next, we'll take a CSV file, load it into a Pandas dataframe, and register it as a tabular dataset in Azure.
```python
from azureml.core import Dataset
import pandas as pd

# The default datastore is a blob storage container where datasets are stored
datastore = ws.get_default_datastore()

# Load some data into a dataframe (Note: Pandas is just one path into Azure ML)
df = pd.read_csv('my_data.csv')

# Register the dataset
ds = Dataset.Tabular.register_pandas_dataframe(
    dataframe=df,
    name='DataSet-Name',
    description='A description of my dataset',
    target=datastore
)

# Display information about the dataset
print(ds.name + " v" + str(ds.version) + ' (ID: ' + ds.id + ")")
```
More Details: See my detailed article on registering datasets in Azure with the Python SDK for more on this process.
This registers the dataset in Azure, or adds a new version to an existing dataset. Many experiments can be done in the future with this same dataset.
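The versioning behavior is easy to picture with a toy in-memory registry. This is only an illustration of the idea that registering under an existing name adds a version; the `ToyDatasetRegistry` class is hypothetical and says nothing about how Azure stores datasets internally.

```python
class ToyDatasetRegistry:
    """In-memory stand-in for a dataset registry that versions by name."""

    def __init__(self):
        self._versions = {}  # dataset name -> list of registered payloads

    def register(self, name, rows):
        """Store rows under name and return the new version number (starting at 1)."""
        self._versions.setdefault(name, []).append(list(rows))
        return len(self._versions[name])

    def get(self, name, version=None):
        """Return a specific version, or the latest one when version is None."""
        history = self._versions[name]
        return history[-1] if version is None else history[version - 1]

registry = ToyDatasetRegistry()
first = registry.register("DataSet-Name", [{"price": 100}])
second = registry.register("DataSet-Name", [{"price": 100}, {"price": 250}])
print(first, second, len(registry.get("DataSet-Name")))
```

Older versions stay retrievable, so an experiment that was run against version 1 can still be reproduced after version 2 is registered.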
After this comes the fun part. We need to configure how our machine learning experiment should behave:
```python
# Create the configuration for the experiment
from azureml.train.automl import AutoMLConfig

# See https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig?view=azure-ml-py for details
automl_config = AutoMLConfig(
    task='regression',                        # regression, classification, or forecasting
    training_data=ds,                         # The data to use to train the model
    label_column_name='thingIWantToPredict',  # The column we're trying to predict
    n_cross_validations=3,                    # How many cross-validation sets to use
    primary_metric='normalized_root_mean_squared_error',  # The metric we use to compare model performance
    compute_target=cpu_cluster,               # Where the experiment should be run
    max_concurrent_iterations=max_nodes,      # How many models can be trained simultaneously
    iterations=40,                            # The total number of models to train
    iteration_timeout_minutes=5               # The amount of time before giving up on a single model training run
)
```
We select the task we're trying to accomplish (typically regression or classification), give Azure a dataset, and tell it which column we want to predict and which machine learning metric matters most to us when comparing two models.
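The primary metric in the configuration above is normalized root mean squared error. As I understand the Azure documentation, this is the RMSE divided by the range of the actual values, which makes scores comparable across targets with different scales. Here is a minimal sketch of that calculation; treat the exact normalization as an assumption and check the docs if it matters for your work.

```python
import math

def normalized_rmse(actual, predicted):
    """RMSE divided by the range (max - min) of the actual values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and the same length")
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    value_range = max(actual) - min(actual)
    if value_range == 0:
        raise ValueError("actual values have zero range; the metric is undefined")
    return math.sqrt(mse) / value_range

# Hypothetical prices and predictions
actual_prices = [10, 20, 30, 40]
predicted_prices = [12, 18, 33, 37]
nrmse = normalized_rmse(actual_prices, predicted_prices)
print(round(nrmse, 4))
```

Lower is better for this metric, and a perfect model scores 0.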
We can then configure information on cross validation, how many models to try, the compute resource to use, and how many different nodes in the cluster should be activated.
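If cross-validation is new to you, the idea behind `n_cross_validations=3` is that the data is split into three folds and each fold takes a turn as the validation set while the rest is used for training. The sketch below shows one simple way to produce such splits. It does not shuffle the rows and is not Azure's implementation; the `k_fold_indices` helper is my own.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, validation_indices) for each of n_folds folds."""
    if not 2 <= n_folds <= n_rows:
        raise ValueError("n_folds must be between 2 and n_rows")
    indices = list(range(n_rows))
    fold_size, remainder = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `remainder` folds get one extra row so every row is used
        stop = start + fold_size + (1 if fold < remainder else 0)
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation
        start = stop

folds = list(k_fold_indices(n_rows=10, n_folds=3))
for train_rows, validation_rows in folds:
    print(len(train_rows), validation_rows)
```

Each model is then scored on every fold and the scores are averaged, which gives a steadier estimate than a single train/validation split.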
Once we have this configured, we create and submit our experiment:
```python
from azureml.core.experiment import Experiment
from azureml.widgets import RunDetails

# Find or Create a Machine Learning Experiment in Azure Machine Learning Studio
experiment_name = 'My-Regression-Experiment'
experiment = Experiment(ws, experiment_name)

# Start running the experiment
run = experiment.submit(automl_config)

# Wait for the experiment to complete (displays active details about the run)
RunDetails(run).show()
run.wait_for_completion(show_output=False)
```
This causes Azure to run our experiment and compare the various models it generates until it finds the best performing model based on the validation criteria we specified.
The Azure ML Python SDK also includes widgets that show the progress of a running machine learning experiment, so you can monitor it directly in a Jupyter Notebook in your IDE.
More Details: See my detailed article on confusion matrixes for classification to help understand common metrics in machine learning.
Once a run completes, we can grab the best performing model and details on that run and get access to all of the metrics associated with it.
```python
# Grab the resulting model and best run
best_auto_run, automl_model = run.get_output()

# Display details about the best run
print('Best Run: ' + str(best_auto_run.id))
RunDetails(best_auto_run).show()
```
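One detail worth keeping in mind when comparing runs yourself is that "best" depends on the metric: error metrics should be minimized, while scores like R² should be maximized. Here is a small sketch of that selection logic over plain dictionaries. The run names and metric values are invented, and `pick_best_run` is a hypothetical helper rather than an SDK call.

```python
# Metrics where a smaller value means a better model (not an exhaustive list)
LOWER_IS_BETTER = {
    "normalized_root_mean_squared_error",
    "normalized_mean_absolute_error",
}

def pick_best_run(runs, primary_metric):
    """Return the run whose primary metric is best, honoring the metric's direction."""
    scored = [r for r in runs if primary_metric in r["metrics"]]
    if not scored:
        raise ValueError("no run reported " + primary_metric)
    chooser = min if primary_metric in LOWER_IS_BETTER else max
    return chooser(scored, key=lambda r: r["metrics"][primary_metric])

# Hypothetical results from three child runs
example_runs = [
    {"id": "run_1", "metrics": {"normalized_root_mean_squared_error": 0.12, "r2_score": 0.81}},
    {"id": "run_2", "metrics": {"normalized_root_mean_squared_error": 0.08, "r2_score": 0.90}},
    {"id": "run_3", "metrics": {"normalized_root_mean_squared_error": 0.10, "r2_score": 0.93}},
]

best_by_error = pick_best_run(example_runs, "normalized_root_mean_squared_error")
best_by_r2 = pick_best_run(example_runs, "r2_score")
print(best_by_error["id"], best_by_r2["id"])
```

Notice that the two metrics can disagree about which run wins, which is why the choice of primary metric in the configuration matters.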
From there, we can register this model in Azure so it can be formally deployed in an Azure Container Instance or on Azure Kubernetes Service.
Alternatively, we can also download the files associated with the model to use outside of Azure.
```python
# Register the model in Azure
best_auto_run.register_model(
    model_name='My-AutoML-Model',
    model_path='outputs/model.pkl',
    description='A model I trained with Python Code')

# Download the files associated with the best run
best_auto_run.download_files(output_directory='automl-output')
```
If we like a model, we can deploy it directly from code to either Azure Container Instances or Azure Kubernetes Service, with several different authentication and scaling options available.
```python
from azureml.core import Environment
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

# Load the environment from the YAML file downloaded from the best run
env = Environment.from_conda_specification("AutoML-env", "automl-output/outputs/conda_env_v_1_0_0.yml")

# Create an inference config pointing at the files we downloaded.
# This configuration tells Azure how to handle requests
inference_config = InferenceConfig(environment=env,
                                   source_directory='./automl-output/outputs',
                                   entry_script='./scoring_file_v_2_0_0.py')

# The deployment configuration configures how the endpoint is hosted
deployment_config = AciWebservice.deploy_configuration(
    cpu_cores=1,
    memory_gb=1,
    enable_app_insights=True,
    auth_enabled=False)

# Model.deploy expects registered Model objects, so look up the model we registered by name
registered_model = Model(ws, name='My-AutoML-Model')

# Deploy the model
service = Model.deploy(ws, "endpoint-name", [registered_model], inference_config, deployment_config, overwrite=True)
service.wait_for_deployment(show_output=True)

# Grab our scoring endpoint for testing
print('Endpoint active at ' + service.scoring_uri)
```
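Once the endpoint is live, anyone with the scoring URI can call it over HTTP. Below is a rough sketch using only the standard library. The `{"data": [...]}` payload shape is an assumption on my part; the scoring script generated for your model defines the real input format, so check that file before relying on this. The helper names are my own.

```python
import json
import urllib.request

def build_scoring_request(scoring_uri, rows, api_key=None):
    """Build (but do not send) a JSON POST request for a scoring endpoint."""
    # Assumed payload shape; confirm against the generated scoring script
    body = json.dumps({"data": rows}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the service was deployed with authentication enabled
        headers["Authorization"] = "Bearer " + api_key
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")

def score(scoring_uri, rows, api_key=None, timeout=30):
    """Send the request and return the decoded JSON response."""
    request = build_scoring_request(scoring_uri, rows, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the deployment above you would call something like `score(service.scoring_uri, [{"someFeature": 1.5}])`, where `someFeature` is a placeholder for your dataset's actual column names.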
Hopefully this is interesting to you. Stay tuned for more content on machine learning on Azure.