DEV Community

Enterprise Machine Learning Best Practices for AWS SageMaker

Most enterprises prefer cloud Data Science platforms. AWS SageMaker is an industry leader and an analyst-recommended cloud Data Science platform. It offers state-of-the-art Machine Learning components, integrations, and MLOps tooling. As enterprises adopt SageMaker, it is essential to train our Data Scientists and ML Engineers on best practices. These practices help enterprises protect data in motion during the training phase, and measure the cost and ROI of the Data Science process down to individual experiments and projects. This note explores various SageMaker SDK options to secure, manage, and track experiments and inferences.

Tagging Training and Models

Tagging is a critical piece of cost management, security, and resource management in public cloud infrastructure. To Data Scientists and ML Engineers, tagging may not sound significant, but it is worth adopting early in enterprise settings. Eventually, your team will be able to generate meaningful insights about the cost of training and inference from the tags. The team may have to work closely with the Cloud Infrastructure and Operations (I&O) team to set up this process.

A tag is a key-value pair, much like an entry in a JSON object or a Python dictionary. A sample set of tags may look like:

Project: Our Awesome ML Magic
Sponsor: The Cool Manager
Project Lead: The AI Geek
Project Name: The Magic Wand
Cost Center: TheSuperCreditCardwithNoLimit
Contact: we@ourcoolcompany.ai

Let's see how to tag a training job and a deployed model. First, we express our tags in Python. The SageMaker Python SDK takes tags as a list of dictionaries, each with a Key and a Value entry.

my_tags = [{'Key': 'Project', 'Value': 'Our Awesome ML Magic'},
{'Key': 'Sponsor', 'Value': 'The Cool Manager'},
{'Key': 'Project Lead', 'Value': 'The AI Geek'},
{'Key': 'Project Name', 'Value': 'The Magic Wand'},
{'Key': 'Cost Center', 'Value': 'TheSuperCreditCardwithNoLimit'},
{'Key': 'Contact', 'Value': 'we@ourcoolcompany.ai'}]

Now pass the tags to the tags parameter of the Estimator API.

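import sagemaker

# container_image_uri, aws_role, hyperparameters, max_run, and max_wait
# are assumed to be defined earlier in your notebook or script.
est = sagemaker.estimator.Estimator(
    container_image_uri,
    aws_role,
    instance_count=1,                 # named train_instance_count in SDK v1
    instance_type='ml.m5.xlarge',     # named train_instance_type in SDK v1
    base_job_name="the-cool-ml-pipe",
    hyperparameters=hyperparameters,
    use_spot_instances=True,          # managed spot training
    max_run=max_run,                  # maximum training time in seconds
    max_wait=max_wait,                # must be >= max_run with spot instances
    tags=my_tags,
)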

This adds the tags to the AWS artifacts created by the training job. A training job spins up a VM/container, and the compute is charged to the AWS account. With the help of tags, we can isolate the charge for each model experiment as well. Once the model is trained, we can use the deploy API and tag the deployed model too. Yes! The deploy API has a tags parameter for exactly this purpose.
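If your team prefers to keep tags as a plain Python dictionary, a small helper can convert them to the Key/Value list form before they reach the estimator or the deploy call. The sketch below is only an illustration: to_sagemaker_tags is a hypothetical helper, not part of the SageMaker SDK, the limits are the general AWS tagging limits, and the commented deploy call assumes a trained estimator named est.

```python
from typing import Dict, List

MAX_TAGS = 50         # AWS allows up to 50 user tags per resource
MAX_KEY_LEN = 128     # maximum tag key length
MAX_VALUE_LEN = 256   # maximum tag value length


def to_sagemaker_tags(tag_dict: Dict[str, str]) -> List[Dict[str, str]]:
    """Convert {'k': 'v'} into [{'Key': 'k', 'Value': 'v'}] (hypothetical helper)."""
    if len(tag_dict) > MAX_TAGS:
        raise ValueError(f"too many tags: {len(tag_dict)} > {MAX_TAGS}")
    tags = []
    for key, value in tag_dict.items():
        if not key or len(key) > MAX_KEY_LEN:
            raise ValueError(f"invalid tag key: {key!r}")
        if len(value) > MAX_VALUE_LEN:
            raise ValueError(f"tag value too long for key {key!r}")
        tags.append({"Key": key, "Value": value})
    return tags


project_tags = to_sagemaker_tags({
    "Project Name": "The Magic Wand",
    "Contact": "we@ourcoolcompany.ai",
})

# Assumed usage with a trained estimator `est` (not executed here):
# predictor = est.deploy(
#     initial_instance_count=1,
#     instance_type="ml.m5.xlarge",
#     tags=project_tags,
# )
```

Validating the tags before the job starts means a malformed tag fails fast on your machine instead of after the request reaches AWS.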

Security Groups and Subnets

Enterprise data and processes are precious and worth protecting by all means. When creating SageMaker experiments, specify subnets and security groups in the configuration so that the job runs inside your own VPC, away from threats on the public internet. A security group filters incoming and outgoing traffic for AWS resources. The Estimator API accepts the subnets and security groups as lists through the subnets and security_group_ids parameters. Irrespective of how sensitive the data is, it is advisable to set both. If none are specified, AWS runs your training job in a separate, SageMaker-managed VPC, so the steps discussed here are an additional measure of security.
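As a rough sketch, the network settings can be collected in one place and sanity-checked before they are passed to the estimator. The helper vpc_kwargs and the subnet and security group IDs below are hypothetical placeholders; only the subnets and security_group_ids parameter names come from the Estimator API.

```python
import re
from typing import Dict, List

# AWS resource IDs use a prefix followed by 8 or 17 hexadecimal characters.
SUBNET_RE = re.compile(r"^subnet-(?:[0-9a-f]{8}|[0-9a-f]{17})$")
SG_RE = re.compile(r"^sg-(?:[0-9a-f]{8}|[0-9a-f]{17})$")


def vpc_kwargs(subnets: List[str], security_group_ids: List[str]) -> Dict[str, List[str]]:
    """Build the VPC-related Estimator arguments (hypothetical helper)."""
    if not subnets or not security_group_ids:
        raise ValueError("at least one subnet and one security group are required")
    bad = [s for s in subnets if not SUBNET_RE.match(s)]
    bad += [g for g in security_group_ids if not SG_RE.match(g)]
    if bad:
        raise ValueError(f"malformed IDs: {bad}")
    return {"subnets": list(subnets), "security_group_ids": list(security_group_ids)}


# Placeholder IDs; replace them with the values agreed with your cloud team.
network = vpc_kwargs(
    subnets=["subnet-0123456789abcdef0"],
    security_group_ids=["sg-0123456789abcdef0"],
)

# Assumed usage (not executed here): pass **network to
# sagemaker.estimator.Estimator alongside the other arguments.
```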

Encrypt Container Traffic

When using advanced algorithms such as Deep Learning, distributed training is often inevitable. During distributed training, the containers exchange traffic with each other. It is better to encrypt this inter-container traffic; remember, it can slightly slow down your training process. The extra cost and delay introduced by encryption are worth the security of your data. The developer needs to set encrypt_inter_container_traffic to True (by default, it is False).
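One possible way to apply this, sketched under the assumption that encryption only matters when more than one instance takes part in training: derive the flag from the instance count so that single-instance jobs do not pay the overhead. The helper distributed_kwargs is hypothetical; instance_count and encrypt_inter_container_traffic are the Estimator parameter names.

```python
from typing import Any, Dict


def distributed_kwargs(instance_count: int) -> Dict[str, Any]:
    """Return Estimator arguments for a job with `instance_count` instances.

    Inter-container traffic only exists when several instances train
    together, so encryption is switched on for instance_count > 1.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "instance_count": instance_count,
        "encrypt_inter_container_traffic": instance_count > 1,
    }


single_node = distributed_kwargs(1)
multi_node = distributed_kwargs(4)

# Assumed usage (not executed here): pass **multi_node to
# sagemaker.estimator.Estimator alongside the other arguments.
```

A stricter team policy could simply always set the flag to True; the conditional form is a design choice, not a requirement.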

Network Isolation of Containers

During training or inference, we may not need access to the internet for any data. It is advised to store the data in an appropriate data store, such as S3, and pass it to the training job as an input rather than downloading it from the internet in the training script. By doing so, we can avoid any internet traffic to and from our training containers. In the Estimator API, we can set the enable_network_isolation parameter to True.
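A small pre-flight check can catch inputs that would not be reachable from an isolated container. The sketch below assumes the input channels are described as a dictionary of channel name to URI; non_s3_inputs and the bucket and URL names are hypothetical.

```python
from typing import Dict, List
from urllib.parse import urlparse


def non_s3_inputs(inputs: Dict[str, str]) -> List[str]:
    """Return the channel names whose URI is not an s3:// location."""
    return [name for name, uri in inputs.items() if urlparse(uri).scheme != "s3"]


# Placeholder channels; the second one points at the public internet.
channels = {
    "train": "s3://my-example-bucket/train/",
    "validation": "https://example.com/validation.csv",
}
offending = non_s3_inputs(channels)

# Estimator argument that switches on network isolation.
isolation_kwargs = {"enable_network_isolation": True}

# Assumed usage (not executed here): once `offending` is empty, pass
# **isolation_kwargs to sagemaker.estimator.Estimator and the channels
# to est.fit(inputs=channels).
```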

Encryption

If the S3 buckets are encrypted with KMS-managed keys for additional security, we have to specify the keys in the Estimator API. The parameters to set are volume_kms_key, which encrypts the storage volume attached to the training instances, and output_kms_key, which encrypts the training output written to S3. Every company has its own key management policies, so coordinate with your cloud team on policies and key usage. Always remember: never expose your keys on the open internet!
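One way to keep key identifiers out of notebooks and repositories is to read them from the environment. This is a sketch only: the helper kms_kwargs and the environment variable names ML_VOLUME_KMS_KEY_ID and ML_OUTPUT_KMS_KEY_ID are assumptions made for illustration, and the key ID is a placeholder.

```python
import os
from typing import Dict, Mapping

# Estimator parameter name -> hypothetical environment variable name.
KMS_ENV_VARS = {
    "volume_kms_key": "ML_VOLUME_KMS_KEY_ID",
    "output_kms_key": "ML_OUTPUT_KMS_KEY_ID",
}


def kms_kwargs(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect KMS key settings from the environment, skipping unset ones."""
    kwargs = {}
    for param, var in KMS_ENV_VARS.items():
        value = env.get(var)
        if value:
            kwargs[param] = value
    return kwargs


# Example with an explicit mapping instead of the real environment.
example_env = {"ML_OUTPUT_KMS_KEY_ID": "1234abcd-12ab-34cd-56ef-1234567890ab"}
encryption = kms_kwargs(example_env)

# Assumed usage (not executed here): pass **kms_kwargs() to
# sagemaker.estimator.Estimator alongside the other arguments.
```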

Cost Optimization

SageMaker provides managed spot instances for training jobs as a cost optimization. Enabling spot instances is good practice because training is an iterative process; we can save some $$ for our SOTA models ;-). Details and examples of how to use spot instances in SageMaker are available at https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html
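The spot settings can be sketched as a small helper that enforces the rule that max_wait must not be smaller than max_run, plus a function that estimates the saving from the training and billable seconds reported for a job. Both helper names and the checkpoint bucket are hypothetical; use_spot_instances, max_run, max_wait, and checkpoint_s3_uri are the Estimator parameter names.

```python
from typing import Any, Dict, Optional


def spot_kwargs(max_run: int, max_wait: int,
                checkpoint_s3_uri: Optional[str] = None) -> Dict[str, Any]:
    """Build Estimator arguments for managed spot training."""
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")
    kwargs: Dict[str, Any] = {
        "use_spot_instances": True,
        "max_run": max_run,     # seconds the job may train
        "max_wait": max_wait,   # seconds including time spent waiting for capacity
    }
    if checkpoint_s3_uri:
        # Checkpoints let an interrupted job resume instead of restarting.
        kwargs["checkpoint_s3_uri"] = checkpoint_s3_uri
    return kwargs


def spot_savings_percent(training_seconds: float, billable_seconds: float) -> float:
    """Percentage saved: the share of training time that was not billed."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


# One hour of training, up to two hours including waiting (placeholder values).
spot = spot_kwargs(3600, 7200, "s3://my-example-bucket/checkpoints/")
savings = spot_savings_percent(training_seconds=1000, billable_seconds=250)
```

Combined with the tags from earlier, these numbers make it possible to report savings per project or cost center.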

MLOps Innovation

These features may not resonate with Data Scientists or ML Engineers, as they relate more to infrastructure. An innovative MLOps team can build automation or homegrown libraries that take these routine tasks off the hands of Data Scientists and ML Engineers.

Happy Hacking!
