Automate provisioning of SageMaker Notebooks using the AWS CDK

#aws #machinelearning #personalize #cdk

Alexa...set a timer for 15 minutes. ⏳

In this post I want to highlight two things:
1.) The announcement of the Amazon Personalize kickstart project
2.) How to automate the provisioning of SageMaker notebooks for your data exploration tasks

🚀 What is the Amazon Personalize kickstart project?

The goal of this project is to give you a kickstart for your personalization journey when building a recommendation engine based on Amazon Personalize. It serves as a reference implementation so you can learn both the concepts and the integration aspects of Amazon Personalize.

You can also use it to build your own recommendation engine on top of best practices and production-ready components based on the AWS Cloud Development Kit.

It will contain all the best practices I have collected over the last year while designing and building recommendation engines, from implementing A/B testing strategies to orchestrating and automating the training workflow of Amazon Personalize. It also aims to provide a good developer experience through sandbox stacks and automated provisioning of SageMaker notebook instances.

The kickstart project is open source and available on GitHub:
https://github.com/cremich/personalize-kickstart/

💻 Wait, you are talking about an AI service. Why do I need to get in touch with SageMaker?

Regardless of which layer of the AWS ML stack you operate on, it is recommended to gather knowledge about both your data and your business domain before you import your historical data. Every recommendation engine project is unique in the data it has to process and in the way the business works.

Your process should start with defining the business problem you want to solve, followed by defining the KPIs you want to improve and framing your ML problem. Then start with data exploration and analysis.

A managed Jupyter notebook in Amazon SageMaker is an excellent place to

ingest and analyze your data,
prepare, clean and transform your data,
start training and tuning your recommendation model candidates.

The Amazon Personalize kickstart project helps you automate the provisioning of individual SageMaker notebooks. It also ensures that a notebook is deleted once you delete your stack, which saves costs.

🚧 The SageMaker notebook construct

The problem it solves

Here is what I have observed in recent projects when this kind of thing is not automated:

You kick off your machine learning project and want to use AI services to solve your problem. Your team members want to do some data exploration and data analysis in the early days. So everyone who owns this task provisions a SageMaker notebook instance, some default IAM execution roles and a bunch of S3 buckets to store the data that needs to be explored.

Your project ends, and usually those manually provisioned resources are forgotten but still cost you money. They might also introduce vulnerabilities due to outdated libraries until you stop your notebook sessions or restart the instances.

The solution it offers

The SageMaker notebook construct provides an automatically provisioned Amazon SageMaker notebook instance for your data analysis and data exploration tasks. Provisioning a SageMaker notebook is optional and not required in all stages and cases. In centrally provisioned dev, staging or production accounts, a SageMaker notebook is usually not needed.

But it will help you in your developer sandbox accounts or stacks. We encapsulate the resources that are needed to operate and run a Sagemaker notebook in a reusable construct.

The construct consists of the actual SageMaker notebook instance, an S3 bucket for your raw data, and an IAM execution role. The role grants the SageMaker service access to get and update data in the S3 bucket. It also includes the managed policy that provides SageMaker full access.

Construct parameters allow you to set the name of the notebook instance, the required EBS volume size as well as the required EC2 instance type.
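To make the moving parts more concrete, here is a minimal, dependency-free sketch of the CloudFormation resources such a construct could bundle. It is not the kickstart project's actual code: the names `NotebookProps`, `buildNotebookResources`, the logical IDs, the inline policy and the choice of `DeletionPolicy` are assumptions for illustration. In a real CDK app you would express the same idea with constructs from `aws-cdk-lib`.

```typescript
// Hypothetical props that mirror the three construct parameters described above.
interface NotebookProps {
  notebookInstanceName: string;
  volumeSizeInGb: number; // EBS volume size in GB
  instanceType: string; // for example "ml.t3.medium"
}

type TemplateResource = {
  Type: string;
  DeletionPolicy?: string;
  Properties: Record<string, unknown>;
};

// Builds the three resources the construct is described to contain:
// a raw-data bucket, an IAM execution role and the notebook instance.
function buildNotebookResources(props: NotebookProps): Record<string, TemplateResource> {
  return {
    RawDataBucket: {
      Type: "AWS::S3::Bucket",
      // Assumption: remove the bucket together with the stack.
      // CloudFormation can only delete a bucket that is already empty.
      DeletionPolicy: "Delete",
      Properties: {},
    },
    ExecutionRole: {
      Type: "AWS::IAM::Role",
      Properties: {
        // Lets the SageMaker service assume this role.
        AssumeRolePolicyDocument: {
          Version: "2012-10-17",
          Statement: [
            {
              Effect: "Allow",
              Principal: { Service: "sagemaker.amazonaws.com" },
              Action: "sts:AssumeRole",
            },
          ],
        },
        // AWS managed policy that provides SageMaker full access.
        ManagedPolicyArns: ["arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"],
        // Get and update objects in the raw-data bucket.
        Policies: [
          {
            PolicyName: "RawDataBucketAccess",
            PolicyDocument: {
              Version: "2012-10-17",
              Statement: [
                {
                  Effect: "Allow",
                  Action: ["s3:GetObject", "s3:PutObject"],
                  Resource: { "Fn::Sub": "${RawDataBucket.Arn}/*" },
                },
              ],
            },
          },
        ],
      },
    },
    Notebook: {
      Type: "AWS::SageMaker::NotebookInstance",
      Properties: {
        NotebookInstanceName: props.notebookInstanceName,
        InstanceType: props.instanceType,
        VolumeSizeInGB: props.volumeSizeInGb,
        RoleArn: { "Fn::GetAtt": ["ExecutionRole", "Arn"] },
      },
    },
  };
}

// Example usage with made-up values.
const sandboxResources = buildNotebookResources({
  notebookInstanceName: "data-exploration",
  volumeSizeInGb: 10,
  instanceType: "ml.t3.medium",
});
console.log(JSON.stringify({ Resources: sandboxResources }, null, 2));
```

Because everything lives in one stack, deleting the stack removes the notebook, the role and the bucket together, provided the bucket has been emptied first.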

⚠️ Please keep in mind that changing some properties, such as the instance name, will result in a resource replacement. According to the CloudFormation resource documentation, a change of the EBS volume size or the instance type won't replace your notebook instance.
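The replacement rule above can be sketched as a small pre-deployment check. This is only an illustration: the helper name `propertiesForcingReplacement` is hypothetical, and the table covers just the three parameters named in this post. Other notebook properties have their own update behavior, so check the CloudFormation documentation for those.

```typescript
// Update behavior for the three construct parameters, as stated in this post.
const updateBehavior = {
  NotebookInstanceName: "replacement",
  InstanceType: "in-place",
  VolumeSizeInGB: "in-place",
} as const;

type TrackedProperty = keyof typeof updateBehavior;
type NotebookSettings = Record<TrackedProperty, string | number>;

// Returns the changed properties that would force a replacement of the notebook instance.
function propertiesForcingReplacement(
  before: NotebookSettings,
  after: NotebookSettings,
): TrackedProperty[] {
  return (Object.keys(updateBehavior) as TrackedProperty[]).filter(
    (key) => before[key] !== after[key] && updateBehavior[key] === "replacement",
  );
}

// Example: growing the volume is an in-place change, renaming is not.
const deployed: NotebookSettings = {
  NotebookInstanceName: "data-exploration",
  InstanceType: "ml.t3.medium",
  VolumeSizeInGB: 10,
};
const resized = propertiesForcingReplacement(deployed, { ...deployed, VolumeSizeInGB: 50 });
const renamed = propertiesForcingReplacement(deployed, { ...deployed, NotebookInstanceName: "exploration-v2" });
```

Here `resized` is empty and `renamed` contains `NotebookInstanceName`, which is the case where anything stored on the old notebook volume would be lost.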

Each developer who uses this construct now benefits from an automated and consistent process, and all resources for data exploration are deleted once they are no longer needed.

Of course, there is always room for additional features, for example if you need a more fine-grained networking setup. Consider this a conceptual starting point and extend it to your needs.

Automate everything

The benefit of automating this process is that you avoid ending up with orphaned resources. In my experience, it is easy to forget over time that some notebooks are still running or that raw data is still sleeping in S3 buckets.

So save your credit card and automate everything 😊

It also enables some other interesting options, like uploading initial data for data exploration or training while provisioning your construct, or connecting a Git repository to your notebook instance to provide shared notebooks to all your data scientists.
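As a rough sketch of the Git option: the CloudFormation notebook instance resource has a `DefaultCodeRepository` property that accepts a Git repository URL or the name of a code repository resource in your account. The helper name `withSharedNotebooks` and the repository URL below are made up for illustration, and private repositories need additional credential setup that is not shown here.

```typescript
type NotebookInstanceProperties = Record<string, unknown>;

// Returns a copy of the notebook properties with a default Git repository attached.
function withSharedNotebooks(
  properties: NotebookInstanceProperties,
  repositoryUrl: string,
): NotebookInstanceProperties {
  return { ...properties, DefaultCodeRepository: repositoryUrl };
}

// Example usage with a hypothetical public repository.
const sharedNotebookProperties = withSharedNotebooks(
  { NotebookInstanceName: "data-exploration", InstanceType: "ml.t3.medium" },
  "https://github.com/example-org/shared-notebooks.git",
);
```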

Alexa says time is up...see you next time. What additional ideas or options do you have in mind? What are your experiences with automating this process? Happy to get your feedback, experience and thoughts in the comments. 👋