Amazon Personalize allows developers with no prior machine learning experience to easily build sophisticated personalization capabilities into their applications. With Personalize, you provide an activity stream from your application, as well as an inventory of the items you want to recommend, and Personalize will process the data to train a personalization model that is customized for your data.
In this tutorial, we will be using the MovieLens dataset, a popular dataset for recommendation research. Specifically, we will use the MovieLens 25M Dataset listed under the Recommended for new research section, which contains 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users. The scripts and code referenced in this tutorial can be found in my GitHub repository.
Download the zip file, navigate to the folder where it is stored, and run the unzip command. You may need to install the unzip package if it is not already installed (see this link); for example, on Ubuntu: sudo apt-get install -y unzip.
$ cd datasets/personalize
$ unzip ml-25m.zip
Important Note on Pricing
Depending on the Personalize recipe used and the size of the dataset, training a Personalize solution can result in a large bill. I learnt this the hard way by not reading the AWS Personalize billing docs properly, which resulted in this exercise costing me over $100.
So I thought I would share the ways in which one could mitigate this and what to look out for when configuring the training solution.
For the purpose of this tutorial, I have used the MovieLens 25M Dataset. However, one could sample a smaller dataset from this or use the MovieLens Latest Datasets recommended for education and development, which is a lot smaller (100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users).
Secondly, it should be noted that if the dataset or model complexity requires it, AWS Personalize will automatically scale up to a more suitable instance. This means more compute resources will be used to complete your jobs faster, which results in a larger bill.
The training hours billed can be broken down into the following components:
- A training hour represents 1 hour of compute capacity using 4 vCPUs and 8 GiB memory
- The number of training jobs created for the solution if HPO is enabled
If one has enabled hyperparameter optimization (HPO) or tuning, the optimal hyperparameters are determined by running many training jobs using different values from the specified ranges of possibilities as described in the docs. In this tutorial, I have used HPO tuning with the following configuration for the training solution:
"hpoResourceConfig": {
"maxNumberOfTrainingJobs": "16",
"maxParallelTrainingJobs": "8"
}
The maxNumberOfTrainingJobs
setting means that a maximum of 16 training jobs can be run, each requiring its own compute resources. In other words, the 560 training hours I was billed were the result of 16 training jobs as well as larger compute resources.
I was wondering how to reduce the cost for future solutions, so I contacted AWS Technical Support. They recommended the following:
- Disable HPO. If you want to tune your model, build something that works first, then optimise later. You can check whether HPO is enabled or disabled by running
aws personalize describe-solution <arn>
- Go over the cheat sheet provided by the Service Team.
There is also no way to force AWS Personalize to use a specific instance type when it scales up to complete the training job faster, so the best cost optimisation in this scenario is to turn off HPO. Once the model is trained, there is no extra cost for keeping it active: the ACTIVE
status is only shown when training is complete, and it does not mean that training is still running, as described in the docs.
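If you are scripting the solution setup rather than using the console, the HPO flag can be checked and kept off from code too. Below is a minimal boto3 sketch; the solution ARN, solution name and recipe are placeholders for illustration, not values from this tutorial's stack.

import boto3

personalize = boto3.client("personalize")

# Equivalent of `aws personalize describe-solution <arn>`: inspect the performHPO flag.
solution = personalize.describe_solution(
    solutionArn="arn:aws:personalize:us-east-1:123456789012:solution/movie-lens"  # placeholder ARN
)["solution"]
print("HPO enabled:", solution.get("performHPO", False))

# HPO is off unless explicitly requested, so omitting performHPO/hpoConfig
# (or setting performHPO=False) keeps training to a single job.
response = personalize.create_solution(
    name="movie-lens-no-hpo",  # placeholder name
    datasetGroupArn=solution["datasetGroupArn"],
    recipeArn="arn:aws:personalize:::recipe/aws-user-personalization",
    performHPO=False,
)
print("Created solution:", response["solutionArn"])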
Loading data into S3
Create an S3 bucket named recommendation-sample-data
and run the following command in the CLI to enable Transfer Acceleration for the bucket. All Amazon S3 requests made by the s3 and s3api AWS CLI commands can then be directed to the accelerate endpoint s3-accelerate.amazonaws.com
. We also need to set the configuration value use_accelerate_endpoint
to true
in a profile in the AWS config file. For further details, please consult the AWS docs.
$ aws s3api put-bucket-accelerate-configuration --bucket recommendation-sample-data --accelerate-configuration Status=Enabled
$ aws configure set default.s3.use_accelerate_endpoint true
In addition to Transfer Acceleration, this AWS article recommends using the CLI for large file uploads, as it automatically performs multipart uploads when the file size is large. We can also set the maximum number of concurrent requests to 20 to use more of the host's bandwidth and resources during the upload. By default, the AWS CLI uses a maximum of 10 concurrent requests.
$ aws configure set default.s3.max_concurrent_requests 20
$ aws s3 cp datasets/personalize/ml-25m/ s3://recommendation-sample-data/movie-lens/raw_data/ --region us-east-1 --recursive --endpoint-url https://recommendation-sample-data.s3-accelerate.amazonaws.com
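If you prefer to do the upload from Python instead of the CLI, the same behaviour (accelerate endpoint, multipart uploads, 20 concurrent requests) can be approximated with boto3. This is only a sketch, and the single file shown is just one of the extracted MovieLens files.

import boto3
from botocore.config import Config
from boto3.s3.transfer import TransferConfig

# Route requests through the Transfer Acceleration endpoint.
s3 = boto3.client(
    "s3",
    region_name="us-east-1",
    config=Config(s3={"use_accelerate_endpoint": True}),
)

# Multipart upload above 8 MB with up to 20 concurrent parts,
# mirroring max_concurrent_requests = 20 in the CLI config.
transfer_config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,
    max_concurrency=20,
)

s3.upload_file(
    "datasets/personalize/ml-25m/ratings.csv",
    "recommendation-sample-data",
    "movie-lens/raw_data/ratings.csv",
    Config=transfer_config,
)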
Finally, we need to add the Glue script and Lambda function to the S3 bucket as well. This assumes the Lambda function is zipped as in lambdas/data_import_personalize.zip
and you have a bucket with the key aws-glue-assets-376337229415-us-east-1/scripts
; if not, adapt the commands accordingly. Run the following commands from the root of the repo:
$ aws s3 cp step_functions/personalize-definition.json s3://recommendation-sample-data/movie-lens/personalize-definition.json
$ aws s3 cp lambdas/trigger_glue_personalize.zip s3://recommendation-sample-data/movie-lens/lambda/trigger_glue_personalize.zip
If you have not configured Transfer Acceleration for the default Glue assets bucket, set use_accelerate_endpoint back to false before running the cp
command as below. Otherwise, you will get the following error:
An error occurred (InvalidRequest) when calling the PutObject operation: S3 Transfer Acceleration is not configured on this bucket
$ aws configure set default.s3.use_accelerate_endpoint false
$ aws s3 cp projects/personalize/glue/Personalize_Glue_Script.py s3://aws-glue-assets-376337229415-us-east-1/scripts/Personalize_Glue_Script.py
CloudFormation Templates
The CloudFormation template for creating the resources for this example is located in this folder. The CloudFormation template personalize.yaml
creates the following resources:
- Glue Job
- Personalize resources (Dataset, DatasetGroup, Schema) and associated Role
- Step Function for orchestrating the Glue and Personalize DatasetImport Jobs and creating a Personalize Solution
- Lambda function and associated Role, for triggering step function execution with S3 event notification.
We can use the following CLI command to create the stack, with the path to the template passed to the --template-body
argument. Adapt this depending on where your template is stored. We also need to pass the CAPABILITY_NAMED_IAM
value to the --capabilities
argument, as the template includes IAM resources (e.g. an IAM role) with custom names such as a RoleName
$ aws cloudformation create-stack --stack-name PersonalizeGlue \
--template-body file://cloudformation/personalize.yaml \
--capabilities CAPABILITY_NAMED_IAM
If successful, we should see the resources created in the Resources tab.
If we run the command as above with just the default parameters, we should see the key-value pairs listed in the Parameters tab as in the screenshot below.
All the services should now be created; for example, navigate to the Step Functions console and click on the state machine named GlueETLPersonalizeTraining
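Rather than watching the console, you can also wait for the stack and list its resources programmatically. A small boto3 sketch, assuming the stack name used above:

import boto3

cloudformation = boto3.client("cloudformation", region_name="us-east-1")

# Block until the stack reaches CREATE_COMPLETE (raises if creation fails or rolls back).
cloudformation.get_waiter("stack_create_complete").wait(StackName="PersonalizeGlue")

# Mirror what the Resources tab shows.
for resource in cloudformation.describe_stack_resources(StackName="PersonalizeGlue")["StackResources"]:
    print(resource["ResourceType"], resource["LogicalResourceId"], resource["ResourceStatus"])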
S3 event notifications
We need to configure S3 event notifications for the training and prediction workflows. For the training workflow, we need an S3-to-Lambda notification, triggered when raw data is loaded into S3, to start the step function execution. For the prediction workflow (batch and realtime), the following configurations are required:
- S3-to-Lambda notification for triggering the Personalize batch job when a batch sample data object is put into the S3 bucket prefix
- S3-to-Lambda notification for triggering the lambda that transforms the output of the batch job added to S3.
- S3 notification to an SNS topic when the output of the lambda transform lands in the S3 bucket. We have configured an email subscriber to the SNS topic (via the email protocol) in CloudFormation, so SNS sends an email to the subscriber address when the event message is received from S3.
To add the bucket event notification for starting the training workflow via Step Functions, run the custom script, passing the --workflow
argument with the value train
. By default, this will send an S3 event when a csv file is uploaded to the movie-lens/batch/results/
prefix in the bucket.
$ python projects/personalize/put_notification_s3.py --workflow train
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:__main__:Lambda arn arn:aws:lambda:........:function:LambdaSFNTrigger for function LambdaSFNTrigger
INFO:__main__:HTTPStatusCode: 200
INFO:__main__:RequestId: X6X9E99JE13YV6RH
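Under the hood, the train configuration boils down to something like the following boto3 call. This is a simplified sketch of the idea rather than the script itself; the Lambda ARN and prefix are placeholders, and note that this API call replaces the bucket's whole notification configuration.

import boto3

s3 = boto3.client("s3")

# Invoke the trigger lambda whenever a csv object lands under the given prefix.
s3.put_bucket_notification_configuration(
    Bucket="recommendation-sample-data",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:LambdaSFNTrigger",  # placeholder
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "movie-lens/raw_data/"},  # placeholder prefix
                            {"Name": "suffix", "Value": ".csv"},
                        ]
                    }
                },
            }
        ]
    },
)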
To add the bucket event notifications for batch/realtime predictions, run the script and pass --workflow
with the value predict
. The default prefixes set for the object event triggers for the S3-to-Lambda and S3-to-SNS notifications can be found in the source code. These can be overridden by passing the respective argument names (see the click options in the source code).
$ python projects/personalize/put_notification_s3.py --workflow predict
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:__main__:Lambda arn arn:aws:lambda:us-east-1:376337229415:function:LambdaBatchTrigger for function LambdaBatchTrigger
INFO:__main__:Lambda arn arn:aws:lambda:us-east-1:376337229415:function:LambdaBatchTransform for function LambdaBatchTransform
INFO:__main__:Topic arn arn:aws:sns:us-east-1:376337229415:PersonalizeBatch for PersonalizeBatch
INFO:__main__:HTTPStatusCode: 200
INFO:__main__:RequestId: Q0BCATSW52X1V299
Note: There is currently no support for notifications to FIFO type SNS topics.
Trigger Workflow for Training Solution
The Lambda function and step function resources in the workflow should already have been created via CloudFormation. We trigger the workflow by uploading the raw dataset to the S3 path for which the S3 event notification is configured, which invokes the Lambda function. This in turn starts the state machine execution, which runs all the steps defined in the definition file.
Firstly, it runs the Glue job to transform the raw data into the schema and format required for importing the interactions dataset into Personalize. The outputs from the Glue job are stored in a different S3 folder from the raw data.
It then imports the interactions dataset into Personalize. A custom dataset group resource and an interactions dataset were already created and defined when creating the CloudFormation stack.
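The import step of the state machine corresponds roughly to a CreateDatasetImportJob call. A hedged boto3 sketch is shown below; the job name, dataset ARN, S3 location and role ARN are placeholders rather than the values created by the stack.

import boto3

personalize = boto3.client("personalize")

# Point Personalize at the Glue output and load it into the interactions dataset.
response = personalize.create_dataset_import_job(
    jobName="movie-lens-interactions-import",  # placeholder
    datasetArn="arn:aws:personalize:us-east-1:123456789012:dataset/movie-lens/INTERACTIONS",  # placeholder
    dataSource={"dataLocation": "s3://recommendation-sample-data/movie-lens/transformed/"},  # placeholder prefix
    roleArn="arn:aws:iam::123456789012:role/PersonalizeS3ImportRole",  # placeholder
)
print(response["datasetImportJobArn"])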
Wait for the solution version to reach the ACTIVE status. Training can take a while, depending on the dataset size and the number of user-item interactions; if using AutoML, it can take longer. The training time (hrs) value is based on 1 hour of compute capacity (by default 4 vCPUs and 8 GiB memory). However, as discussed in the pricing section of this blog, AWS Personalize may automatically choose a more powerful instance type to complete the job more quickly, in which case the computed training hours metric is adjusted upwards, resulting in a larger bill.
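To wait on training from a script rather than refreshing the console, a minimal polling sketch (the solution version ARN is a placeholder):

import time
import boto3

personalize = boto3.client("personalize")
solution_version_arn = "arn:aws:personalize:us-east-1:123456789012:solution/movie-lens/abcd1234"  # placeholder

# Poll until training reaches a terminal state.
while True:
    status = personalize.describe_solution_version(
        solutionVersionArn=solution_version_arn
    )["solutionVersion"]["status"]
    print("Solution version status:", status)
    if status in ("ACTIVE", "CREATE FAILED"):
        break
    time.sleep(60)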
Analysing Traces and Debugging with AWS X-Ray
To diagnose any faults in execution, we can look at the X-Ray traces and logs. You can now view the service map within the Amazon CloudWatch console. Open the CloudWatch console and choose Service map under X-Ray traces from the left navigation pane. The service map indicates the health of each node by colouring it based on the ratio of successful calls to errors and faults. Each AWS resource that sends data to X-Ray appears as a service in the graph. Edges connect the services that work together to serve requests, as detailed here. In the centre of each node, the console shows the average response time and the number of traces that it sent per minute during the chosen time range. A trace collects all the segments generated by a single request.
Choose a service node to view requests for that node, or an edge between two nodes to view requests that traversed that connection. The service map splits the workflow into two trace IDs for every request, with the following groups of segments:
- lambda service and function segments
- step function, glue, personalize segments
You can also choose a trace ID to view the trace map and timeline for a trace. The Timeline view shows a hierarchy of segments and subsegments. The first entry in the list is the segment, which represents all data recorded by the service for a single request. Below the segment are subsegments. This example shows the subsegments recorded by Lambda: Lambda records a segment for the Lambda service that handles the invocation request, and one for the work done by the function, as described here. The function segment comes with subsegments for the following phases:
- Initialization phase: Lambda execution environment is initialised.
- Invocation phase: function handler is invoked.
- Overhead phase: dwell time between sending the response and the signal for the next invoke.
For step functions, we can see the various subsegments corresponding to the different states in the state machine.
Evaluating solution metrics
You can use offline metrics to evaluate the performance of the trained model before you create a campaign and provide recommendations. Offline metrics allow you to view the effects of modifying a solution's hyperparameters or compare results from models trained with the same data. To get performance metrics, Amazon Personalize splits the input interactions data into a training set and a testing set. The split depends on the type of recipe you choose. For USER_SEGMENTATION recipes, the training set consists of 80% of each user's interactions data and the testing set consists of 20% of each user's interactions data. For all other recipe types, the training set consists of 90% of your users and their interactions data, and the testing set consists of the remaining 10% of users and their interactions data.
Amazon Personalize then creates the solution version using the training set. After training completes, Amazon Personalize gives the new solution version the oldest 90% of each user's data from the testing set as input. Amazon Personalize then calculates metrics by comparing the recommendations the solution version generates to the actual interactions in the newest 10% of each user's data from the testing set, as described here.
You can retrieve the metrics for the trained solution version above by running the following script, which calls the GetSolutionMetrics operation with the solutionVersionArn
parameter.
python projects/personalize/evaluate_solution.py --solution_version_arn <solution-version-arn>
2022-07-09 20:31:24,671 - evaluate - INFO - Solution version status: ACTIVE
2022-07-09 20:31:24,787 - evaluate - INFO - Metrics:
{'coverage': 0.1233, 'mean_reciprocal_rank_at_25': 0.1208, 'normalized_discounted_cumulative_gain_at_10': 0.1396, 'normalized_discounted_cumulative_gain_at_25': 0.1996, 'normalized_discounted_cumulative_gain_at_5': 0.1063, 'precision_at_10': 0.0367, 'precision_at_25': 0.0296, 'precision_at_5': 0.0423}
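The evaluation script is essentially a thin wrapper around the GetSolutionMetrics API; a minimal boto3 equivalent (the ARN is a placeholder):

import boto3

personalize = boto3.client("personalize")

# Fetch the offline metrics computed for the trained solution version.
response = personalize.get_solution_metrics(
    solutionVersionArn="arn:aws:personalize:us-east-1:123456789012:solution/movie-lens/abcd1234"  # placeholder
)
for name, value in response["metrics"].items():
    print(f"{name}: {value}")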
The metrics above are summarised below, based on the AWS docs:
- coverage: An evaluation metric that tells you the proportion of unique items that Amazon Personalize might recommend using your model out of the total number of unique items in Interactions and Items datasets. To make sure Amazon Personalize recommends more of your items, use a model with a higher coverage score. Recipes that feature item exploration, such as User-Personalization, have higher coverage than those that do not, such as popularity-count
- mean reciprocal rank at 25: An evaluation metric that assesses the relevance of a model’s highest ranked recommendation. Amazon Personalize calculates this metric using the average accuracy of the model when ranking the most relevant recommendation out of the top 25 recommendations over all requests for recommendations. This metric is useful if you're interested in the single highest ranked recommendation.
- normalized discounted cumulative gain (NDCG) at K: An evaluation metric that tells you about the relevance of your model’s highly ranked recommendations, where K is a sample size of 5, 10, or 25 recommendations. Amazon Personalize calculates this by assigning weight to recommendations based on their position in a ranked list, where each recommendation is discounted (given a lower weight) by a factor dependent on its position. The normalized discounted cumulative gain at K assumes that recommendations that are lower on a list are less relevant than recommendations higher on the list. Amazon Personalize uses a weighting factor of 1/log(1 + position), where the top of the list is position 1. This metric rewards relevant items that appear near the top of the list, because the top of a list usually draws more attention (a small worked example follows this list).
- precision at K: An evaluation metric that tells you how relevant your model’s recommendations are based on a sample size of K (5, 10, or 25) recommendations. Amazon Personalize calculates this metric based on the number of relevant recommendations out of the top K recommendations, divided by K, where K is 5, 10, or 25. This metric rewards precise recommendation of the relevant items as described here
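To make the NDCG definition above concrete, here is a small illustrative implementation for binary relevance. It is not Amazon Personalize's exact internal computation, and the base-2 logarithm is an assumption, since the docs only state a 1/log(1 + position) weighting.

import math

def ndcg_at_k(recommended, relevant, k):
    """NDCG@K for binary relevance: each hit at position p contributes 1/log2(1 + p)."""
    dcg = sum(
        1 / math.log2(1 + position)
        for position, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1 / math.log2(1 + position) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# The user actually interacted with items 10 and 42; the model ranked 42 first and 10 third.
print(ndcg_at_k(recommended=[42, 7, 10, 3, 99], relevant={10, 42}, k=5))  # ~0.92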
In the second part of this blog, we will create a campaign with the deployed solution version and set up API Gateway with Lambda for generating real-time recommendations.