
Making Amazon SageMaker and TensorFlow Work for You

This is a guest post by Chaim Rand, Machine Learning Algorithm Developer at Mobileye. It builds upon the AIM410R session at AWS re:Invent 2019.

Abstract

Under the surface of Mobileye’s (officially known as “Mobileye, an Intel Company”) life-saving driving assistant products are cutting edge AI technologies. At any given time at Mobileye, we may be training scores of Deep Neural Networks (DNN) targeted for the next generation of Advanced Driving Assistant, Autonomous Vehicle, and Road Experience Management products.

This requires vast amounts of infrastructure that is fast, flexible, scalable, and secure. Enter Amazon SageMaker. In this post, I will share some of the details of how we adapted one of our DNNs to SageMaker’s Pipe Mode and the surprising ways in which this accelerated the development cycle.

Prelude

This post is about the Amazon SageMaker service. It is an exciting story about how my team and I ported one of our Deep Learning (DL) training flows to SageMaker, the challenges we encountered along the way, how we overcame them, and the benefits we discovered. It is a story of courage, creativity, and, above all, perseverance.

Amazon SageMaker - Accelerating Machine Learning | Amazon Web Services

Chapter 1: Introduction

The audience I am targeting includes:

  1. Developers trying to decide whether SageMaker is right for them or their company.
  2. Developers who have been tasked with porting their training code to SageMaker and don’t know where to begin or what to expect.
  3. Developers who are knee deep in SageMaker.

The message that I want to deliver to you today, whichever group you may fall into, is that… everything is going to be okay.

Let me start by saying that I am NOT an AWS guy. On the one hand, what that means is that I don’t speak on behalf of AWS. Any allusions that I may make regarding performance or cost are based solely on my own experience and should be verified by your own AWS representative. On the other hand, that means that I am one of you. I am here for you. Feel free to drop me a line sharing your AWS woes, or as I like to call them… your wAWS. I promise to be supportive.

Here are some of the things I like about SageMaker:

  • It offers a secure and scalable environment in which one can essentially spin up as many training sessions as they want.
  • It enables one to freely choose between many different types of training instances with ease.
  • It enables feeding one’s training data directly from Amazon S3, essentially removing any storage space constraints.
  • It enables one to decouple the storage of their training data from the actual training execution.
  • It enables one to run their entire development pipeline in the cloud, from data collection and creation all the way to quantization and deployment.

However, as with any other new framework, adopting SageMaker might require some patience, resilience, and effort.

Make no mistake, the SageMaker documentation is quite good. The APIs are pretty straightforward and there are code samples demonstrating a wide variety of use cases.

At the end of the day, adapting our flow to SageMaker did not require much heavy lifting. While we did face some challenges along the way, we overcame them and gained more and more confidence that SageMaker could work for us. Over the next few minutes, I hope to pass this confidence on to you. Our story is based on using TensorFlow with the SageMaker PipeModeDataset, but I believe that most of what I have to say carries over to any solution based on Pipe Mode usage. When you are using large data sets, using Pipe Mode is the “right” way to train on SageMaker. Of course, this is just my opinion… but it’s true. (Legal disclaimer: what I mean is that ‘I think it’s true’, but it really is!)

My story is told based on TensorFlow version 1.13.1 and SageMaker version 1.23. To the best of my knowledge, my comments are correct as of today (November 2019). Naturally, things may have changed since then.

Chapter 2: SageMaker Pipe Mode

What is Pipe Mode and what is it good for?

An introduction to Pipe Mode

Pipe input mode is one of the main features offered by the SageMaker training environment, and it is said to enable meaningful reductions in both train time and cost. Pipe Mode is a mechanism (based on Linux pipes) for streaming your training data directly from Amazon S3 storage to your training instance.

Accelerate model training using faster Pipe mode on Amazon SageMaker | Amazon Web Services

The previous way of doing this was to download all of the data from S3 to the training instance. This had to be done each time you wanted to spin up a new training session. When working with large data sets, say tens or even hundreds of terabytes, this would cause a significant delay to the training start time. You may also incur significant storage costs, again, for each training instance.

Pipe Mode avoids this by essentially feeding the data directly to the algorithm as it is needed. This means that training can start as soon as the pipe is opened and no local storage is required.

In particular, this has the effect of removing any limitations on the size of your data set. You can store all of your data on S3, with its virtually limitless storage capacity, and not have to worry about local storage constraints or costs.

This means that your data storage and training environment are now decoupled. You can spin up as many training instances as you’d like and have them all point to the same data storage location in S3.

from sagemaker.tensorflow import TensorFlow

tensorflow = TensorFlow(
    entry_point='myscript.py',
    input_mode='Pipe',
    …)

train_data = 's3://sagemaker-path-to-train-data'

tensorflow.fit({'train': train_data})

There is one significant drawback to Pipe Mode. In our pre-SageMaker workflow, we were accustomed to random access control over our data set. In other words, we were able to freely access any sample within our data set, and we counted on this ability to control how data was fed to the training pipeline: to ensure appropriate shuffling of the input data, to boost certain subsets of data, and more. In the next chapters I will go into this in more detail and describe how we solved this.

You might be thinking to yourself, “Linux pipes? Come on… Now I have to manage those?!”

Fear not: SageMaker comes with an implementation of the TensorFlow Dataset interface, the PipeModeDataset, which essentially hides all the low-level details from you.

This provides support for all the TensorFlow operations (preprocessing, boosting, shuffling, etc.) that your heart may desire, and feeds directly into the training pipeline. Nirvana…

import tensorflow as tf
from tensorflow.contrib.data import map_and_batch
from sagemaker_tensorflow import PipeModeDataset

def parse(record):
    feature = {'label': tf.FixedLenSequenceFeature([], tf.int64, allow_missing=True),
               'image_raw': tf.FixedLenFeature([], tf.string)}
    features = tf.parse_single_example(record, feature)
    image = tf.decode_raw(features['image_raw'], tf.uint8)
    label = features['label']
    return {"image": image}, label  # This is what will be fed into your model

def input_fn():
    # Dataset construction, wrapped in the estimator's input function
    ds = PipeModeDataset("train", record_format='TFRecord')
    ds = ds.apply(map_and_batch(parse, batch_size=32, num_parallel_batches=2))
    return ds

Picking a file format

But there was a catch. A fairly significant one. The catch was that we would need to transform all of our training data into one of the data formats supported by PipeModeDataset, which include: text records, TFRecord and Protobuf. Of course, we could choose to use Pipe Mode with our existing data format, but to enjoy the goodness of PipeModeDataset, and save ourselves the headache of implementing the pipe management ourselves, we would need to adopt one of the above formats.

We chose TFRecord (really for no other reason than the abundance of sample code available). For those wary of adopting a new data format, I will mention that TFRecord is TensorFlow’s binary storage format and that converting your data to TFRecord should be fairly simple (examples are abundant online).

For us, the need to transform our data set format actually turned out to be a huge blessing in disguise. Faced with the need to modify our data creation flow, we embarked on a quest to port this stage of the workflow to AWS as well.

This ultimately led to an enormous acceleration in our data creation time (from several days to a couple of hours) and thus to our overall development time.

Chapter 3: Data Preparation… in the Cloud

Having adopted SageMaker pipe input mode and the TFRecord format, you now need to ensure that your training data is prepared accordingly. Let’s go over what that means.

Splitting the data set

In the standard usage of pipe input mode, we set up a pipe by providing an S3 prefix. When the pipe is opened, all of the files that match the given prefix are fed one by one into the pipe. The size of the files may impact the performance of the pipe. File sizes that are too small or too big will almost certainly slow down your training cycle. After a bit of experimentation (and consulting our trusted AWS rep), we settled on a target file size of 100 Megabytes. Thus our first requirement was that the training data be broken down into TFRecord files of roughly 100 Megabytes each.
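For illustration only (this is not our actual data-creation code), here is a rough sketch of converting samples to TFRecord and grouping them into files of roughly 100 Megabytes; the feature names match the parse() function shown earlier, everything else is made up:

import tensorflow as tf

TARGET_SHARD_BYTES = 100 * 1024 * 1024  # aim for ~100 MB per TFRecord file

def to_example(image_bytes, label_list):
    # Serialize one sample with the same feature names used by parse()
    return tf.train.Example(features=tf.train.Features(feature={
        'image_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=label_list)),
    })).SerializeToString()

def write_shards(samples, prefix):
    shard_idx, shard_bytes, writer = 0, 0, None
    for image_bytes, label_list in samples:
        record = to_example(image_bytes, label_list)
        if writer is None or shard_bytes >= TARGET_SHARD_BYTES:
            if writer is not None:
                writer.close()
            # tf.python_io.TFRecordWriter in TF 1.x (tf.io.TFRecordWriter in later versions)
            writer = tf.python_io.TFRecordWriter('%s-%05d.tfrecord' % (prefix, shard_idx))
            shard_idx, shard_bytes = shard_idx + 1, 0
        writer.write(record)
        shard_bytes += len(record)
    if writer is not None:
        writer.close()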

Shuffling the data set

A common practice in the world of Machine Learning (ML) is to shuffle your data before training. In the past, we relied on our ability to randomly access any sample in our data set to ensure appropriate shuffling. However, given the sequential nature of pipe input mode, we could no longer rely on this. Thus, our second requirement was that the training data be appropriately shuffled during preparation.

When you have massive amounts of data, as we do, the task of preparing your data can be quite daunting and time consuming.

Fortunately, we were able to leverage the nearly infinite scale opportunities offered by the AWS Batch service in order to accomplish this in a highly parallel and very efficient manner.

To ensure a sufficiently random shuffling, we employed a two-step process. The first step performed the initial parsing and recording of the data records, and the second step grouped the records into 100 Megabyte TFRecord files in a random fashion. I will not dive any further into the details, as they are pretty use case specific. I will only note that the pay-per-second and spot fleet support that AWS Batch offers can help in reaching cost efficiency.

You will likely need to separate your data into different groups, for example train and test. This is done by using a different prefix for the train and test data and then setting up corresponding pipes in the SageMaker start up script.
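For example, a minimal sketch of setting up two pipes (the test prefix here is made up) might look like this:

train_data = 's3://sagemaker-path-to-train-data'
test_data = 's3://sagemaker-path-to-test-data'   # illustrative prefix

# Each dictionary key becomes its own pipe channel inside the training container,
# e.g. PipeModeDataset('train', ...) and PipeModeDataset('test', ...).
tensorflow.fit({'train': train_data, 'test': test_data})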

Chapter 4: Data Shuffling

We have ensured that the data in S3 is shuffled, but in some cases, we want to reshuffle the data before each data traversal (epoch). This can be accomplished quite trivially when you have access to your full data set, but when it comes to pipe mode with its inherently sequential nature, the solution for this is not immediate.

In order to address this need, we can use SageMaker’s ShuffleConfig class to set up each pipe such that before each data traversal, the order in which the files are fed into the pipe is shuffled. We chose to add an additional level of shuffling, at the training batch level, using the TensorFlow Dataset shuffle function. This function, which is applied to the PipeModeDataset, receives a shuffle window size that causes each successive record to be randomly chosen from the next “window size” elements on the pipe. The window size we chose was dictated by the number of records in each file, while taking care not to add too much memory overhead to the application.

from sagemaker.session import s3_input, ShuffleConfig  # SageMaker Python SDK v1.x

train_data = s3_input(
    's3://sagemaker-path-to-train-data',
    shuffle_config=ShuffleConfig(seed)
)
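And inside the input pipeline, a rough sketch of the additional record-level shuffling applied on top of the PipeModeDataset (the window size here is illustrative, not our exact value):

ds = PipeModeDataset('train', record_format='TFRecord')
ds = ds.shuffle(buffer_size=1024)  # roughly the number of records in each file
ds = ds.apply(map_and_batch(parse, batch_size=32, num_parallel_batches=2))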

The solution above does not give us the same degree of shuffling that we used to have. For example, two records that appear in the same TFRecord file are more likely to appear in close vicinity of one another than at two opposite ends of the data stream. But the three levels of shuffling that I have described (during data creation, ShuffleConfig and TensorFlow shuffle) were more than sufficient for our purposes.

Chapter 5: Managing Your Training Data

Now may be a good time to mention that there is a limitation to the number of pipes you can set up. As of this writing, this limitation stands at 20 pipe channels. You might be asking yourself: “Why the heck would I need any more than twenty? Why would anyone need more than two?” There are often situations in which we want to separate our data into different subsets and manipulate the data differently during train time.

Using pipes to boost under-represented classes

Let me attempt to demonstrate this via the following (made-up) example.

Suppose you are tasked with creating a DNN that identifies cars on the road. You are given 100,000 marked frames and have transformed these into TFRecord files as described above. Now suppose you run a few rounds of training and find that your resultant network succeeds in identifying most cars pretty well, but consistently fails to identify pink cars. You go back to your training data and realize that it is no wonder you are failing to learn pink cars, as you only have 10 training records with pink cars. The solution that you want to attempt is to “boost” the pink cars in your input pipe, meaning that during each epoch, you will feed the 10 records with pink cars twice. If you had free access to your entire data set, that would be a fairly simple task. But how do you do it using pipes?

We need more pink cars!

A terrible solution (again… only my opinion… but unequivocally true) would be to duplicate the ten records in your data set in S3:

  1. This approach could potentially and needlessly blow up the size of your data set.
  2. I can almost guarantee that one day later you will decide that the correct boost rate is 3 not 2. Or is it 5?

An alternative solution is to create a dedicated training pipe for pink cars, and then, in the preprocessing phase of the training, interleave the PipeModeDataset corresponding to the pink cars with the PipeModeDataset corresponding to the rest of the train records, with their appropriate weights, before feeding them to the network. (One way to do this is to use the TensorFlow sample_from_datasets routine.)

ds = tf.contrib.data.sample_from_datasets(datasets, weights)
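To make this concrete, here is a rough sketch of the pink-car boosting idea; the channel names and weights are illustrative, and assume that two corresponding pipe channels were set up in the call to fit():

# One pipe for the rare class, one for everything else
pink = PipeModeDataset('train-pink-cars', record_format='TFRecord').map(parse)
rest = PipeModeDataset('train-rest', record_format='TFRecord').map(parse)

# With 10 pink-car records out of 100,000, a pink weight of roughly twice the
# natural frequency (~0.0002) feeds those records about twice per epoch.
ds = tf.contrib.data.sample_from_datasets([rest, pink], weights=[0.9998, 0.0002])
ds = ds.batch(32)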

Now you might say to yourself: “Great! I have a bunch of free pipes, I’ll use one of them for pink cars”. But a few days later, you realize that you need a different boost parameter for pink trucks, and a different one for black cars at night… and before you know it you have hit the limit.

Before I get into some of the ways that we addressed this issue, I would like to give another example where having multiple pipes can be very useful.

Using pipes for data augmentation

A common practice in ML is to artificially increase your training data set by performing data augmentations. In the olden days, we would apply each one of a fixed set of augmentations to each data record and feed it to the network while ensuring appropriate shuffling. Again, we relied on our access to the full data set, which we did not have when moving to SageMaker Pipe Mode.

One appealing solution was to randomize the augmentation for each input record. However, some networks required us to fix the augmentation type and ensure that each augmentation was applied to each of the records. Another solution could have been to create all of the different augmentations ahead of time. But, once again, this would have been very wasteful and would not have enabled us to play with the augmentation parameters.

We chose to address this requirement by creating N parallel training pipes, where N was the number of different augmentation types. Each corresponding PipeModeDataset was implemented with the corresponding augmentation function, following which all of the pipes were interleaved together before being fed to the network. In this case, it was extremely important to use the ShuffleConfig object we discussed above, to increase the likelihood that the different augmentations of a given record would be spread out rather than bunched together.
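As an illustration, here is a rough sketch of the N-pipe augmentation scheme; the channel names and augmentation functions (flip_fn, blur_fn, noise_fn) are hypothetical, and all channels are assumed to point at the same data in S3:

augmentations = {'train-flip': flip_fn,
                 'train-blur': blur_fn,
                 'train-noise': noise_fn}

datasets = []
for channel, augment in augmentations.items():
    ds = PipeModeDataset(channel, record_format='TFRecord')
    datasets.append(ds.map(parse).map(augment))  # apply this pipe's augmentation

# Interleave the augmented pipes (uniform weights by default) before batching
ds = tf.contrib.data.sample_from_datasets(datasets)
ds = ds.batch(32)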

Keeping pipes under control

Now that you are convinced that there may be situations in which we need more pipes than we are allotted, I will describe one solution we used for decreasing the number of pipes.

One of the alternatives to configuring pipes with an S3 prefix (as described above) is to create and point to a SageMaker manifest file.

Provide Dataset Metadata to Training Jobs with an Augmented Manifest File

In a manifest file, you explicitly point to the list of files that you want to feed into the network. In particular, if there are certain files that you want to be traversed twice, you can simply write them in the manifest file twice. This is a very useful solution for use cases in which we have more boost rates than allotted pipes. It does not, however, solve the need for multiple pipes for augmentations.

data = s3_input(
    's3://path-to-manifest-file',
    s3_data_type='ManifestFile',
    shuffle_config=ShuffleConfig(seed)
)
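For reference, a manifest file is a JSON list whose first element gives a common S3 prefix and whose remaining elements are paths relative to that prefix; listing a file twice feeds it twice per traversal. A sketch with made-up file names:

[
  {"prefix": "s3://path-to-train-data/"},
  "cars-00001.tfrecord",
  "cars-00002.tfrecord",
  "pink-cars-00001.tfrecord",
  "pink-cars-00001.tfrecord"
]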

Let me summarize some of the tips we covered:

  • Try to group your data so that you will not need more than the maximum number of pipes. If you can’t, consider using manifest files.
  • Use the ShuffleConfig setting to shuffle the order of the input files before each traversal.
  • Use the TensorFlow shuffle for additional shuffling at the record level.

Obviously, the details of your own implementation, and whether any of the tips above apply, will depend on the specifics of your use case.

Chapter 6: Multi-GPU training

One of the advantages of using SageMaker is the ability to freely choose a training instance to match the current training job.

Amazon SageMaker Instance Types - Amazon Web Services (AWS)

In particular, we can choose one or more machines with one or more GPUs. There is no shortage of documentation on the different methods and strategies for using multiple GPUs to speed up training. There are multiple considerations that one should take into account when choosing the ideal training instance. (If you aren’t already, you should start by using the SageMaker metrics to view the GPU, CPU and memory utilizations.) There are also multiple ways of adjusting one’s code to multi-GPU training. I wish only to briefly demonstrate how one’s decision to use the SageMaker framework, and, in particular, SageMaker pipe input mode, may bear on some of the decisions regarding multi-GPU implementation.

Setting up multi-GPU training

For some of our training jobs, we found it appropriate to perform data parallelization over multiple GPUs on a single instance, to speed up training. There were two primary libraries we considered for implementing this.

Launching TensorFlow distributed training easily with Horovod or Parameter Servers in Amazon SageMaker | Amazon Web Services

  • Built-in multi-GPU TensorFlow support: If you are using TensorFlow estimators, then this is a very attractive option, as it boils down to just adding a few lines of code and setting the appropriate strategy in a tf.estimator.RunConfig (see the sketch below).
  • The Horovod distributed training framework: Horovod enables you to easily add a wrapping layer to your training code that controls the number of training processes (one per GPU) and ensures appropriate data sharing (gradient sharing) between them. SageMaker supports Horovod configuration directly.

aws-samples/sagemaker-horovod-distributed-training
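For the first option, here is a minimal sketch of what enabling a distribution strategy can look like with a TF 1.13-era API; model_fn and model_dir are placeholders assumed to be defined elsewhere:

import tensorflow as tf

# MirroredStrategy replicates the model on each local GPU and aggregates gradients
strategy = tf.contrib.distribute.MirroredStrategy()

config = tf.estimator.RunConfig(
    train_distribute=strategy,  # distribute training across the instance's GPUs
    model_dir=model_dir)        # model_dir assumed to be defined elsewhere

estimator = tf.estimator.Estimator(
    model_fn=model_fn,          # your model function (placeholder)
    config=config)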

Multi-GPU and Pipe Mode

There is one significant difference in the way these two solutions work. While TensorFlow opens a single input stream which is shared by all GPUs, Horovod wraps the entire training script, including the data input flow. This means that if you are using Horovod to train on an instance with 8 GPUs, you will need to configure 8 times as many pipes as on a single GPU job. Given the limitation on pipes that we mentioned above, you could see how using Horovod may incur some limitations.

Naturally, performance should be the number one consideration when deciding which path to choose (and as we saw, we can sometimes work around the pipe limitation). We found the performance of both frameworks on our DNN to be comparable, and we chose the TensorFlow option due to the pipe limitation.

Optimizing training times

One last tip regarding multi-GPU training before we move on. It is quite common to run training with multiple GPUs, and to run evaluation on a single GPU. Now suppose that your evaluation takes an hour. If you are running evaluation intermittently during training, you will find yourself spending hours utilizing only one GPU on your multi-GPU instance. This is an unforgivable waste of resources, not to mention a huge waste of money.

Consider the following instead. Each time you want to run evaluation, spin up a new single GPU instance on Amazon EC2 from within your training session and launch the evaluation there. Yes, this works, provided that you have installed the SageMaker SDK, which surprisingly is not there by default.

This has the added benefit of reducing the delay to your training (which doesn’t have to wait for evaluation to complete before resuming), and it might be a good idea even in the single GPU case.
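One possible way to set this up (not necessarily exactly how we did it) is to launch the evaluation as a separate single-GPU SageMaker job from the training flow; the entry point, instance type and role below are illustrative:

from sagemaker.tensorflow import TensorFlow

evaluator = TensorFlow(
    entry_point='evaluate.py',             # hypothetical evaluation script
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',   # a single-GPU instance
    input_mode='Pipe',
    role=role)                             # IAM role assumed to be defined

# wait=False returns immediately, so training is not blocked on evaluation
evaluator.fit({'test': 's3://sagemaker-path-to-test-data'}, wait=False)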

Chapter 7: Debugging on SageMaker

I wish I could tell you, dear reader, that once you have transformed your data, configured your training session, ensured appropriate shuffling, and overcome any pipe number limitations, everything will work perfectly. But, alas, as with most everything in life, and certainly in the world of software development, such is not the case.

As always, you are likely to experience crashes, exceptions, training failures and other woes. Just that now, the usual difficulties of debugging and solving such issues are compounded by the fact that you are running in a remote environment.

So, here is golden rule number one… and if you take nothing away from this blog but this, my time will have been well spent.

Always start by running your training session in your local environment

You can stick to a very small subset of your data, even just one or two batches, before running on SageMaker. This will save you lots of time (and money). It’s as simple as that.

Use the Amazon SageMaker local mode to train on your notebook instance | Amazon Web Services
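A minimal sketch of local mode, assuming Docker is available on your machine and an IAM role is defined; note that Pipe Mode itself cannot run locally, so this uses File mode on a small subset:

from sagemaker.tensorflow import TensorFlow

local_estimator = TensorFlow(
    entry_point='myscript.py',
    train_instance_type='local',  # or 'local_gpu' if a local GPU is available
    train_instance_count=1,
    input_mode='File',            # Pipe Mode is not supported in local mode
    role=role)                    # IAM role assumed to be defined

# A file:// URI keeps the tiny training subset on your own machine
local_estimator.fit({'train': 'file://./small-train-subset'})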

The problem is that not all issues can be reproduced this way. Some issues are environment specific, other issues are related to Pipe Mode, which (as of now) cannot be run locally, and yet other issues (such as lack of loss convergence), only come up when training on a large amount of data. Here are some pointers that you might find helpful:

  1. The SageMaker logs (which can be accessed from the console) are probably the first thing to check. There you will get an initial indication of whether something went wrong, and if so, what.
  2. If you suspect you may be facing an issue with Pipe Mode (e.g. low throughput), the first thing you should do is open a ticket with AWS support. You could try adding tf.print statements and whatnot to find the root cause, but the Pipe Mode mechanism is a feature that we do not have much visibility into.
  3. Use TensorBoard to track the performance of your training. You can configure TensorBoard (from the command line) to point directly to the S3 model directory, or (if you have many events) download the event file periodically and run locally.
  4. Use the console to track CPU and GPU utilization metrics. Advanced users can add custom metrics (such as training loss), trigger alarms and apply other CloudWatch techniques.
  5. If you think of additional debugging features that would help you and the community at large, don’t hesitate to submit a feature request!

Chapter 8: Using Spot Instances on SageMaker

Recently, AWS announced support for training in SageMaker on Spot Instances.

Managed Spot Training: Save Up to 90% On Your Amazon SageMaker Training Jobs | Amazon Web Services

Spot Instances let you take advantage of unused compute capacity in the cloud, allowing you to significantly reduce cost. The catch, of course, is that if the machine is suddenly needed by a customer willing to pay the full price, your compute (in our case your training session) will be terminated midway and your training instance will be taken away from you. The good news is that SageMaker will restart your training session as soon as a new Spot Instance is available. Of course, there is no guarantee how long that might take.

The opportunity to reduce cost is quite compelling.

Still, imagine training for a day or two or three, only to have your instance terminated on your last epoch!! Imagine the gut-wrenching, blood-curdling despair.

Of course, there is an easy solution for that, and that is to periodically store checkpoints of your model during training.

If your training algorithm is halted midway, the job simply resumes from the latest stored checkpoint.

This is extremely straightforward when using TensorFlow estimators, which automatically search for an existing checkpoint in the model directory at startup. All that is left for you to do is to decide on the frequency at which you want to store checkpoints.

Using Checkpoints in Amazon SageMaker
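To make this concrete, here is a rough sketch of enabling managed spot training with checkpointing; the parameter values are illustrative and assume a SageMaker Python SDK version that includes managed spot support:

estimator = TensorFlow(
    entry_point='myscript.py',
    input_mode='Pipe',
    train_use_spot_instances=True,   # request spot capacity
    train_max_run=24 * 3600,         # maximum training seconds
    train_max_wait=48 * 3600,        # maximum seconds to wait for spot capacity plus training
    checkpoint_s3_uri='s3://sagemaker-path-to-checkpoints',  # illustrative path
    role=role)                       # IAM role assumed to be defined

# In the training script, checkpoint frequency is set on the estimator's RunConfig,
# e.g. tf.estimator.RunConfig(save_checkpoints_secs=600, keep_checkpoint_max=5)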

But there is another, somewhat more delicate thing to consider. Suppose you have the wild misfortune of having your Spot Instance terminated ten consecutive times, right after you have traversed precisely the first fifth of your data. The net effect is that you have trained your network (for ten epochs) on precisely a fifth of your data. You have not seen the rest of the data at all. You can see why that would be a problem, as your model will be biased towards the data it has seen.

Ideally, you would like to return to the exact location you were at before you were terminated (or more accurately, the location where the last checkpoint was saved), but this is not possible (today) in Pipe Mode.

This problem is alleviated when you use the ShuffleConfig class as we described above. This will ensure that each time the training restarts, it will start from a different location and on a different ordering of the data. This is likely to prevent the danger of developing a bias towards a subset of your data.

My non-binding advice would be to definitely take advantage of Spot Instances to reduce cost, but perhaps consider keeping your critical sessions on regular (non-spot) instances.

Chapter 9: Summary

With that, I have come to the end of my story; the story of how we made SageMaker work for us.

Yes, as with the adoption of any new development environment, we had to go through some hoops and hurdles, especially given the scale at which we operate. We got some unexpected benefits, and I hope I have convinced you that SageMaker can work for you too!
