
Overview of Training the Model using SageMaker

See my previous post, Overview of Building a Model using SageMaker, here.

After selecting a model, the next step of course is training it.


And to be perfectly honest, with smaller data sets you can just train the model in your notebook instance; there's no reason you have to go beyond it. But as your data set grows, or if you need a more codified, repeatable process with many permutations, it makes sense to move to the training part of Amazon's platform.

This is a separate section, once again accessible through the core SageMaker dashboard. Assuming you're working with a medium- or large-sized data set, your notebook won't be enough. By clicking on Training, you're presented with a lot of options,


but at its core, what this does is temporarily spin up a distributed compute cluster to do the training and store the artifacts when it's done.

If you remember, any resource usage on Amazon starts a meter running, much like a taxi. By spinning up many expensive resources and then automatically deleting them, the training feature lets you pay for exactly what you need to get a good chunk of computing done.

So within the job settings, first things first, you have to set a name and security, but then you also have to select a model. You could use one of Amazon's many pre-made models, use one that you've built yourself, or go to the AWS Marketplace, which lets you either use a free one or pay a small subscription fee to use somebody else's pre-made model.
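
If you're scripting this instead of clicking through the console, here's a minimal sketch of that first step using the SageMaker Python SDK. The IAM role ARN is a placeholder, and the built-in K-Means algorithm is just one example of a pre-made model you could pick:

```python
import sagemaker

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder execution role

# Resolve the container image for a built-in algorithm (K-Means here) in your region
image_uri = sagemaker.image_uris.retrieve(
    framework="kmeans",
    region=session.boto_region_name,
)
```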


After selecting the model, or maybe going with a subscription from the marketplace, you're asked to describe how your job will scale. Basically: what type of instance do you want to put behind it, how many instances, and how long should it run? This is more of a DevOps question, but even as a data scientist or data engineer it really helps to have a basic understanding of the right infrastructure for your jobs, because that lets you make intelligent decisions about how big an instance to use and what the implications are.
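
Continuing the sketch, those scaling answers (instance type, instance count, and maximum run time) map to Estimator arguments. The values and bucket name below are purely illustrative:

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=2,               # how many instances
    instance_type="ml.m5.2xlarge",  # what type of instance
    max_run=3600,                   # how long it may run before being stopped (seconds)
    base_job_name="kmeans-training",
    output_path="s3://my-bucket/training-output/",  # placeholder bucket for artifacts
    sagemaker_session=session,
)
```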


After that, you're asked to tune hyperparameters. Now, this is very model-specific and can get confusing. Using a notebook to explore the implications of different hyperparameters upfront is extremely helpful. Just know that these parameters define how the model will behave at a high level, such as how many clusters it should attempt to make.
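
Continuing the sketch, hyperparameters are set on the estimator before the job launches. The K-Means values below are illustrative, not recommendations:

```python
estimator.set_hyperparameters(
    k=10,                # number of clusters the model should attempt to make
    feature_dim=50,      # number of features in each record
    mini_batch_size=500,
)
```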


And finally, of course, you need to select where the data is coming from. For this you're most likely going to select S3, particularly if you built your labeled data set with a previous Ground Truth job or in your notebook.

However, there are other places it could live. This section is pretty straightforward, especially if you've already done the earlier work of making well-labeled training data. Just make sure the job can access this bucket and, very importantly, that it can access the output path, because that's where the results of the training job will be stored. In other words, when the training job is done and the model is fully described, it's stored in S3 between steps so it can be imported by other parts of the pipeline.
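
Rounding out the sketch, pointing the job at labeled data in S3 and kicking it off might look like this; the bucket and prefix are again placeholders, and the channel name and content type depend on the algorithm you chose:

```python
from sagemaker.inputs import TrainingInput

# Point the "train" channel at the labeled data in S3 (placeholder bucket/prefix)
train_input = TrainingInput(
    s3_data="s3://my-bucket/training-data/",
    content_type="text/csv",   # must match the format the algorithm expects
)

# The execution role must be able to read this bucket and write to the
# output path set above; the finished model artifact (model.tar.gz) lands there.
estimator.fit({"train": train_input})
```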


