Setting Up SageMaker to Run Data Wrangler

We start using the Data Wrangler service from SageMaker Studio. The first thing we need to do is set up the right notebook instance for Data Wrangler to run on. From SageMaker Studio, select Create notebook instance and give your instance a name. The most important setting is the instance type: you need at least an ml.m5.4xlarge instance to run the Data Wrangler service.

We have the option to add Elastic Inference, which attaches inference acceleration to a hosted endpoint at a lower cost than a full GPU instance.

Choose that if you want it. We can also limit access rights: we can assign an IAM role, turn off root access, and turn encryption on or off.

We have networking, Git repository, and tag options as well.
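For readers who prefer to script this step, the same settings can be sketched in Python. This is a minimal sketch, not the method used in the walkthrough, which goes through the console. It assumes boto3 is installed and AWS credentials are configured; the helper names `build_notebook_request` and `create_notebook`, the instance name, and the role ARN are all hypothetical. Note that the API spells the instance type with the `ml.` prefix.

```python
from typing import Any, Dict, List, Optional


def build_notebook_request(
    name: str,
    role_arn: str,
    instance_type: str = "ml.m5.4xlarge",
    root_access: bool = False,
    kms_key_id: Optional[str] = None,
    subnet_id: Optional[str] = None,
    security_group_ids: Optional[List[str]] = None,
    tags: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for SageMaker's CreateNotebookInstance call."""
    request: Dict[str, Any] = {
        "NotebookInstanceName": name,
        "InstanceType": instance_type,
        "RoleArn": role_arn,
        # The API expects the strings "Enabled" or "Disabled", not a boolean.
        "RootAccess": "Enabled" if root_access else "Disabled",
    }
    if kms_key_id:
        # Encrypts the attached storage volume with the given KMS key.
        request["KmsKeyId"] = kms_key_id
    if subnet_id and security_group_ids:
        # Networking: place the instance in a VPC subnet.
        request["SubnetId"] = subnet_id
        request["SecurityGroupIds"] = security_group_ids
    if tags:
        request["Tags"] = [{"Key": k, "Value": v} for k, v in tags.items()]
    return request


def create_notebook(request: Dict[str, Any]) -> Dict[str, Any]:
    """Send the request to SageMaker (needs boto3 and valid credentials)."""
    import boto3  # imported lazily so the builder above works without AWS access

    client = boto3.client("sagemaker")
    return client.create_notebook_instance(**request)


if __name__ == "__main__":
    # Hypothetical name and role ARN, for illustration only.
    example = build_notebook_request(
        name="wrangler-demo",
        role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
        tags={"project": "data-wrangler"},
    )
    print(example)
```

Keeping the request builder separate from the API call makes it possible to inspect or review the settings before anything is provisioned.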

Once our notebook instance is running on an ml.m5.4xlarge instance type, we will be able to use the Data Wrangler service.

Now, that takes a little while to provision, as you will find with most of the services in SageMaker Studio when you use them for the first time. Don't be alarmed; it gets much quicker over time. We can see now that we've got one notebook instance in service, so we can go ahead and open the JupyterLab environment. First, let's have a quick look at the Jupyter notebook itself. This is just the raw notebook without the SageMaker interface over the top of it, and it gives us only the basic info. If we open Studio instead, we get our full SageMaker Studio view.
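If you are scripting the setup rather than watching the console, the wait for the instance to come into service can be sketched as a small polling loop. This is an illustrative sketch under stated assumptions: `wait_until_in_service` is a hypothetical helper, and `describe` is any callable that returns a dictionary with a `NotebookInstanceStatus` key, which is the shape boto3's `describe_notebook_instance` response uses.

```python
import time
from typing import Callable, Dict


def wait_until_in_service(
    describe: Callable[[], Dict[str, str]],
    poll_seconds: float = 30.0,
    max_polls: int = 40,
) -> str:
    """Poll until the notebook instance reports InService, or give up."""
    for _ in range(max_polls):
        status = describe()["NotebookInstanceStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("notebook instance failed to start")
        # Still pending: wait before asking again.
        time.sleep(poll_seconds)
    raise TimeoutError("notebook instance did not reach InService in time")


if __name__ == "__main__":
    # Simulated status sequence so the sketch runs without AWS access.
    fake_statuses = iter(["Pending", "Pending", "InService"])
    final = wait_until_in_service(
        lambda: {"NotebookInstanceStatus": next(fake_statuses)},
        poll_seconds=0,
    )
    print(final)
```

Against a real account, `describe` could be something like `lambda: client.describe_notebook_instance(NotebookInstanceName="wrangler-demo")`, where `client` is a boto3 SageMaker client and the instance name is a placeholder.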

So here we are. There are a few ways to start a new data flow: we can use the File, New, Flow command, or we can access it from inside the project manager. A flow is basically a Data Wrangler session. To start one from the menu, go to File, then New, and then New Flow. Choosing a new flow is how we create a new Data Wrangler flow. First of all, we need to give it a name, something that we can recognize. We have these four stages: import, prepare, analyze, and export. The first thing we're going to do is create a data source. Again, it takes a little while to establish the connection to the engine, so be patient the first time you run this. You'll see here that it's not quite available yet; it will be ready in a few minutes. And now we're ready to go.


