
Julien Simon for AWS

Posted on • Originally published at Medium on

Doctor Alice and Cloud Native Bob, my favorite Machine Learning users

Alice and Bob are great, which is just as well because I meet them everywhere I go! They’re passionate, hardworking people who try their best to build great Machine Learning solutions. Unfortunately, a lot of things stand in their way, and slow them down.

It’s my job to listen to and help all the Alices and Bobs out there :) In this post, I’ll try to summarize the challenges that they’re facing, and how they can start solving them with AWS services in general, and Amazon SageMaker in particular.

Tell me about your first GPU cluster.

Believe it or not, I promise that everything you read below is 100% based on numerous customer interactions.

Doctor Alice

Alice has a PhD and works in a large public research lab. She’s a trained data scientist, with a strong background in math and statistics.

She spends her time on large scientific projects (life sciences, physics, etc.), involving bulky data sets (several terabytes, often more). She can write data analysis code in high-level languages like Matlab, R and Python: maybe it’s not great looking, but it does the trick.

Alice generally doesn’t know much about IT and infrastructure, and she honestly doesn’t care at all for these topics. Her focus is on advancing her research, publishing papers, and so on: everything else is “just a tool”.

For her daily work, she can rely on her own powerful (but expensive) desktop workstation: high-end Intel CPU, 64GB RAM, 4TB of local storage, and a couple of mid-end NVIDIA GPUs. She considers herself lucky to have one (not all her colleagues do), and she enjoys the fact that she can work “on her own”. Still, she can only experiment with a fraction of her dataset if she wants to keep training times reasonable.

She runs long workloads every night and every weekend, hoping that they won’t crash… or that no one will turn the power off in her office. When that happens, she sighs and just launches the job again. “What a waste of time and power”, she thinks, but there’s nothing she can do about it.

She tries to maintain the software configuration of her machine herself, as IT doesn’t know much about the esoteric tools she uses. Still, she wishes someone would do that for her: NVIDIA drivers, TensorFlow dependencies and so on feel quite confusing. When something goes wrong, she wastes precious hours fixing “IT stuff” and that’s frustrating.

When Alice wants to run large experiments, she has to use remote servers hosted in the Computing Centre: a farm of very powerful multi-GPU servers, connected to a Petabyte of NAS storage. Of course, she has to share these servers with other researchers. Every week, the team leads meet and try to prioritize projects and workloads: this is never easy, and decisions often need to be escalated to the lab director. Valuable time is wasted, and sometimes experiments cannot complete in time for conference submission deadlines. The director promises that next year’s budget will account for more servers, but even if they get approved, it will take many months to procure and deploy them.

To try and help with workload scheduling, Alice has hired a summer intern to write a simple intranet portal where researchers can manually reserve servers. It kind of works, but sometimes her colleagues forget to release servers… or maybe they’re reluctant to release them because they know they’ll have to wait to get another one. Alice thinks that there must be a better way, but it’s not her job to fix that. Anyway, she tries to make do with what she can get her hands on. She can’t help but think that she’d make much more progress if her experiments weren’t capacity-bound.

Last but not least, Alice has been invited to collaborate with a world-class lab located on another continent. For the last month, she’s been trying to figure out how to share data and infrastructure with them. Connectivity is complicated, and data transfers are almost impossible given the size of the data sets. Again, “IT stuff” stands in the way of her research, and she’s starting to think that there must be a quicker and easier way to do all of this… Maybe that cloud computing thing can help?

Doctor Alice in SageMaker land

After some personal research and weeks of internal discussions, Alice has convinced the lab director to let her experiment with AWS, specifically with that Amazon SageMaker service that looks interesting.

In just a few hours, she’s read the online documentation, quickly created an inexpensive notebook instance and started running some sample notebooks to become familiar with the service and its typical workflow. Even though she doesn’t know much about AWS or Python, she’s confident that the SageMaker SDK is all that she needs, and that she’ll be up to speed in no time. She even found a sample notebook on how to create a custom environment for her beloved R language!

With a little help from IT, Alice uploads a few Terabytes of real-life data to Amazon S3, in order to replicate her desktop environment on Amazon SageMaker. After more reading, she learns that a few lines of code are all it takes to train her model on managed instances, and that she’ll only pay for what she actually uses. She can even train her models on as many p3.16xlarge instances as she needs: each one is equipped with eight NVIDIA V100 GPUs! That’s the same configuration as those large servers in the Computing Centre. She can now create them on demand: no more arguing with other teams.
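To give an idea of what those "few lines" might look like, here is a minimal sketch that only builds the request for the `create_training_job` call of the low-level boto3 SageMaker client. The job name, bucket, role ARN and container image are made-up placeholders, not values from Alice's actual setup, and the SageMaker Python SDK that Alice uses wraps this kind of request in an even shorter form.

```python
def build_training_request(job_name, image_uri, role_arn,
                           train_s3_uri, output_s3_uri,
                           instance_count=1,
                           instance_type="ml.p3.16xlarge"):
    """Build the keyword arguments for boto3's SageMaker create_training_job."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,    # Docker image holding the training code
            "TrainingInputMode": "File",   # "Pipe" would stream data from S3 instead
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,  # more than 1 means distributed training
            "VolumeSizeInGB": 100,
        },
        # Instances are released when the job ends or hits this limit.
        "StoppingCondition": {"MaxRuntimeInSeconds": 24 * 3600},
    }


# Every value below is a hypothetical placeholder.
request = build_training_request(
    job_name="alice-experiment-001",
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/alice-training:latest",
    role_arn="arn:aws:iam::123456789012:role/AliceSageMakerRole",
    train_s3_uri="s3://alice-lab-data/train/",
    output_s3_uri="s3://alice-lab-data/output/",
    instance_count=2,
)

# With AWS credentials configured, the job could then be started with:
#   import boto3
#   boto3.client("sagemaker").create_training_job(**request)
```

Nothing is launched by this snippet: it only prepares the request, so it is safe to run anywhere. Billing stops when the job ends, which is what "pay for what she actually uses" means in practice.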

Thanks to advanced features such as Distributed Training and Pipe Mode, Alice finds it easy to train models on her large data sets: everything works out of the box and scales nicely. Alice is also happy to see that SageMaker includes an Automatic Model Tuning module: thanks to this, she’s able to significantly improve the accuracy of her models in just a few hours of parallel optimization. Doing this with her previous setup would have been impossible due to lack of computing resources.
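As a rough illustration of Automatic Model Tuning, the sketch below builds the `HyperParameterTuningJobConfig` part of a boto3 `create_hyper_parameter_tuning_job` request. The metric name, the `learning_rate` hyperparameter and the job counts are assumptions picked for the example, not Alice's real settings.

```python
def build_tuning_config(metric_name, max_jobs, max_parallel_jobs,
                        learning_rate_range):
    """Build the HyperParameterTuningJobConfig section of a tuning request."""
    low, high = learning_rate_range
    return {
        "Strategy": "Bayesian",
        "HyperParameterTuningJobObjective": {
            "Type": "Maximize",
            "MetricName": metric_name,
        },
        "ResourceLimits": {
            "MaxNumberOfTrainingJobs": max_jobs,           # total budget of jobs
            "MaxParallelTrainingJobs": max_parallel_jobs,  # jobs running at once
        },
        "ParameterRanges": {
            "ContinuousParameterRanges": [
                # The API expects range bounds as strings.
                {"Name": "learning_rate",
                 "MinValue": str(low),
                 "MaxValue": str(high)},
            ],
        },
    }


# Hypothetical values: 20 training jobs in total, 4 at a time.
tuning_config = build_tuning_config(
    metric_name="validation:accuracy",
    max_jobs=20,
    max_parallel_jobs=4,
    learning_rate_range=(0.001, 0.1),
)

# A full request also needs a HyperParameterTuningJobName and a
# TrainingJobDefinition describing the training job to repeat.
```

The two resource limits are where the "few hours of parallel optimization" comes from: the service explores the range with several training jobs running at the same time, which a single workstation cannot do.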

Deploying models is straightforward: Alice has a choice of deploying models to real-time endpoints, or of running batch predictions. The latter is most useful to her, as she needs to infrequently predict vast amounts of data. Again, all it takes is a couple of lines of code: she actually copy-pasted everything from a sample notebook.
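For batch predictions, the request is just as short. Here is a minimal sketch of the arguments for boto3's `create_transform_job`, assuming CSV input split by line; the model name, bucket paths and instance type are placeholders invented for the example.

```python
def build_batch_request(job_name, model_name, input_s3_uri, output_s3_uri,
                        instance_type="ml.m5.xlarge", instance_count=1):
    """Build the keyword arguments for boto3's SageMaker create_transform_job."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,  # a model already registered in SageMaker
        "TransformInput": {
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": input_s3_uri,
            }},
            "ContentType": "text/csv",
            "SplitType": "Line",  # one record per line
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
    }


# Every value below is a hypothetical placeholder.
batch_request = build_batch_request(
    job_name="alice-batch-001",
    model_name="alice-model-001",
    input_s3_uri="s3://alice-lab-data/to-predict/",
    output_s3_uri="s3://alice-lab-data/predictions/",
    instance_count=4,
)

# With AWS credentials configured:
#   import boto3
#   boto3.client("sagemaker").create_transform_job(**batch_request)
```

Unlike a real-time endpoint, the instances only exist for the duration of the job, which suits Alice's infrequent, large prediction runs.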

Last but not least, keeping track of her expenses is easy: the AWS console tells her how much she’s spent, and Alice can also set up budget alerts.

Talking to her IT colleagues, Alice realizes that there’s so much more to AWS. It looks like the lab could use a service called AWS Snowball to easily upload hundreds of Terabytes to Amazon S3! And it doesn’t look really complicated to share that data with other labs and universities who are already AWS customers. Alice is looking forward to the increased pace of collaboration and innovation!

In just a week or so, her whole view on IT has changed. “This cloud stuff is actually very cool”, she says. “IT used to stand in the way, but now AWS is really helping me deliver better results, quicker and cheaper than before”.

Cloud Native Bob

Bob works in a large company. He’s a backend software engineer, and he’s been working with data for as long as he can remember: SQL, NoSQL, Hadoop, Spark and now Machine Learning. Using off-the-shelf libraries (mostly scikit-learn, and a bit of TensorFlow), Bob and his teammates crunch enterprise data to train hundreds of ML models: linear regression, classification, segmentation, etc. The heavy lifting (ETL, cleaning, feature engineering) is run on a large Spark cluster.

Bob’s company has been using AWS for more than five years. All development, QA and production infrastructure are hosted there, and they’re big fans of DevOps and Cloud Native technology. Initially, they built everything on Amazon EC2, because they felt that having full control was important. Two years ago, they moved their Spark cluster to Amazon EMR, and also decided to standardize production with Docker containers: 100% of their workloads are now deployed to a large Amazon EKS cluster, because Kubernetes is “so cool” and also because of the seamless development experience from laptop to production. They’ve set up all their CI/CD toolchain accordingly, and automated everything with Terraform. Cost is optimized with Auto Scaling and Spot Instances. These guys really know their stuff, and they’ve been asked to present their architecture at the next AWS Summit. Hell yeah.

Of course, their ML workloads also run on EKS clusters (mostly on CPU instances, with some GPU). Bob maintains bespoke containers for training and prediction: libraries, dependencies, etc. That takes a bit of time, but he enjoys doing it. He just hopes that no one will ask him to do PyTorch and Apache MXNet too. Bob has heard about the Deep Learning AMI and Deep Learning Containers. Checking that stuff out is on his to-do list, but he doesn’t have time right now.

Initially, Bob wanted to let every data scientist create their own on-demand cluster with a Terraform template, but he became concerned that costs would be hard to manage. Moreover, creating a cluster would take 15–20 minutes, which the team would definitely have complained about. Instead, Bob has set up a large shared training cluster: users can start their distributed jobs in seconds, and it’s just simpler for Bob to manage a single cluster. Auto Scaling is set up, but capacity planning is still needed to find the right mix of Reserved, Spot and On-Demand instances.

Bob has a weekly meeting with the team to make sure they’ll have enough instances… and they also ping him on Slack when they need extra capacity on the fly. Bob tries to automatically reduce capacity at night and on weekends when the cluster is less busy, but he’s quite sure they’re spending too much anyway. Oh well.

Once models have been trained and validated, Bob pushes them to CI/CD and they get deployed automatically to the production cluster. Blue/green deployment, service discovery, Auto Scaling, logs pushed to an Elasticsearch cluster, etc.: the whole nine yards. Bob is very proud of the level of automation he’s built into the system, and rightly so.

Cloud Native Bob in SageMaker land

Bob has watched an AWS video on SageMaker. He likes the fact that training and prediction are based on Docker, and that the containers for the built-in frameworks are open source: no need to maintain his own containers, then.

Migrating the training workloads to SageMaker looks easy enough: Bob could get rid of the EKS training cluster, and let every data scientist train completely on demand. With Spot Instances coming soon, Bob could certainly optimize training costs even more.

As far as prediction is concerned, Bob would consider moving the batch jobs to SageMaker, but not the real-time APIs: these must be deployed to EKS, as per company policy. That’s not a problem, as he can reuse the same containers.

A few weeks later, Bob has implemented all these changes. The data science team is now fully autonomous from experimentation to training, which they love: capacity is never a problem anymore, and the weekly meeting has become unnecessary. And because SageMaker provides a traceable history of trained models, Bob finds it easier to automate model retraining and redeployment: previously, he’d have to ask the data science team to do it, as the process was too uncertain for him to handle.

Advanced features like Distributed Training, Pipe Mode and Automatic Model Tuning are also pretty sweet. The data science team has adopted them very quickly, and this saves them a lot of time. Plus, they no longer have to maintain the kludgy code they had written to implement something similar.

There’s only one problem: now, Bob has to update his AWS Summit slides to account for the SageMaker use case. Maybe the local Evangelist can help him with that? ;)

As always, thank you for reading. Happy to answer questions here or on Twitter (DMs are open). Feel free to share your own anecdotes too!

I raise my horn to the Alices and Bobs of the world. Keep building, my M(eta)L brothers and sisters :)
