
Karthik Sakthivel


Amazon EKS support in Amazon SageMaker HyperPod to scale foundation model development

What's new at AWS 📒

☑ #Amazon EKS support in Amazon SageMaker HyperPod to scale foundation model development

☑ Customers can now run and manage their Kubernetes workloads on SageMaker HyperPod, purpose-built infrastructure for foundation model (FM) development that reduces the time to train models by up to 40%.

☑ Many customers use Kubernetes to orchestrate their ML workflows because of its portability, scalability, and rich ecosystem of tools. However, managing hardware failures is not automated.

☑ With this launch, customers can run deep health checks during cluster creation and automate recovery from hardware failures during ML training and fine-tuning.

☑ In addition, HyperPod automatically replaces faulty nodes (keeping clusters self-healing and performant) and resumes training from the last checkpoint on both AWS Trainium and NVIDIA GPUs, at a scale of more than a thousand accelerators.

☑ EKS-orchestrated HyperPod clusters also integrate with CloudWatch Container Insights to provide out-of-the-box observability, with health status checks and visual dashboards.

☑ Customers can use the HyperPod CLI, or their preferred tools, to submit, manage, and monitor workloads.

☑ What is Amazon EKS:

 ➰ An AWS-managed Kubernetes service for running Kubernetes in the AWS cloud and in on-premises data centers.

 ➰ It automatically manages the availability and scalability of the Kubernetes control plane nodes, along with other key cluster management tasks.

 ➰ As an added advantage, Amazon EKS integrates with AWS services such as Elastic Load Balancing, IAM, VPC, and CloudTrail.
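
Resuming "from the last checkpoint", as mentioned in the HyperPod points above, only works if the training script itself saves and reloads its state. Here is a minimal, framework-free sketch of that pattern in Python. It is an illustration under stated assumptions, not HyperPod's actual mechanism: the function names, the JSON checkpoint format, the `fail_at` parameter (used only to simulate a hardware failure), and the placeholder loss value are all hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then atomically swap it into place,
    so a crash mid-write never leaves a corrupt checkpoint behind."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, save_every=10, fail_at=None):
    """Toy training loop that always starts from the last checkpoint.

    fail_at is a hypothetical knob that simulates a hardware failure
    at the given step by raising an exception.
    """
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated hardware failure")
        # Placeholder for a real training step.
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If the first run dies partway through, calling `train` again with the same checkpoint path picks up from the last saved step instead of step 0. In a real job the checkpoint would hold model and optimizer state and would live on storage that survives node replacement.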

📌 Explore more about EKS: https://aws.amazon.com/eks/

📌 Explore more about SageMaker HyperPod: https://aws.amazon.com/blogs/aws/amazon-sagemaker-hyperpod-introduces-amazon-eks-support/
