DEV Community

Karthik Sakthivel


Amazon EKS support in Amazon SageMaker HyperPod to scale foundation model development

What's new at AWS 📢

☑ #Amazon EKS support in Amazon SageMaker HyperPod to scale foundation model development

☑ This launch lets customers run and manage their Kubernetes workloads on SageMaker HyperPod, purpose-built infrastructure for foundation model (FM) development that reduces the time to train models by up to 40%.

☑ Many customers use Kubernetes to orchestrate their ML workflows because of its portability, scalability, and rich ecosystem of tools. However, handling hardware failures is not automated.

☑ With this launch, customers can run deep health checks during cluster creation and get automated handling of hardware failures during ML training and fine-tuning.

☑ In addition, HyperPod automatically replaces faulty nodes (self-healing, performant clusters) and resumes training from the last checkpoint on both AWS Trainium and NVIDIA GPUs, at a scale of more than a thousand accelerators.

☑ EKS-orchestrated HyperPod clusters also integrate with CloudWatch Container Insights to provide out-of-the-box observability, including health status checks and visual dashboards.

☑ Customers can use the HyperPod CLI, or their preferred tools, to submit, manage, and monitor workloads.

☑ What is Amazon EKS:

 ➰ An AWS-managed Kubernetes service for running Kubernetes in the AWS cloud as well as in on-premises data centers.

 ➰ It automatically manages the availability and scalability of the Kubernetes control plane nodes, along with other major cluster management tasks.

 ➰ Amazon EKS integrates with AWS services such as Elastic Load Balancing, IAM, VPC, and CloudTrail, which is an added advantage.
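
The "resumes training from the last checkpoint" point above typically assumes that the training job itself can find and load its most recent checkpoint when it is restarted. Here is a minimal, framework-free sketch of that pattern in Python. It is not HyperPod's or EKS's API: the function names (`latest_checkpoint`, `save_checkpoint`, `train`), the `step_*.json` file naming, and the `fail_at` parameter used to simulate a node failure are all hypothetical and only for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def latest_checkpoint(ckpt_dir: Path):
    """Return the checkpoint file with the highest step number, or None."""
    candidates = sorted(
        ckpt_dir.glob("step_*.json"),
        key=lambda p: int(p.stem.split("_")[1]),
    )
    return candidates[-1] if candidates else None


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, ckpt_dir / f"step_{step:08d}.json")


def train(ckpt_dir: Path, total_steps: int, save_every: int = 10, fail_at=None) -> dict:
    """Toy loop: 'state' is a running sum standing in for model weights."""
    ckpt = latest_checkpoint(ckpt_dir)
    if ckpt is not None:
        # Resume: pick up the step counter and state from the last checkpoint.
        data = json.loads(ckpt.read_text())
        step, state = data["step"], data["state"]
    else:
        step, state = 0, {"total": 0}

    while step < total_steps:
        if fail_at is not None and step == fail_at:
            # Hypothetical stand-in for a hardware fault on the node.
            raise RuntimeError("simulated node failure")
        step += 1
        state["total"] += step
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(ckpt_dir, step, state)
    return {"step": step, "state": state}
```

In this sketch, a run that fails partway through loses only the work done since the last saved step. Calling `train` again with the same checkpoint directory continues from that step instead of starting over, which is the behaviour an automatic restart on a replacement node would rely on.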

📌 Explore more about EKS: https://aws.amazon.com/eks/

📌 Explore more about SageMaker HyperPod: https://aws.amazon.com/blogs/aws/amazon-sagemaker-hyperpod-introduces-amazon-eks-support/
