ACE Co-innovation Ecosystem

Enabling AI: Announcing the Ray on vSphere Open-Source Plugin

Authors: Ala Dewberry, Senior Product Manager in xLabs, a product incubation program in OCTO, and Sean Huntley, Product Engineer in the Advanced Technologies Group within the Office of the CTO.

In the last year, progress in machine learning and artificial intelligence has been explosive. High-quality generative AI solutions like ChatGPT have sparked public interest that has carried over to the business world. Organizations and individuals alike are considering how they can use this technology to accelerate their impact and delight their customers.

While these general-use models are fantastic, they often fall short in industry-specific use cases. Publicly available training data cannot prepare a model for the niche expertise needed to address use cases unique to each business. To meet these needs, many organizations are investing in tuning and training their own models. To do so, they need to scale their compute footprint beyond an engineer’s laptop or existing build tooling. Data scientists and ML engineers need access both to tools that help them scale their workloads and to computing resources to match.

To meet these challenges, VMware is excited to announce our partnership with Anyscale, the creators of Ray. Ray is a distributed Python workload scheduler optimized for ML workloads, bringing serverless-style scaling to training and inferencing workloads. Ray is widely adopted and performs well for parallel processing and distributed computing.

Anyscale and VMware have partnered to create an open-source plugin to run Ray on vSphere using virtual machines. This plugin enables system administrators to serve data science teams with compute infrastructure that meets their needs. When data science teams have access to compute to run the workloads that power their data exploration, cleaning, and model experimentation, organizations can reduce the time it takes to go from raw data to a differentiated model that furthers the target business outcome. It’s DevOps all over again, but this time the goal is to ship working models to production.

How Does it Work?

A Ray cluster contains a head node and worker nodes.

The head node manages the cluster and scales the number of worker nodes within it. These distributed worker nodes are responsible for training, fine-tuning, and serving models.

To get started, the Head Node’s Autoscaler needs to understand how large a cluster it can provision and where it can provision it. It does this with a Cluster Configuration File.

To make this possible, our plugin extends the Autoscaler to work directly with VMs on vSphere.
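To illustrate the sizing part of the Autoscaler's job, here is a minimal sketch that clamps a requested worker count to configured bounds. It is not Ray's actual autoscaling algorithm: the function name and the `cpus_requested` and `cpus_per_worker` parameters are hypothetical, and `min_workers` and `max_workers` are modeled on the worker limits that a cluster configuration file usually declares.

```python
import math


def target_worker_count(cpus_requested: int, cpus_per_worker: int,
                        min_workers: int, max_workers: int) -> int:
    """Return how many workers to run for the requested CPU demand.

    Hypothetical helper: a simplified stand-in for an autoscaler's
    sizing decision, not Ray's implementation.
    """
    if cpus_per_worker <= 0:
        raise ValueError("cpus_per_worker must be positive")
    if min_workers > max_workers:
        raise ValueError("min_workers cannot exceed max_workers")
    # Round up so that partial demand still gets a whole worker.
    needed = math.ceil(max(cpus_requested, 0) / cpus_per_worker)
    # Never go below the floor or above the ceiling set in the config.
    return max(min_workers, min(needed, max_workers))
```

With bounds of 1 and 5 workers and 4 CPUs per worker, a request for 10 CPUs would yield 3 workers, while a request for 100 CPUs would be capped at 5.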

To orchestrate Ray workloads, the Autoscaler plugin makes calls to a vSphere cluster. A vSphere cluster is a group of hosts whose individual resources are pooled into the resources of the cluster. The cluster manages the resources of all hosts within it. Clusters enable vSphere High Availability (HA) and vSphere Distributed Resource Scheduler (DRS). These features ensure that the Ray cluster is fault-tolerant and isolated from other mission-critical workloads, and that compute resources are optimally allocated.

Configuring a vSphere Provider

The image below shows a sample Ray cluster configuration file for use with vSphere. In the provider section, we set the type to vSphere and supply credentials for the vSphere cluster, along with the datastore on which to deploy the Ray cluster.

Image description: sample Ray cluster configuration file for vSphere

Additionally, in both the head node and worker node configurations, we can target a specific resource pool to isolate Ray nodes from other workloads. As a performance improvement, we may also specify a frozen VM. This VM stays frozen and serves as the source for instant clones, which lets worker nodes scale out rapidly.
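The settings described in this section can be sketched as a Python dictionary with a small pre-flight check. Treat this as an illustration under stated assumptions, not the plugin's schema: every key name, the lowercase `vsphere` type string, and all values (server, user, datastore, resource pool, and frozen VM names) are hypothetical placeholders.

```python
# Hypothetical configuration mirroring the fields described above.
# Key names and values are illustrative assumptions, not the plugin's schema.
cluster_config = {
    "cluster_name": "ray-on-vsphere-demo",
    "provider": {
        "type": "vsphere",
        "credentials": {
            "server": "vcenter.example.com",
            "user": "administrator@example.local",
            "password": "<set-me>",
        },
        "datastore": "example-datastore",
    },
    "head_node": {"resource_pool": "ray-pool"},
    "worker_nodes": {
        "resource_pool": "ray-pool",
        "frozen_vm": "ray-frozen-vm",  # optional: source for instant clones
    },
}


def missing_provider_fields(config: dict) -> list:
    """Report provider settings that are absent or have the wrong type."""
    provider = config.get("provider", {})
    problems = []
    if provider.get("type") != "vsphere":
        problems.append("provider.type must be 'vsphere'")
    if not provider.get("credentials"):
        problems.append("provider.credentials")
    if not provider.get("datastore"):
        problems.append("provider.datastore")
    return problems
```

Calling `missing_provider_fields(cluster_config)` on the sample above returns an empty list. A check like this is only a convenience for catching omissions early; the actual file format is whatever the plugin's documentation specifies.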

What’s Next

What we’ve shared today is just step one. We are currently exploring how to capture unutilized compute to train ML models at quiet times in the data center, enabling organizations to get more value from their infrastructure without endangering production workloads. It’s also great for the planet!

We are ready to welcome the new age of automation with our Ray on vSphere plugin and streamline access to machine learning. Join us on this journey by trying out the plugin once it is available, joining the Slack channel, or emailing us with questions at rayonvmware@vmware.com.
