Technical Background
Green AI has become a key focus area for our AI Infra team. As a crucial part of green AI, our engineering-efficiency work aims to build a high-performance AI engineering system for both offline and online workloads, improving compute efficiency and resource utilization and ultimately conserving resources and reducing carbon emissions.
Today, users submit distributed training jobs with either YARN or Kubeflow/Training-Operator. When submitting a job, they must specify its resources, including the number of nodes for each role and their resource specifications (CPU cores, memory, GPUs, etc.).
After a training job is submitted, it may run into the following issues:
● Cluster resources are insufficient to launch all of the job's nodes, so the job has to wait.
● The job's nodes may be preempted by higher-priority tasks or hit hardware or I/O faults, causing the job to fail.
When these issues occur, users' only recourse is to modify the job's resources and resubmit it.
In response to these issues, Ant Group open-sourced the Kubernetes-based ElasticDL project early on to support fault-tolerant distributed training of TF 2.x on K8s. While putting it into practice, we encountered the following challenges:
● Insufficient allocated resources might cause OOM and poor training performance.
● Users often over-provision resources to ensure task success and speed, resulting in low utilization.
● An increasing number of users are now developing and training models using PyTorch or other frameworks besides TF.
● More and more kinds of distributed clusters, such as Ray and Spark clusters, are starting to run AI jobs. Can elastic training be adapted to arbitrary computing clusters?
● With online learning becoming increasingly prevalent, how can a single system support both online and offline training?
The first two issues keep cluster CPU utilization hovering around 20%, while algorithm developers have to invest considerable effort in manual operations and maintenance. To improve resource efficiency on the training side, we support various offline training modes across different clusters and automatically identify optimal resource configurations for distributed training jobs across different frameworks.
Inspired by the fault-tolerant flexibility of ElasticDL, our AI Infra team has upgraded, extended, and open-sourced DLRover. Its goal is to enhance the intelligence of distributed model training. Currently, many companies run their training jobs on heterogeneous clusters with complex and ever-changing environments.
Star our project on GitHub: https://github.com/intelligent-machine-learning/dlrover
Overall Solution
DLRover introduces the concept of "ML for System" to enhance the intelligence of distributed training. What capabilities should such a system have?
We believe the key capabilities include the following:
- Decoupling: The system should not be tightly coupled with the underlying training frameworks, relying only on abstract interfaces and following the principle of dependency inversion (e.g., Elastic Runtime).
- Resource Scheduling: It should possess a "god's-eye view" for resource scheduling and control, making decisions based on precise job profiling.
- Data-Driven: The system should collect and utilize both cluster resource data and training job data, leveraging data to drive intelligent decisions.
- Job Interaction: With a deep understanding of training jobs and models (white-box approach), the system should dynamically optimize and adjust training jobs based on actual conditions, going beyond simple mechanical fault tolerance.
- Intelligence: By aggregating cluster and job information, and combining algorithmic models with fixed strategies, the system should produce accurate job optimization strategies.
We aim to design and implement a system that completely frees users from the constraints of resource configuration, allowing them to focus on the model training itself. Even without any resource configuration input, DLRover can still provide optimal resource allocation for each training job. Recognizing that users might run their training jobs in different ways, DLRover offers not only Cluster Mode for unified job management across training platforms but also Single-Job Mode, enabling independent algorithm developers to benefit from essential features like elastic fault tolerance.
System Architecture
DLRover is composed of four main components: ElasticJob, Elastic Trainer, Brain Service, and Cluster Monitor.
The diagram above illustrates how DLRover manages deep learning training jobs on a Kubernetes (K8s) cluster. DLRover submits jobs to the cluster in the form of an ElasticJob Custom Resource Definition (CRD). Upon receiving the CRD, the ElasticJob Operator initiates a Master Pod, which acts as the Elastic Trainer. The Elastic Trainer obtains the initial resource plan from the Brain Service. Using this plan, the Elastic Trainer creates a Scale CRD and applies it, notifying the ElasticJob Controller to start the required Pods. Each Pod will then launch an Elastic Agent.
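For concreteness, the sketch below shows how such an ElasticJob resource could be submitted programmatically with the official Kubernetes Python client. The API group/version and spec fields (`replicaSpecs`, role names) are illustrative assumptions rather than DLRover's exact schema; in practice one would typically `kubectl apply` a YAML manifest.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a Pod

# Assumed API group/version and spec layout, for illustration only.
elastic_job = {
    "apiVersion": "elastic.iml.github.io/v1alpha1",
    "kind": "ElasticJob",
    "metadata": {"name": "demo-job", "namespace": "dlrover"},
    "spec": {
        # Hypothetical replica specs: only roles and images are given here;
        # resources can be left out so the initial resource plan fills them in.
        "replicaSpecs": {
            "ps": {"template": {"spec": {"containers": [
                {"name": "main", "image": "my-training-image"}]}}},
            "worker": {"template": {"spec": {"containers": [
                {"name": "main", "image": "my-training-image"}]}}},
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="elastic.iml.github.io",  # assumed CRD group
    version="v1alpha1",             # assumed CRD version
    namespace="dlrover",
    plural="elasticjobs",
    body=elastic_job,
)
```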
During the training process, the Training Master within the Elastic Trainer distributes data shards to the Workers. Meanwhile, the Cluster Monitor keeps track of the running status of each job (i.e., workload on each node) and the overall cluster status (i.e., resource levels). This data is periodically reported back to the Brain, which persists the information in a database.
DLRover's Brain then selects the appropriate algorithm based on the job's running status to generate a new resource plan and notifies the Elastic Trainer to begin resource adjustment.
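Conceptually, a resource plan is just a mapping from node roles to replica counts and per-replica resources. The sketch below is a simplified stand-in for the kind of plan the Brain could hand to the Elastic Trainer, not DLRover's actual data model:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class NodeGroupPlan:
    """Illustrative per-role plan: replica count plus per-replica resources."""
    count: int
    cpu_cores: float
    memory_mb: int

@dataclass
class ResourcePlan:
    """Simplified stand-in for the plan the Brain returns to the Elastic Trainer."""
    node_groups: Dict[str, NodeGroupPlan] = field(default_factory=dict)

# Example: after profiling, the Brain asks for 2 parameter servers and 8 workers.
plan = ResourcePlan(node_groups={
    "ps":     NodeGroupPlan(count=2, cpu_cores=8, memory_mb=16384),
    "worker": NodeGroupPlan(count=8, cpu_cores=4, memory_mb=8192),
})
```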
In summary, DLRover facilitates the automated operation of distributed training jobs within a cluster. It can be viewed as automated orchestration for distributed tasks, allowing model developers to focus solely on algorithm design. The current open-source version of DLRover provides users with the following capabilities:
- Automatic Resource Inference: Helps users automatically initialize training resources, improving resource utilization and job stability.
- Dynamic Data Sharding: Mitigates the bottleneck effect caused by different Worker performance levels by allocating training data according to actual consumption speed. It can work with Failover to record consumption checkpoints, ensuring no data loss.
- Single Point Fault Tolerance: Offers fault tolerance for a single point, without the need for a full restart of the job.
- Resource Elasticity: Supports runtime scaling at the Pod level and CPU/Memory level, with dynamic global optimization decisions.
What DLRover Brings
- Zero Resource Configuration for Jobs
Users do not need to provide any resource information when submitting distributed jobs. DLRover automatically profiles the job and infers the optimal resource configuration. Additionally, during runtime, it can adjust resources automatically based on real-time conditions (cluster resources, sample throughput, current utilization, etc.). Below is a comparison of two job submission scripts:
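As a rough stand-in for that comparison, the fragments below (with purely illustrative field names, not DLRover's exact schema) contrast the two styles: a conventional spec pins replica counts and resources up front, while a DLRover spec can omit them entirely and let the system infer and adjust them.

```python
# Conventional submission (illustrative fields): the user must guess replica
# counts and resources up front, and tends to over-provision "to be safe".
manual_spec = {
    "ps":     {"replicas": 2, "resources": {"cpu": 8,  "memory": "16Gi"}},
    "worker": {"replicas": 8, "resources": {"cpu": 16, "memory": "32Gi"}},
}

# DLRover submission (illustrative fields): replica counts and resources are
# simply omitted; the system profiles the job, fills them in, and keeps
# adjusting them at runtime.
auto_spec = {
    "ps": {},
    "worker": {},
}
```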
- Single-Point Fault Tolerance Enhances Job Stability and Recovery Efficiency
DLRover supports single-point recovery for Parameter Server and Worker failures without requiring a complete job restart. It can seamlessly restart nodes for errors unrelated to user code or data. For instance, a common issue in clusters is out-of-memory (OOM) errors caused by users configuring too little memory. With DLRover, we can automatically spawn a new node with an optimized configuration to replace the failed one. In real-world scenarios, DLRover-managed training jobs improved training success rates from 84% to over 95% compared with baseline Kubeflow TF-Operator jobs.
- Automatic Scaling to Improve Training Performance
DLRover supports automatic adjustment of training resources at both the Parameter Server and Worker levels during runtime to enhance training performance. By monitoring the workload of job nodes, DLRover can identify resource bottlenecks. Common bottlenecks include node preemption, workload imbalance, insufficient CPU limiting compute throughput, and too few nodes. DLRover continuously optimizes training performance through dynamic resource hot updates.
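As a toy illustration of the kind of rule such an optimizer can apply (assumed metric names and thresholds; this is not DLRover's Brain algorithm), a heuristic might look at per-role CPU and memory utilization and suggest an adjustment:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class NodeStats:
    """Illustrative per-node metrics a cluster monitor might report."""
    role: str        # "ps" or "worker"
    cpu_util: float  # fraction of requested CPU actually used, 0..1
    mem_util: float  # fraction of requested memory actually used, 0..1

def suggest_scaling(stats: List[NodeStats]) -> Dict[str, str]:
    """Toy heuristic: flag roles that look memory-bound, CPU-bound, or idle.

    A real optimizer would also weigh sample throughput history, cluster free
    capacity, and preemption risk; this only shows the shape of the decision.
    """
    suggestions: Dict[str, str] = {}
    for role in ("ps", "worker"):
        role_stats = [s for s in stats if s.role == role]
        if not role_stats:
            continue
        avg_cpu = sum(s.cpu_util for s in role_stats) / len(role_stats)
        max_mem = max(s.mem_util for s in role_stats)
        if max_mem > 0.9:
            suggestions[role] = "increase memory per node"
        elif avg_cpu > 0.9:
            suggestions[role] = "add CPU cores or more replicas"
        elif avg_cpu < 0.3:
            suggestions[role] = "shrink CPU requests or replica count"
    return suggestions
```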
Using the Criteo dataset from Kaggle, we trained Wide&Deep and xDeepFM models for 10 epochs on a Kubernetes (K8s) cluster. The results show that DLRover significantly alleviated the aforementioned resource bottlenecks, improving training throughput and reducing overall training time.
- Auto-Scaling to Enhance Resource Utilization for Jobs
Different model training jobs often require varying resource configurations. However, users tend to over-provision resources to ensure job success, leading to significant resource wastage. DLRover's automatic scaling capability dynamically allocates resources based on the actual needs of the job, achieving optimal training performance with minimal resources, thereby reducing resource waste. The following chart compares resource utilization between automatic and manual resource configurations:
- Dynamic Data Distribution to Address Slow Node Issues
In mixed clusters, resource overcommitment and preemption can lead to some nodes consuming data slower than others, causing faster nodes to wait and thereby slowing down the training process. DLRover dynamically distributes less data to slower nodes to reduce waiting times. Additionally, DLRover ensures that training tasks consume data as per user-configured parameters, avoiding duplicate consumption or data loss, which could introduce uncertainty and affect model performance.
During scaling operations, a global coordinator is needed to record each node's data consumption. When a node fails and restarts, this coordinator must know which data has been consumed and which has not. If this logic lived in the training nodes, it would add complexity because nodes would have to coordinate with each other. The DLRover Master serves as this global coordinator.
In summary, from our perspective, dynamic data distribution simplifies the complexity of training node logic. Training nodes only need to fetch shards from the DLRover Master and read the data, without handling additional logic.
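A minimal sketch of this coordinator idea (names and structure are assumptions for illustration, not DLRover's actual sharding API): the master hands out shards, records them as consumed on acknowledgement, and requeues the shard of any worker that disappears, so acknowledged data is never re-read and unacknowledged data is never lost.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional

@dataclass(frozen=True)
class Shard:
    """A contiguous record range of the dataset: rows [start, end)."""
    start: int
    end: int

class ShardCoordinator:
    """Toy master-side coordinator tracking todo / in-flight / consumed shards."""

    def __init__(self, num_records: int, shard_size: int):
        self.todo: Deque[Shard] = deque(
            Shard(s, min(s + shard_size, num_records))
            for s in range(0, num_records, shard_size)
        )
        self.in_flight: Dict[int, Shard] = {}  # worker_id -> shard being processed
        self.consumed: List[Shard] = []        # the consumption "checkpoint"

    def next_shard(self, worker_id: int) -> Optional[Shard]:
        """Hand the next unprocessed shard to a worker, or None when done."""
        if not self.todo:
            return None
        shard = self.todo.popleft()
        self.in_flight[worker_id] = shard
        return shard

    def ack(self, worker_id: int) -> None:
        """The worker finished training on its shard: record it as consumed."""
        self.consumed.append(self.in_flight.pop(worker_id))

    def on_worker_failure(self, worker_id: int) -> None:
        """Requeue the failed worker's shard so it is neither lost nor skipped."""
        shard = self.in_flight.pop(worker_id, None)
        if shard is not None:
            self.todo.appendleft(shard)
```

Because the consumed-shard record lives on the master, a restarted worker simply asks for the next shard again, and slower workers naturally pull fewer shards, which is exactly the dynamic distribution described above.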
- Unified Paradigm for Offline and Online Learning
The dynamic data sharding feature described above helps decouple the data source from the training job. This allows DLRover to support both offline training and the consumption of real-time sample streams for online learning. (You can connect sample streams directly via dlrover.trainer or use it as a training sink node for a stream-processing engine.)
In Ant Group's practical use cases, DLRover serves as an ideal component for building end-to-end online learning systems. DLRover provides solutions for various real-world issues such as data source consumption checkpoint recording and recovery, stability and performance assurance for long-running online learning jobs, and resource utilization optimization. Our open-source repository includes simple examples, and we plan to release more related components in the future.
- Support for Asynchronous and Synchronous Training Modes
Training clusters run various domain-specific training jobs daily. Large-scale sparse models in recommendation systems typically use the PS/Worker architecture for asynchronous parameter updates, with CPU resources being the primary compute resource. Dense models in the CV/NLP domain often perform synchronous training on GPU servers using data parallelism, where only the Worker role exists.
DLRover is designed to support both synchronous and asynchronous update modes, providing a unified solution for various training paradigms.
- Decoupling Training Frameworks from Distributed Training Infrastructure
DLRover allows users to employ any training framework of their choice. The underlying training code interacts with DLRover through agreed-upon API interfaces to enable automatic elastic scaling and other features, without requiring deep interaction with the distributed training code. Once deployed in a cluster, data scientists can seamlessly integrate their algorithms with minimal adjustments.
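One way to picture that agreed-upon interface (the names below are hypothetical, shown only to illustrate the dependency-inversion idea): the training script codes against a tiny abstraction for fetching shards and reporting progress, and the concrete implementation that talks to the DLRover master is injected, so the same outer loop works regardless of the underlying framework.

```python
from abc import ABC, abstractmethod
from typing import Optional, Tuple

class ElasticDataSource(ABC):
    """Hypothetical interface a training script codes against.

    The concrete implementation (talking to the DLRover master, reading a
    local file list, tapping a sample stream, ...) is injected, following
    the dependency-inversion principle mentioned earlier.
    """

    @abstractmethod
    def next_shard(self) -> Optional[Tuple[int, int]]:
        """Return the next (start, end) record range, or None when exhausted."""

    @abstractmethod
    def report_done(self, shard: Tuple[int, int]) -> None:
        """Acknowledge that the shard has been fully trained on."""

def train(data_source: ElasticDataSource, read_records, train_step) -> None:
    """Framework-agnostic outer loop: fetch a shard, train on it, acknowledge it."""
    while (shard := data_source.next_shard()) is not None:
        for batch in read_records(*shard):  # user-supplied I/O, any framework
            train_step(batch)               # user-supplied step, any framework
        data_source.report_done(shard)
```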
Summary & Future Plans
DLRover has already been widely adopted within Ant Group, resulting in over a 15% improvement in cluster resource utilization compared to the baseline. It has also effectively addressed the issue of suboptimal training throughput due to improper resource allocation. By open-sourcing DLRover, we aim to assist our peers in promoting low-carbon, green, and AI-driven initiatives, while also reducing operational costs in model development and freeing up productivity to solve business problems.
Currently, DLRover's optimization algorithms and resource/job profiling strategies are primarily tailored for Ant Group’s internal tech stack. Given the diversity of technical stacks across different organizations, DLRover is designed with a unified interface abstraction at the API layer, allowing flexible customization of specific optimization algorithms and job profiling strategies. We welcome developers from different organizations to join us in co-developing the DLRover project, enhancing and expanding its capabilities to benefit the broader community.
Star our project on GitHub: https://github.com/intelligent-machine-learning/dlrover