Adding a New Partition to a Running CycleCloud SLURM Cluster

Adding a new partition to a running SLURM cluster in Microsoft Azure CycleCloud lets you dedicate a group of nodes to specific workloads and control how jobs are scheduled on them. This guide walks through creating and configuring a new partition in an existing SLURM cluster on Azure CycleCloud.

Overview

SLURM (originally an acronym for Simple Linux Utility for Resource Management) is a popular open-source workload manager used in high-performance computing (HPC) environments. Azure CycleCloud deploys and manages SLURM clusters in Azure and scales their compute resources. By adding a new partition to a running SLURM cluster, you can segment your computational resources into distinct groups, each with its own policies and configurations.

Prerequisites

Before you begin, ensure you have the following:

  1. Access to Azure CycleCloud: You need an active Azure account and access to CycleCloud to manage your SLURM cluster.

  2. Admin Access to the SLURM Cluster: Ensure you have administrative privileges on the SLURM cluster to make configuration changes.

  3. SSH Access to the SLURM Controller: You will need to SSH into the SLURM controller node to modify configuration files.

  4. Understanding of SLURM Concepts: Familiarity with SLURM partitions, nodes, and configurations is helpful for this task.

Steps to Add a New Partition

Follow these steps to add a new partition to a running SLURM cluster in CycleCloud:

Step 1: Access the SLURM Controller Node

  1. SSH into the Controller Node: Use SSH to connect to the SLURM controller node. You can find the public IP address of the controller node in the Azure portal under the CycleCloud cluster resources.

    ssh azureuser@<controller-node-ip>
    
  2. Switch to Root User: Once logged in, switch to the root user to edit SLURM configuration files.

    sudo su -
    

Step 2: Edit the SLURM Configuration File

  1. Locate the Configuration File: The SLURM configuration file is typically located at /etc/slurm/slurm.conf. Open this file in a text editor:

    nano /etc/slurm/slurm.conf
    
  2. Define the New Partition: Add a new partition definition at the end of the configuration file. For example, to add a partition named new_partition, you would add:

    PartitionName=new_partition Nodes=node[0-3] Default=NO MaxTime=INFINITE State=UP
    
- **PartitionName**: Name of the new partition.
- **Nodes**: Specify the nodes that will be part of this partition. Each node must already be defined by a `NodeName` entry in the SLURM configuration. Adjust the node range as needed.
- **Default**: Set to `NO` if this partition should not be the default for new jobs.
- **MaxTime**: Maximum job runtime allowed in this partition. Use `INFINITE` for no limit.
- **State**: Set the initial state of the partition (e.g., `UP` for active).

  3. Save the Configuration File: After adding the new partition, save the changes and exit the text editor.
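
The edit in this step can also be scripted. The following is a minimal sketch, not part of the original procedure: `add_partition` is a hypothetical helper (not a SLURM or CycleCloud tool) that backs up the configuration file and appends the partition line only if no partition with that name is already defined. The file path and partition line in the example call are the ones used in this guide; adjust both to your cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a SLURM or CycleCloud tool): append a partition
# definition to a slurm.conf-style file, keeping a timestamped backup.
add_partition() {
    local conf="$1"   # path to the configuration file
    local name="$2"   # partition name
    local line="$3"   # full PartitionName=... line

    # Skip if a partition with this name is already defined.
    if grep -Eiq "^PartitionName=${name}([[:space:]]|\$)" "$conf"; then
        echo "Partition ${name} already defined in ${conf}" >&2
        return 0
    fi

    # Keep a copy of the original file before changing it.
    cp -p "$conf" "${conf}.bak.$(date +%Y%m%d%H%M%S)"

    # Make sure the existing file ends with a newline before appending.
    if [ -n "$(tail -c1 "$conf")" ]; then
        echo >> "$conf"
    fi
    printf '%s\n' "$line" >> "$conf"
}

# Example call (run as root on the controller node):
# add_partition /etc/slurm/slurm.conf new_partition \
#   "PartitionName=new_partition Nodes=node[0-3] Default=NO MaxTime=INFINITE State=UP"
```

Running the helper twice leaves a single partition line in the file, so it is safe to re-run if you are unsure whether the edit was already made.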

Step 3: Update the SLURM Daemons

  1. Restart the SLURM Controller Daemon: Restart slurmctld so that it rereads slurm.conf and applies the new partition definition.

    systemctl restart slurmctld
    
  2. Verify the Configuration: Run sinfo to list the cluster's partitions and confirm that the new partition is recognized and active.

    sinfo
    

    You should see the new partition listed in the output, indicating that it is ready for use.
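
If you want to check for the partition from a script rather than by reading the sinfo table, something like the following sketch may help. `partition_listed` is a hypothetical helper introduced here for illustration; it only tests whether a name appears as a whole line on standard input. The sinfo options in the commented example (`-h` to suppress the header, `-o '%R'` to print only partition names) are an assumption about how you would feed it. Running `scontrol show partition new_partition` is another way to inspect the partition's definition.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the given partition name appears as a
# whole line on standard input (one partition name per line).
partition_listed() {
    grep -Fxq -- "$1"
}

# Example on the controller node; -h suppresses the header and %R prints
# the partition name:
# sinfo -h -o '%R' | partition_listed new_partition && echo "partition found"
```

Matching the whole line avoids false positives from partitions whose names merely start with the same prefix.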

Step 4: Update Azure CycleCloud Configuration

While not strictly necessary, updating the Azure CycleCloud configuration ensures that any future changes or redeployments reflect the new partition setup.

  1. Access CycleCloud Portal: Log in to the Azure CycleCloud portal.

  2. Edit Cluster Configuration: Navigate to your SLURM cluster and edit its configuration to include the new partition settings.

  3. Save and Apply Changes: Save the configuration and apply changes to update the cluster with the new partition settings.

Step 5: Submit Jobs to the New Partition

With the new partition added, you can start submitting jobs to it. Use the --partition flag in your sbatch or srun commands to specify the target partition.

    sbatch --partition=new_partition my_script.sh
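
The partition can also be requested inside the batch script with an `#SBATCH` directive instead of on the command line. The sketch below writes a minimal `my_script.sh` that does this; the job name, node count, and time limit are illustrative assumptions, not values from this guide.

```shell
#!/usr/bin/env bash
# Write a minimal batch script that requests the new partition itself.
# The job name, node count, and time limit are illustrative assumptions.
cat > my_script.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=partition-test
#SBATCH --partition=new_partition
#SBATCH --nodes=1
#SBATCH --time=00:10:00

# Print the name of the compute node that ran the job.
hostname
EOF

# Submit on the cluster; no --partition flag is needed on the command line:
# sbatch my_script.sh
```

Keeping the partition in the script makes the job self-describing, while the command-line flag is handy for trying the same script on different partitions.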

Learn more: https://techcommunity.microsoft.com/t5/azure-high-performance-computing/add-a-new-partition-to-a-running-cyclecloud-slurm-cluster/ba-p/4209171?wt.mc_id=studentamb_407231

Conclusion

Adding a new partition to a running SLURM cluster in Azure CycleCloud is a straightforward process that enables you to optimize resource allocation and job scheduling. By following these steps, you can efficiently manage your cluster's computational resources and tailor them to meet the needs of specific workloads.

Leveraging the power of Azure CycleCloud and SLURM, you can create a flexible and scalable HPC environment that adapts to your organization's evolving requirements.
