Azure Batch
- Platform to run high-performance computing jobs in parallel at large scale
- Manages cluster of machines and supports autoscaling
- Allows you to install applications that can run as a job
- Schedule and run jobs on cluster machines
- Pay per minute for resources used
How it works
- Pool = cluster of machines/nodes
- Slot = set of resources used to execute a task
- Define number of slots per node
- Increase slots per node to run more tasks concurrently on the same node, improving throughput without adding nodes or cost (see the example at the end of this list)
- Job assigns tasks to slots on nodes
- Application is installed on each node to execute the tasks
- Specify application packages at pool or task level
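- Example: a pool with 10 nodes and 4 slots per node can run up to 40 tasks concurrently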
Configure the batch size
- In the portal (Batch account)
- Choose Pools in the left-side panel
- Add a new pool and name it
- Define the OS image (publisher and sku)
- Choose VM size (determines cores and memory)
- Choose fixed or auto scale for nodes
- If fixed, select number of nodes
- Choose application packages and versions, uploading files if necessary
- Use Mount configuration to mount storage file shares, specifying the account name and access key of the storage account
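As a rough sketch, the same pool setup can be created with the azure-batch Python SDK. All names below (pool, image, application package, storage account, keys) are placeholders, and in older SDK versions `batch_url` is `base_url` and `task_slots_per_node` is `max_tasks_per_node`:

```python
from azure.batch import BatchServiceClient
from azure.batch import models as batchmodels
from azure.batch.batch_auth import SharedKeyCredentials

# Credentials and endpoint come from the Keys section of the Batch account.
credentials = SharedKeyCredentials("mybatchaccount", "<batch-account-key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com"
)

pool = batchmodels.PoolAddParameter(
    id="myapp-pool",                          # placeholder pool name
    vm_size="Standard_D2s_v3",                # VM size determines cores and memory
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        # The image and node agent SKU must match a Batch-supported OS image.
        image_reference=batchmodels.ImageReference(
            publisher="microsoftwindowsserver",
            offer="windowsserver",
            sku="2022-datacenter-core",
            version="latest",
        ),
        node_agent_sku_id="batch.node.windows amd64",
    ),
    target_dedicated_nodes=3,                 # fixed scale: 3 nodes
    task_slots_per_node=4,                    # 4 task slots per node
    application_package_references=[
        batchmodels.ApplicationPackageReference(application_id="myapp", version="1.0")
    ],
    mount_configuration=[
        batchmodels.MountConfiguration(
            azure_file_share_configuration=batchmodels.AzureFileShareConfiguration(
                account_name="mystorageaccount",
                azure_file_url="https://mystorageaccount.file.core.windows.net/data",
                account_key="<storage-account-key>",
                relative_mount_path="S",      # exposed as the S: drive on Windows nodes
            )
        )
    ],
)
batch_client.pool.add(pool)
```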
Trigger batches
- In the portal (Batch)
- Confirm that the pool is in steady state and the nodes are in idle state
- Choose Jobs in the left-side panel and add a new job
- Name the job and select the pool
- Open the job and select Tasks in the left-side panel
- Define name and description
- Enter the command that will run on each node in the Command line box (see the sketch after this list)
- Reference installed packages with %AZ_BATCH_APP_PACKAGE_<appId>#<version>%
- Reference the input path on the mounted file share, e.g. -i S:\<file_path>
- Reference the output path the same way, e.g. S:\<file_path>
- Submit task
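A minimal sketch of the same job and task submission with the azure-batch SDK, reusing the `batch_client` from the pool sketch above; the job ID, package name, and file paths are placeholders for whatever utility is installed on the nodes:

```python
from azure.batch import models as batchmodels

# Create a job that targets the existing pool.
job = batchmodels.JobAddParameter(
    id="myapp-job",
    pool_info=batchmodels.PoolInformation(pool_id="myapp-pool"),
)
batch_client.job.add(job)

# Add a task; the command references the installed application package through
# the AZ_BATCH_APP_PACKAGE_* environment variable and the mounted share (S:).
task = batchmodels.TaskAddParameter(
    id="task-001",
    command_line=r"cmd /c %AZ_BATCH_APP_PACKAGE_myapp#1.0%\myapp.exe -i S:\input\file1.csv S:\output\file1.csv",
)
batch_client.task.add(job_id="myapp-job", task=task)
```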
- In Azure Data Factory and Azure Synapse
- To run a single task in ADF
- Create linked service to Azure Batch
- Need Batch account name, account endpoint, and primary access key from the Keys section in the Batch portal
- Also need the name of the pool
- Create pipeline to run a Custom activity
- Select linked service under the Azure Batch option in the activity settings
- Define the command to execute the utility
- Enter in the Command box under Settings for the activity
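As a hedged sketch, the same linked service and Custom activity could also be created with the azure-mgmt-datafactory Python SDK; resource names are placeholders and model/parameter names may differ slightly between SDK versions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBatchLinkedService, CustomActivity, LinkedServiceReference,
    LinkedServiceResource, PipelineResource, SecureString,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "my-rg", "my-adf"

# Linked service to the Batch account: needs the account name, endpoint, access
# key, pool name, and an existing Azure Storage linked service.
batch_ls = LinkedServiceResource(
    properties=AzureBatchLinkedService(
        account_name="mybatchaccount",
        batch_uri="https://mybatchaccount.westeurope.batch.azure.com",
        pool_name="myapp-pool",
        access_key=SecureString(value="<batch-account-key>"),
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="AzureStorageLS"
        ),
    )
)
adf_client.linked_services.create_or_update(rg, factory, "AzureBatchLS", batch_ls)

# Pipeline with a single Custom activity; Command is what runs on a Batch node.
run_app = CustomActivity(
    name="RunMyApp",
    command=r"cmd /c %AZ_BATCH_APP_PACKAGE_myapp#1.0%\myapp.exe -i S:\input\file1.csv S:\output\file1.csv",
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureBatchLS"
    ),
)
adf_client.pipelines.create_or_update(
    rg, factory, "RunMyAppPipeline", PipelineResource(activities=[run_app])
)
```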
- To run multiple tasks in parallel
- Get list of files using Get Metadata activity in the General option
- Configure data set and linked service with Azure File Storage
- Use the Field list to select Child items
- Use a ForEach activity to iterate through the Child items
- Use dynamic content in the Command to add the filename for each file
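For illustration, the dynamic-content expressions involved typically look like the strings below; the activity name "Get Metadata1" and the command format are assumptions, not from the original notes:

```python
# ADF expression strings (illustrative). The ForEach "Items" setting consumes the
# Get Metadata output, and item().name injects each file name into the command.
foreach_items = "@activity('Get Metadata1').output.childItems"
task_command = "@concat('myapp.exe -i S:\\input\\', item().name, ' S:\\output\\', item().name)"
```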
Handle failed batch loads
- Failure types
- Infrastructure - pool and node errors
- Application - job and task errors
- Pool errors
- Resizing failure - pool is unable to provision a node within the resize timeout window (default is 15 mins)
- Insufficient quota - the account has a limited core quota; if allocation would exceed it, the resize fails (raise a support ticket to increase the quota)
- Scaling failures - an autoscale formula determines scaling, and formula evaluation can fail (check the logs to find the issue)
- Node issues
- App package download failure - node set to unusable, needs to be reimaged
- Node OS updates - tasks can be interrupted by updates, auto update can be disabled
- Node in unusable state - even if the pool reaches steady state, a node can be in an unusable state (VM crash, firewall block, invalid app package) and needs to be reimaged
- Node disk is full
- Rebooting and re-imaging can be done in the Batch portal under Pools
- The Connect option in the portal allows you to use RDP/SSH to connect to the VM
- Define user details
- Set as Admin
- Download RDP file and enter user credentials
- This opens the Server Manager window, where you can navigate the file system to check application package installations
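A diagnostic sketch with the azure-batch SDK, reusing the `batch_client` from earlier (the pool name is a placeholder):

```python
from azure.batch import models as batchmodels

# Pool-level problems: resize errors surface on the pool once allocation settles.
pool = batch_client.pool.get("myapp-pool")
for err in pool.resize_errors or []:
    print("resize error:", err.code, err.message)

# Node-level problems: list nodes, print their errors, and reimage unusable ones.
for node in batch_client.compute_node.list("myapp-pool"):
    if node.state == batchmodels.ComputeNodeState.unusable:
        for err in node.errors or []:
            print(node.id, "node error:", err.code, err.message)
        batch_client.compute_node.reimage("myapp-pool", node.id)
```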
Validate batch loads
- Job errors
- Timeout
- Max wall clock time defines max time allowed for job to run from the time it was created
- Default value is unlimited
- If max is reached, running tasks are killed
- Increase max wall clock value to prevent timeout
- Failure of job-related tasks
- Each job can have optional job preparation and release tasks
- The job preparation task runs on each node before any of the job's tasks run on that node
- The job release task runs on those nodes when the job terminates
- Failures can occur in these tasks
- Task errors
- Task waiting - dependency on another task
- Task timeout - check the max wall clock time
- Missing app packages or resource files
- Error in command defined in the task
- Check stdout and stderr logs for details
- In the Batch portal under node details, you can specify a container where log files are stored for future reference
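A sketch for checking failed tasks and pulling their stderr with the azure-batch SDK, reusing the `batch_client` from earlier (the job name is a placeholder):

```python
from azure.batch import models as batchmodels

# Walk the job's tasks, report failures, and pull stderr while the task
# directory still exists on the node (i.e. within the retention time).
for task in batch_client.task.list("myapp-job"):
    info = task.execution_info
    if info and info.result == batchmodels.TaskExecutionResult.failure:
        reason = info.failure_info.message if info.failure_info else "unknown"
        print(task.id, "exit code:", info.exit_code, "failure:", reason)
        stderr = batch_client.file.get_from_task("myapp-job", task.id, "stderr.txt")
        print(b"".join(stderr).decode())
```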
Configure batch retention
- Retention time defines how long to keep task directory on node once task is complete
- Configure at Job level or Task level
- Retention time field in advanced settings
- Default is 7 days, unless the task is deleted or the node is removed first
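Retention can also be set in code; a sketch at task level with the azure-batch SDK (the values and command are illustrative):

```python
from datetime import timedelta
from azure.batch import models as batchmodels

# Task-level constraints: retention of the task directory, a timeout, and retries.
task = batchmodels.TaskAddParameter(
    id="task-001",
    command_line="cmd /c echo hello",
    constraints=batchmodels.TaskConstraints(
        retention_time=timedelta(days=1),        # keep the task directory for 1 day
        max_wall_clock_time=timedelta(hours=2),  # task-level timeout
        max_task_retry_count=1,
    ),
)
```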
Manage data pipelines in Azure Data Factory or Azure Synapse Pipelines
- Ways to run pipelines
- Debug Run
- Don't need to save changes
- Directly run pipelines with draft changes
- Manual, can't be scheduled
- Trigger Run
- Need to publish changes first
- Only runs published version of pipeline
- Can be manual or scheduled
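A manual run of the published pipeline can also be started programmatically; a short sketch with azure-mgmt-datafactory, reusing the `adf_client` and pipeline name from the earlier sketch:

```python
# Start a run of the published pipeline and check its status.
run = adf_client.pipelines.create_run("my-rg", "my-adf", "RunMyAppPipeline", parameters={})
status = adf_client.pipeline_runs.get("my-rg", "my-adf", run.run_id).status
print(status)  # e.g. Queued / InProgress / Succeeded / Failed
```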
Schedule data pipelines in Data Factory or Azure Synapse Pipelines
- Trigger types
- Scheduled - run on wall-clock schedule
- Tumbling window - run at periodic intervals while maintaining state
- Storage event - run pipeline when file is uploaded or deleted from a storage account
- Custom event trigger - runs pipeline when event is raised by Azure Event Grid
- Scheduled vs tumbling triggers
- Scheduled
- Only supports future-dated loads
- Does not maintain state, only fire and forget
- Tumbling
- Can run back-dated and future-dated loads
- Maintains state (completed loads)
- Passes start and end timestamps of window as parameters
- Can be used to add dependency between pipelines, allowing complex scenarios
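For reference, this is how a tumbling window trigger's window bounds are usually mapped to pipeline parameters (ADF expression strings; the parameter names are illustrative):

```python
# Trigger "Pipeline parameters" mapping: the tumbling window trigger exposes the
# window start/end, which the pipeline can use to load only that time slice.
window_parameters = {
    "windowStart": "@trigger().outputs.windowStartTime",
    "windowEnd": "@trigger().outputs.windowEndTime",
}
```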
Implement version control for pipeline artifacts
- Authoring modes
- Live mode (default)
- Authoring directly against pipelines
- No option to save draft changes
- Need to publish to save valid changes
- Need manually created ARM templates to deploy pipelines to other environments
- Git Repo mode
- Repo can be in ADO or GitHub
- All artifacts can be stored in source control
- Draft changes can be saved even if not valid
- Autogenerates ARM templates for deployment in other environments
- Enables DevOps features (PRs, reviews, collab)
Manage Spark jobs in a pipeline
- Pipeline activities for Spark
- Synapse - Spark notebook, Spark job
- Databricks - notebook, Jar file, Python file
- HDInsight activities - Spark Jar/script
- Monitoring Spark activities
- Monitoring built in to ADF
- Platform monitoring (Synapse, Databricks)
- In ADF/Synapse, go to Monitor --> Apache Spark applications and select a specific run for details
- Spark UI
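As a rough sketch, adding a Databricks notebook activity to a pipeline with azure-mgmt-datafactory, reusing the `adf_client` from earlier; this assumes an existing Databricks linked service named "AzureDatabricksLS", and model names may differ slightly between SDK versions:

```python
from azure.mgmt.datafactory.models import (
    DatabricksNotebookActivity, LinkedServiceReference, PipelineResource,
)

# Pipeline with a single Databricks notebook activity (Spark runs in Databricks).
spark_activity = DatabricksNotebookActivity(
    name="RunSparkNotebook",
    notebook_path="/Shared/transform_sales",   # placeholder notebook path
    base_parameters={"run_date": "2024-01-01"},
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureDatabricksLS"
    ),
)
adf_client.pipelines.create_or_update(
    "my-rg", "my-adf", "SparkPipeline", PipelineResource(activities=[spark_activity])
)
```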