This is a collection of the most common bash scripts for automating Databricks.
All the scenarios depend on the Databricks CLI being installed and configured. These examples also make heavy use of jq, which is part of most Linux distributions.
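If you need to set these up first, a minimal sketch (assuming the pip-distributed legacy Databricks CLI and a Debian-based distro) looks like this:

# install the Databricks CLI and jq
pip install databricks-cli
sudo apt-get install -y jq

# configure the CLI interactively with a workspace URL and a personal access token
databricks configure --token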
Create or Update a Cluster Instance Pool
Input:
- POOL_NAME env var.
- CONFIG_PATH env var.
Using the Instance Pools CLI.
#!/bin/bash
# look up the pool ID by name; an empty result means the pool does not exist yet
export POOL_ID=$(databricks instance-pools list --output JSON | jq -r --arg I "$POOL_NAME" '.instance_pools[] | select(.instance_pool_name == $I) | .instance_pool_id')
if [[ "$POOL_ID" == "" ]]; then
  echo "creating pool"
  databricks instance-pools create --json-file "$CONFIG_PATH"
else
  echo "pool already exists, issuing edit on pool $POOL_ID"
  # substitute $POOL_ID (and any other env vars) into the config before editing
  envsubst < "$CONFIG_PATH" > ./tmp.json
  cat ./tmp.json
  databricks instance-pools edit --json-file ./tmp.json
fi
The pool configuration file looks like this (note the POOL_ID being replaced by envsubst):
{
  "instance_pool_name": "General",
  "instance_pool_id": "$POOL_ID",
  "min_idle_instances": 1,
  "max_capacity": 10,
  "node_type_id": "Standard_DS3_v2",
  "idle_instance_autotermination_minutes": 60,
  "enable_elastic_disk": true,
  "preloaded_spark_versions": [
    "7.3.x-scala2.12"
  ],
  "azure_attributes": {
    "availability": "ON_DEMAND_AZURE"
  }
}
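Assuming the script is saved as create_pool.sh and the config as pool.json (both names are just placeholders), a run looks like this:

POOL_NAME="General" CONFIG_PATH="./pool.json" ./create_pool.sh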
Create or Update a Cluster
Attaching to a pool is supported!
Input:
- CLUSTER_NAME env var.
- CONFIG_PATH env var.
- POOL_NAME env var.
Using the Clusters CLI.
#!/bin/bash
# look up the cluster and pool IDs by name
export CLUSTER_ID=$(databricks clusters list --output JSON | jq -r --arg I "$CLUSTER_NAME" '.clusters[] | select(.cluster_name == $I) | .cluster_id')
export POOL_ID=$(databricks instance-pools list --output JSON | jq -r --arg I "$POOL_NAME" '.instance_pools[] | select(.instance_pool_name == $I) | .instance_pool_id')
# substitute CLUSTER_NAME, CLUSTER_ID and POOL_ID into the config
envsubst < "$CONFIG_PATH" > ./tmp.json
cat ./tmp.json
if [[ "$CLUSTER_ID" == "" ]]; then
  echo "creating new cluster"
  databricks clusters create --json-file ./tmp.json
else
  echo "cluster already exists, issuing edit on cluster $CLUSTER_ID"
  databricks clusters edit --json-file ./tmp.json
fi
Sample cluster config:
{
  "cluster_name": "$CLUSTER_NAME",
  "cluster_id": "$CLUSTER_ID",
  "spark_version": "7.3.x-scala2.12",
  "autoscale": {
    "min_workers": 1,
    "max_workers": 4
  },
  "num_workers": 1,
  "spark_conf": {
    "spark.databricks.delta.preview.enabled": "true"
  },
  "instance_pool_id": "$POOL_ID"
}
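As with the pool script, the inputs are passed as environment variables (the script, config and cluster names below are placeholders):

CLUSTER_NAME="etl-cluster" POOL_NAME="General" CONFIG_PATH="./cluster.json" ./create_cluster.sh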
Create or Update a Job by Name
Input:
- JOB_NAME set as an environment variable.
- job.json is a local file describing the job.
Using the Jobs CLI.
#!/bin/bash
# check if the job exists (by name) and get its ID
JOB_ID=$(databricks jobs list --output JSON | jq -r --arg I "$JOB_NAME" '.jobs[] | select(.settings.name == $I) | .job_id')
if [[ "$JOB_ID" == "" ]]; then
  echo "creating a new job"
  databricks jobs create --json-file job.json
else
  echo "updating job $JOB_ID"
  databricks jobs reset --job-id "$JOB_ID" --json-file job.json
fi
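For reference, a minimal job.json sketch, assuming a notebook task running on a new cluster (the job name, notebook path and node type below are placeholders), could look like this:

{
  "name": "my-etl-job",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 1
  },
  "notebook_task": {
    "notebook_path": "/Shared/my-notebook"
  },
  "max_retries": 1
}

Note that the name field should match the JOB_NAME environment variable, otherwise the lookup above will never find the existing job and will keep creating new ones.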
Terminate All Job Runs and Start Again
Input:
- JOB_ID env var.
#!/bin/bash
# stop all active runs of the job, one at a time
until [[ -z $(databricks runs list --job-id "$JOB_ID" --active-only --output JSON | jq '.runs | .[]? | .run_id') ]] ;
do
  echo "job is still running...."
  # pick one active run and cancel it; the loop re-checks until none are left
  RUN_ID=$(databricks runs list --job-id "$JOB_ID" --active-only --output JSON | jq '.runs | .[]? | .run_id' | head -n 1)
  echo "cancelling run '$RUN_ID'"
  databricks runs cancel --run-id "$RUN_ID" > /dev/null
  sleep 5s
done
# start the job again
echo "starting job $JOB_ID"
databricks jobs run-now --job-id "$JOB_ID"
Note the jq syntax (.[]?) that guards against a missing or null runs array when the job has no active runs.
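To see the difference the optional iterator makes, here is a quick standalone check (the payloads are made up for illustration):

# when there are no runs, the response may have no "runs" array; .[]? just yields nothing
echo '{}' | jq '.runs | .[]? | .run_id'
# without the '?', jq fails with "Cannot iterate over null"
echo '{}' | jq '.runs | .[] | .run_id'
# with active runs present, the run IDs are emitted as usual
echo '{"runs":[{"run_id":42}]}' | jq '.runs | .[]? | .run_id'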
Tips
The easiest way to set up the CLI (especially in a CI/CD environment) is to set two environment variables: DATABRICKS_HOST and DATABRICKS_TOKEN.
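For example (the host URL and token below are placeholders; in CI/CD they would normally come from pipeline secrets):

export DATABRICKS_HOST="https://adb-1234567890123456.7.azuredatabricks.net"
export DATABRICKS_TOKEN="dapiXXXXXXXXXXXXXXXXXXXXXXXX"

# the CLI picks both variables up without running `databricks configure`
databricks clusters list --output JSON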
P.S. Originally posted on my blog.
Top comments (1)
JOB_ID=$(databricks jobs list --output JSON | jq -r --arg I "$JOB_NAME" '.jobs[] | select (.settings.name == $I) | .job_id')
While running the above line, it throws an error for me like: jq: error (at :371): Cannot index array with string "jobs"
Could you please help me here? I have a similar kind of requirement.