Automating Databricks with Bash

Ivan G ・ 3 min read

This is a collection of the most common bash scripts for automating Databricks.

All scenarios require the Databricks CLI to be installed and configured. The examples also rely heavily on jq, which is available in most Linux distros' package repositories.
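A quick way to check both prerequisites before running any of the scripts (the install hints in the comments are assumptions — pip for the CLI, a Debian-style package manager for jq):

```shell
#!/bin/bash

# report whether each required tool is on PATH
for tool in databricks jq; do
  if command -v "$tool" > /dev/null 2>&1; then
    echo "$tool: found"
  else
    # e.g. `pip install databricks-cli` / `sudo apt-get install jq`
    echo "$tool: missing"
  fi
done
```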

Create or Update a Cluster Instance Pool

Input:

  • POOL_NAME env var.
  • CONFIG_PATH env var.

Using Instance Pools CLI.

#!/bin/bash

export POOL_ID=$(databricks instance-pools list --output JSON | jq -r --arg I "$POOL_NAME" '.instance_pools[] | select (.instance_pool_name == $I) | .instance_pool_id')

if [[ "$POOL_ID" == "" ]]; then
    echo "creating pool"
    databricks instance-pools create --json-file "$CONFIG_PATH"
else
    echo "pool already exists, issuing edit on pool $POOL_ID"
    envsubst < "$CONFIG_PATH" > ./tmp.json
    cat ./tmp.json
    databricks instance-pools edit --json-file ./tmp.json
fi

The pool configuration file looks like this (note the $POOL_ID placeholder, which envsubst replaces):

{
    "instance_pool_name": "General",
    "instance_pool_id": "$POOL_ID",
    "min_idle_instances": 1,
    "max_capacity": 10,
    "node_type_id": "Standard_DS3_v2",
    "idle_instance_autotermination_minutes": 60,
    "enable_elastic_disk": true,
    "preloaded_spark_versions": [
        "7.3.x-scala2.12"
      ],
    "azure_attributes": {
      "availability": "ON_DEMAND_AZURE"
    }
}

Create or Update a Cluster

Attaching to a pool is supported!

Input:

  • CLUSTER_NAME env var.
  • CONFIG_PATH env var.
  • POOL_NAME env var.

Using Clusters CLI.

#!/bin/bash

export CLUSTER_ID=$(databricks clusters list --output JSON | jq -r --arg I "$CLUSTER_NAME" '.clusters[] | select(.cluster_name == $I) | .cluster_id')
export POOL_ID=$(databricks instance-pools list --output JSON | jq -r --arg I "$POOL_NAME" '.instance_pools[] | select (.instance_pool_name == $I) | .instance_pool_id')

envsubst < "$CONFIG_PATH" > ./tmp.json

cat ./tmp.json

if [[ "$CLUSTER_ID" == "" ]]; then
    echo "creating new cluster"
    databricks clusters create --json-file ./tmp.json
else
    echo "cluster already exists, issuing edit on cluster $CLUSTER_ID"
    databricks clusters edit --json-file ./tmp.json
fi


Sample cluster config:

{
    "cluster_name": "$CLUSTER_NAME",
    "cluster_id": "$CLUSTER_ID",
    "spark_version": "7.3.x-scala2.12",
    "autoscale": {
        "min_workers": 1,
        "max_workers": 4
    },
    "num_workers": 1,
    "spark_conf": {
        "spark.databricks.delta.preview.enabled": "true"
    },
    "instance_pool_id": "$POOL_ID"
}

Create or Update a Job by Name

Input:

  • JOB_NAME set as environment variable.
  • job.json is a local file describing the job.

Using Jobs CLI.

# check if the job exists (by name) and get its ID
JOB_ID=$(databricks jobs list --output JSON | jq -r --arg I "$JOB_NAME" '.jobs[] | select (.settings.name == $I) | .job_id')

if [[ "$JOB_ID" == "" ]]; then
  echo "creating a new job"
  databricks jobs create --json-file job.json
else
  echo "updating job $JOB_ID"
  databricks jobs reset --job-id $JOB_ID --json-file job.json
fi
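The jq lookup used above can be tried offline against canned output; the job names and IDs below are made up:

```shell
# simulated `databricks jobs list --output JSON` response
json='{"jobs":[{"job_id":11,"settings":{"name":"etl"}},{"job_id":22,"settings":{"name":"report"}}]}'

# select the job whose settings.name matches, and print its job_id
echo "$json" | jq -r --arg I "etl" '.jobs[] | select(.settings.name == $I) | .job_id'
# → 11
```

The `--arg I "$JOB_NAME"` form passes the shell variable into jq safely, with no quoting surprises even if the name contains spaces.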

Terminate All Job Runs and Start Again

Input:

  • JOB_ID env var.

#!/bin/bash

# stop all active runs of the job
until [[ -z $(databricks runs list --job-id "$JOB_ID" --active-only --output JSON | jq '.runs | .[]? | .run_id') ]];
do
  echo "job is still running..."

  # there may be several concurrent runs — cancel each one
  for RUN_ID in $(databricks runs list --job-id "$JOB_ID" --active-only --output JSON | jq '.runs | .[]? | .run_id'); do
    echo "cancelling run '$RUN_ID'"
    databricks runs cancel --run-id "$RUN_ID" > /dev/null
  done

  sleep 5
done

# start the job again
echo "starting job $JOB_ID"
databricks jobs run-now --job-id "$JOB_ID"

Note the jq `.[]?` syntax, which safely handles a null runs array when the job has no active runs.
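The `.[]?` form is easy to see in isolation — unlike plain `.[]`, it emits nothing instead of raising an error when the array is null:

```shell
# no active runs: .runs is null, and `.[]?` yields nothing instead of an error
echo '{"runs": null}' | jq '.runs | .[]? | .run_id'

# one active run: the run_id comes through as usual
echo '{"runs": [{"run_id": 7}]}' | jq '.runs | .[]? | .run_id'
# → 7
```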

Tips

The easiest way to set up the CLI (especially in a CI/CD environment) is to set two environment variables: DATABRICKS_HOST and DATABRICKS_TOKEN.
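For example (both values below are placeholders — substitute your own workspace URL and a personal access token):

```shell
# hypothetical workspace URL and token — use your own
export DATABRICKS_HOST="https://adb-1234567890123456.7.azuredatabricks.net"
export DATABRICKS_TOKEN="dapi0123456789abcdef"

# with these set, the CLI authenticates without an interactive `databricks configure`
```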

P.S. Originally posted on my blog.
