This is a collection of the most common bash scripts for automating Databricks.
All the scenarios depend on the Databricks CLI being installed and configured. These examples also make heavy use of jq, which is part of most Linux distributions.
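If you need to set these up first, a minimal sketch (assuming the pip-distributed legacy Databricks CLI and a Debian-based distro) looks like this:

# install the Databricks CLI and jq
pip install databricks-cli
sudo apt-get install -y jq

# configure the CLI interactively with a workspace URL and a personal access token
databricks configure --token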
Create or Update a Cluster Instance Pool
Input:
- POOL_NAME env var.
- CONFIG_PATH env var.
Using the Instance Pools CLI.
#!/bin/bash
# look up the pool ID by name; an empty result means the pool does not exist yet
export POOL_ID=$(databricks instance-pools list --output JSON | jq -r --arg I "$POOL_NAME" '.instance_pools[] | select(.instance_pool_name == $I) | .instance_pool_id')
if [[ "$POOL_ID" == "" ]]; then
  echo "creating pool"
  databricks instance-pools create --json-file "$CONFIG_PATH"
else
  echo "pool already exists, issuing edit on pool $POOL_ID"
  # substitute $POOL_ID (and any other env vars) into the config before editing
  envsubst < "$CONFIG_PATH" > ./tmp.json
  cat ./tmp.json
  databricks instance-pools edit --json-file ./tmp.json
fi
The pool configuration file looks like this (note the POOL_ID being replaced by envsubst):
{
  "instance_pool_name": "General",
  "instance_pool_id": "$POOL_ID",
  "min_idle_instances": 1,
  "max_capacity": 10,
  "node_type_id": "Standard_DS3_v2",
  "idle_instance_autotermination_minutes": 60,
  "enable_elastic_disk": true,
  "preloaded_spark_versions": [
    "7.3.x-scala2.12"
  ],
  "azure_attributes": {
    "availability": "ON_DEMAND_AZURE"
  }
}
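Assuming the script is saved as create_pool.sh and the config as pool.json (both names are just placeholders), a run looks like this:

POOL_NAME="General" CONFIG_PATH="./pool.json" ./create_pool.sh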
Create or Update a Cluster
Attaching to a pool is supported!
Input:
- CLUSTER_NAME env var.
- CONFIG_PATH env var.
- POOL_NAME env var.
Using the Clusters CLI.
#!/bin/bash
# look up the cluster and pool IDs by name
export CLUSTER_ID=$(databricks clusters list --output JSON | jq -r --arg I "$CLUSTER_NAME" '.clusters[] | select(.cluster_name == $I) | .cluster_id')
export POOL_ID=$(databricks instance-pools list --output JSON | jq -r --arg I "$POOL_NAME" '.instance_pools[] | select(.instance_pool_name == $I) | .instance_pool_id')
# substitute CLUSTER_NAME, CLUSTER_ID and POOL_ID into the config
envsubst < "$CONFIG_PATH" > ./tmp.json
cat ./tmp.json
if [[ "$CLUSTER_ID" == "" ]]; then
  echo "creating new cluster"
  databricks clusters create --json-file ./tmp.json
else
  echo "cluster already exists, issuing edit on cluster $CLUSTER_ID"
  databricks clusters edit --json-file ./tmp.json
fi
Sample cluster config:
{
  "cluster_name": "$CLUSTER_NAME",
  "cluster_id": "$CLUSTER_ID",
  "spark_version": "7.3.x-scala2.12",
  "autoscale": {
    "min_workers": 1,
    "max_workers": 4
  },
  "num_workers": 1,
  "spark_conf": {
    "spark.databricks.delta.preview.enabled": "true"
  },
  "instance_pool_id": "$POOL_ID"
}
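As with the pool script, the inputs are passed as environment variables (the script, config and cluster names below are placeholders):

CLUSTER_NAME="etl-cluster" POOL_NAME="General" CONFIG_PATH="./cluster.json" ./create_cluster.sh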
Create or Update a Job by Name
Input:
- JOB_NAME set as an environment variable.
- job.json is a local file describing the job.
Using the Jobs CLI.
#!/bin/bash
# check if the job exists (by name) and get its ID
JOB_ID=$(databricks jobs list --output JSON | jq -r --arg I "$JOB_NAME" '.jobs[] | select(.settings.name == $I) | .job_id')
if [[ "$JOB_ID" == "" ]]; then
  echo "creating a new job"
  databricks jobs create --json-file job.json
else
  echo "updating job $JOB_ID"
  databricks jobs reset --job-id "$JOB_ID" --json-file job.json
fi
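For reference, a minimal job.json sketch, assuming a notebook task running on a new cluster (the job name, notebook path and node type below are placeholders), could look like this:

{
  "name": "my-etl-job",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 1
  },
  "notebook_task": {
    "notebook_path": "/Shared/my-notebook"
  },
  "max_retries": 1
}

Note that the name field should match the JOB_NAME environment variable, otherwise the lookup above will never find the existing job and will keep creating new ones.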
Terminate All Job Runs and Start Again
Input:
- JOB_ID env var.
#!/bin/bash
# stop all active runs of the job, one at a time
until [[ -z $(databricks runs list --job-id "$JOB_ID" --active-only --output JSON | jq '.runs | .[]? | .run_id') ]] ;
do
  echo "job is still running...."
  # pick one active run and cancel it; the loop re-checks until none are left
  RUN_ID=$(databricks runs list --job-id "$JOB_ID" --active-only --output JSON | jq '.runs | .[]? | .run_id' | head -n 1)
  echo "cancelling run '$RUN_ID'"
  databricks runs cancel --run-id "$RUN_ID" > /dev/null
  sleep 5s
done
# start the job again
echo "starting job $JOB_ID"
databricks jobs run-now --job-id "$JOB_ID"
Note the jq syntax (.[]?) that guards against a missing or null runs array when the job has no active runs.
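To see the difference the optional iterator makes, here is a quick standalone check (the payloads are made up for illustration):

# when there are no runs, the response may have no "runs" array; .[]? just yields nothing
echo '{}' | jq '.runs | .[]? | .run_id'
# without the '?', jq fails with "Cannot iterate over null"
echo '{}' | jq '.runs | .[] | .run_id'
# with active runs present, the run IDs are emitted as usual
echo '{"runs":[{"run_id":42}]}' | jq '.runs | .[]? | .run_id'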
Tips
The easiest way to set up the CLI (especially in a CI/CD environment) is to set two environment variables: DATABRICKS_HOST and DATABRICKS_TOKEN.
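For example (the host URL and token below are placeholders; in CI/CD they would normally come from pipeline secrets):

export DATABRICKS_HOST="https://adb-1234567890123456.7.azuredatabricks.net"
export DATABRICKS_TOKEN="dapiXXXXXXXXXXXXXXXXXXXXXXXX"

# the CLI picks both variables up without running `databricks configure`
databricks clusters list --output JSON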
P.S. Originally posted on my blog.
Top comments (1)
JOB_ID=$(databricks jobs list --output JSON | jq -r --arg I "$JOB_NAME" '.jobs[] | select (.settings.name == $I) | .job_id')
While running the above line, it throws an error for me like: jq: error (at :371): Cannot index array with string "jobs"
Could you please help me here? I have a similar kind of requirement.