I'm a big fan of the command line. It can seem daunting at first, but with a little time and patience you can often speed up many tasks just by knowing some useful commands and how to chain them together.
Most of the time I'm in PowerShell which, thanks to PowerShell Core, is now cross-platform and incredibly powerful. But I'm finding myself using Nu more and more as well. In both shells I also use the Databricks CLI a lot. Want to check the status of jobs? Use the CLI. Want to upload and download data? Use the CLI. And so on.
Whilst the Databricks CLI is useful, there are times when I want a little more power over it. For example, using the CLI to find a Databricks runtime version which is under Long Term Support (LTS) and is Photon enabled. I could do this with the Databricks CLI and some jq, but I'm also lazy and wanted something that's easier to query, displays more nicely, and is easier to output to something like CSV afterwards.
Well, I can get all of that from Nushell. The only downside is that it takes quite a few commands to get the data into the right shape to make querying it easy. So, instead, let's do the tedious bits once and save them as command aliases. Let's fire up Nushell and give it a go.
First up, let's find our config file.
> config path
C:\Users\DarrenFuller\AppData\Roaming\nushell\nu\config\config.toml
Yours will look different to this, but this is the file we need to add our command aliases to.
Now, let's work out what our command looks like. I want a command that calls the Databricks CLI for the runtime versions and adds some useful information, such as whether it's an LTS version. So what does that look like?
> databricks clusters spark-versions
| from json
| get versions
| insert isLTS { get name | str contains "LTS" }
| insert isML { get name | str contains "ML" }
| insert photonEnabled { get name | str contains -i "Photon" }
| insert details { get name | parse "{runtime} (includes Apache Spark {spark},{remainder}" }
| insert runtime { get details.runtime }
| insert spark { get details.spark }
| reject details
I've put that over multiple lines to make it easier to read, but if you want to run it you'll need to have it all on the same line, like this.
> databricks clusters spark-versions | from json | get versions | insert isLTS { get name | str contains "LTS" } | insert isML { get name | str contains "ML" } | insert photonEnabled { get name | str contains -i "Photon" } | insert details { get name | parse "{runtime} (includes Apache Spark {spark},{remainder}" } | insert runtime { get details.runtime } | insert spark { get details.spark } | reject details
So what's it doing? Let's break it down a bit.
databricks clusters spark-versions
Run the Databricks CLI to get the available runtime information
from json
Parses the response from JSON as a table
get versions
Gets the "versions" part of the response object
insert isLTS { get name | str contains "LTS" }
Adds a new "isLTS" column by looking for the term "LTS" in the runtime name
insert isML { get name | str contains "ML" }
Adds a new "isML" column by looking for the term "ML" in the runtime name
insert photonEnabled { get name | str contains -i "Photon" }
Adds a new "photonEnabled" column by doing a case-insensitive search for "Photon" in the runtime name
insert details { get name | parse "{runtime} (includes Apache Spark {spark},{remainder}" }
Adds a new "details" column by parsing the name and extracting key information (the fields in curly braces)
insert runtime { get details.runtime }
Adds a new "runtime" column by getting the runtime information from the details column
insert spark { get details.spark }
Adds a new "spark" column by getting the spark version information from the details column
reject details
Removes the "details" column
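If Nushell isn't to hand, the same shaping logic can be sketched in Python. This is only an illustration, not part of the workflow above: the regex mirrors the parse pattern, and the sample record is one row taken from the output further down.

```python
import re

# Pattern mirroring the Nushell parse string:
# "{runtime} (includes Apache Spark {spark},{remainder}"
NAME_RE = re.compile(r"(?P<runtime>.+) \(includes Apache Spark (?P<spark>[^,]+),(?P<remainder>.*)")

def shape(version):
    """Derive the same columns the pipeline above inserts."""
    name = version["name"]
    row = dict(version)
    row["isLTS"] = "LTS" in name
    row["isML"] = "ML" in name
    # Case-insensitive, like `str contains -i`
    row["photonEnabled"] = "photon" in name.lower()
    m = NAME_RE.match(name)
    if m:
        row["runtime"] = m.group("runtime")
        row["spark"] = m.group("spark")
    return row

# One record in the shape returned under "versions" by the CLI
sample = {"key": "9.1.x-photon-scala2.12",
          "name": "9.1 LTS Photon (includes Apache Spark 3.1.2, Scala 2.12)"}
print(shape(sample))
```

Running this on the sample record flags it as LTS and Photon enabled, and pulls out "9.1 LTS Photon" and "3.1.2" as the runtime and Spark version.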
That's a lot of commands to run each time, so let's instead save this as a command alias in our config file.
startup = [
"alias dbx-runtimes = ( databricks clusters spark-versions | from json | get versions | insert isLTS { get name | str contains \"LTS\" } | insert isML { get name | str contains \"ML\" } | insert photonEnabled { get name | str contains -i \"Photon\" } | insert details { get name | parse \"{runtime} (includes Apache Spark {spark},{remainder}\" } | insert runtime { get details.runtime } | insert spark { get details.spark } | reject details )"
]
Here I've aliased the command with the name dbx-runtimes. I've also had to escape the double-quotation marks. But now that we have this we can run all of the above by simply calling the alias.
> dbx-runtimes
────┬──────────────────────────────────┬────────────────────────────────────────────────────────────────────┬───────┬───────┬───────────────┬────────────────────────────┬───────
# │ key │ name │ isLTS │ isML │ photonEnabled │ runtime │ spark
────┼──────────────────────────────────┼────────────────────────────────────────────────────────────────────┼───────┼───────┼───────────────┼────────────────────────────┼───────
0 │ 6.4.x-esr-scala2.11 │ 6.4 Extended Support (includes Apache Spark 2.4.5, Scala 2.11) │ false │ false │ false │ 6.4 Extended Support │ 2.4.5
1 │ 7.3.x-cpu-ml-scala2.12 │ 7.3 LTS ML (includes Apache Spark 3.0.1, Scala 2.12) │ true │ true │ false │ 7.3 LTS ML │ 3.0.1
2 │ 7.3.x-hls-scala2.12 │ 7.3 LTS Genomics (includes Apache Spark 3.0.1, Scala 2.12) │ true │ false │ false │ 7.3 LTS Genomics │ 3.0.1
3 │ 10.2.x-gpu-ml-scala2.12 │ 10.2 ML (includes Apache Spark 3.2.0, GPU, Scala 2.12) │ false │ true │ false │ 10.2 ML │ 3.2.0
4 │ 7.3.x-gpu-ml-scala2.12 │ 7.3 LTS ML (includes Apache Spark 3.0.1, GPU, Scala 2.12) │ true │ true │ false │ 7.3 LTS ML │ 3.0.1
5 │ 8.4.x-photon-scala2.12 │ 8.4 Photon (includes Apache Spark 3.1.2, Scala 2.12) │ false │ false │ true │ 8.4 Photon │ 3.1.2
6 │ 10.1.x-photon-scala2.12 │ 10.1 Photon (includes Apache Spark 3.2.0, Scala 2.12) │ false │ false │ true │ 10.1 Photon │ 3.2.0
7 │ 9.1.x-photon-scala2.12 │ 9.1 LTS Photon (includes Apache Spark 3.1.2, Scala 2.12) │ true │ false │ true │ 9.1 LTS Photon │ 3.1.2
8 │ 10.2.x-photon-scala2.12 │ 10.2 Photon (includes Apache Spark 3.2.0, Scala 2.12) │ false │ false │ true │ 10.2 Photon │ 3.2.0
9 │ 8.3.x-scala2.12 │ 8.3 (includes Apache Spark 3.1.1, Scala 2.12) │ false │ false │ false │ 8.3 │ 3.1.1
10 │ 9.0.x-photon-scala2.12 │ 9.0 Photon (includes Apache Spark 3.1.2, Scala 2.12) │ false │ false │ true │ 9.0 Photon │ 3.1.2
11 │ 8.4.x-cpu-ml-scala2.12 │ 8.4 ML (includes Apache Spark 3.1.2, Scala 2.12) │ false │ true │ false │ 8.4 ML │ 3.1.2
12 │ 10.1.x-gpu-ml-scala2.12 │ 10.1 ML (includes Apache Spark 3.2.0, GPU, Scala 2.12) │ false │ true │ false │ 10.1 ML │ 3.2.0
13 │ 9.1.x-scala2.12 │ 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12) │ true │ false │ false │ 9.1 LTS │ 3.1.2
14 │ 10.0.x-cpu-ml-scala2.12 │ 10.0 ML (includes Apache Spark 3.2.0, Scala 2.12) │ false │ true │ false │ 10.0 ML │ 3.2.0
15 │ 9.0.x-gpu-ml-scala2.12 │ 9.0 ML (includes Apache Spark 3.1.2, GPU, Scala 2.12) │ false │ true │ false │ 9.0 ML │ 3.1.2
16 │ 9.0.x-scala2.12 │ 9.0 (includes Apache Spark 3.1.2, Scala 2.12) │ false │ false │ false │ 9.0 │ 3.1.2
17 │ 8.3.x-cpu-ml-scala2.12 │ 8.3 ML (includes Apache Spark 3.1.1, Scala 2.12) │ false │ true │ false │ 8.3 ML │ 3.1.1
18 │ 10.1.x-cpu-ml-scala2.12 │ 10.1 ML (includes Apache Spark 3.2.0, Scala 2.12) │ false │ true │ false │ 10.1 ML │ 3.2.0
19 │ 10.0.x-scala2.12 │ 10.0 (includes Apache Spark 3.2.0, Scala 2.12) │ false │ false │ false │ 10.0 │ 3.2.0
20 │ apache-spark-2.4.x-esr-scala2.11 │ Light 2.4 Extended Support (includes Apache Spark 2.4, Scala 2.11) │ false │ false │ false │ Light 2.4 Extended Support │ 2.4
21 │ 10.1.x-scala2.12 │ 10.1 (includes Apache Spark 3.2.0, Scala 2.12) │ false │ false │ false │ 10.1 │ 3.2.0
22 │ 9.1.x-cpu-ml-scala2.12 │ 9.1 LTS ML (includes Apache Spark 3.1.2, Scala 2.12) │ true │ true │ false │ 9.1 LTS ML │ 3.1.2
23 │ 10.2.x-scala2.12 │ 10.2 (includes Apache Spark 3.2.0, Scala 2.12) │ false │ false │ false │ 10.2 │ 3.2.0
24 │ 10.2.x-cpu-ml-scala2.12 │ 10.2 ML (includes Apache Spark 3.2.0, Scala 2.12) │ false │ true │ false │ 10.2 ML │ 3.2.0
25 │ 8.3.x-photon-scala2.12 │ 8.3 Photon (includes Apache Spark 3.1.1, Scala 2.12) │ false │ false │ true │ 8.3 Photon │ 3.1.1
26 │ 10.0.x-photon-scala2.12 │ 10.0 Photon (includes Apache Spark 3.2.0, Scala 2.12) │ false │ false │ true │ 10.0 Photon │ 3.2.0
27 │ 10.0.x-gpu-ml-scala2.12 │ 10.0 ML (includes Apache Spark 3.2.0, GPU, Scala 2.12) │ false │ true │ false │ 10.0 ML │ 3.2.0
28 │ 8.4.x-scala2.12 │ 8.4 (includes Apache Spark 3.1.2, Scala 2.12) │ false │ false │ false │ 8.4 │ 3.1.2
29 │ 9.1.x-gpu-ml-scala2.12 │ 9.1 LTS ML (includes Apache Spark 3.1.2, GPU, Scala 2.12) │ true │ true │ false │ 9.1 LTS ML │ 3.1.2
30 │ apache-spark-2.4.x-scala2.11 │ Light 2.4 (includes Apache Spark 2.4, Scala 2.11) │ false │ false │ false │ Light 2.4 │ 2.4
31 │ 7.3.x-scala2.12 │ 7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12) │ true │ false │ false │ 7.3 LTS │ 3.0.1
32 │ 8.4.x-gpu-ml-scala2.12 │ 8.4 ML (includes Apache Spark 3.1.2, GPU, Scala 2.12) │ false │ true │ false │ 8.4 ML │ 3.1.2
33 │ 9.0.x-cpu-ml-scala2.12 │ 9.0 ML (includes Apache Spark 3.1.2, Scala 2.12) │ false │ true │ false │ 9.0 ML │ 3.1.2
34 │ 8.3.x-gpu-ml-scala2.12 │ 8.3 ML (includes Apache Spark 3.1.1, GPU, Scala 2.12) │ false │ true │ false │ 8.3 ML │ 3.1.1
────┴──────────────────────────────────┴────────────────────────────────────────────────────────────────────┴───────┴───────┴───────────────┴────────────────────────────┴───────
Your output might look different depending on when you run the command.
But from this we can now start adding in some filters to get to the records we want. So if I want to find all of the runtimes which are Long Term Support but aren't ML instances I can do the following.
> dbx-runtimes | where isLTS | where isML == $false | sort-by key
───┬────────────────────────┬────────────────────────────────────────────────────────────┬───────┬───────┬───────────────┬──────────────────┬───────
# │ key │ name │ isLTS │ isML │ photonEnabled │ runtime │ spark
───┼────────────────────────┼────────────────────────────────────────────────────────────┼───────┼───────┼───────────────┼──────────────────┼───────
0 │ 7.3.x-hls-scala2.12 │ 7.3 LTS Genomics (includes Apache Spark 3.0.1, Scala 2.12) │ true │ false │ false │ 7.3 LTS Genomics │ 3.0.1
1 │ 7.3.x-scala2.12 │ 7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12) │ true │ false │ false │ 7.3 LTS │ 3.0.1
2 │ 9.1.x-photon-scala2.12 │ 9.1 LTS Photon (includes Apache Spark 3.1.2, Scala 2.12) │ true │ false │ true │ 9.1 LTS Photon │ 3.1.2
3 │ 9.1.x-scala2.12 │ 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12) │ true │ false │ false │ 9.1 LTS │ 3.1.2
───┴────────────────────────┴────────────────────────────────────────────────────────────┴───────┴───────┴───────────────┴──────────────────┴───────
A lot simpler to read, and very easy to work with now. And if I want to save the results I can just append | save runtimes.csv and I'll have a CSV with the same data in it.
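For comparison, the filter-and-save step can be sketched in Python too. This is a rough equivalent, with a few hypothetical rows standing in for the alias output:

```python
import csv
import io

# Hypothetical rows in the shape produced by the dbx-runtimes alias
rows = [
    {"key": "7.3.x-scala2.12", "isLTS": True, "isML": False},
    {"key": "9.1.x-cpu-ml-scala2.12", "isLTS": True, "isML": True},
    {"key": "10.2.x-scala2.12", "isLTS": False, "isML": False},
]

# Equivalent of: where isLTS | where isML == $false | sort-by key
lts_only = sorted((r for r in rows if r["isLTS"] and not r["isML"]),
                  key=lambda r: r["key"])

# Equivalent of: | save runtimes.csv (written to a string here for illustration)
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["key", "isLTS", "isML"])
writer.writeheader()
writer.writerows(lts_only)
print(buf.getvalue())
```

The difference in effort is the point: Nushell does the filtering, sorting, and CSV export in one short pipeline.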
I've done the same with the Databricks cluster node types as well; that one is a lot less complex than the above, but it makes querying for the information much simpler. And with Nushell providing great features for filtering, displaying, and exporting data, it's a smooth and easy workflow.