aminekaabachi

Posted on Oct 3, 2020

Building a Python SDK for Azure Databricks

#azure #python #hacktoberfest #databricks

This article is about a new project I started working on recently. Please welcome Azure Databricks SDK Python. As the name suggests 🦄, it is a high-quality Python SDK for the Azure Databricks REST API 2.0.

This article presents the project, its current progress, the release plan, some design choices, and finally the development process and tools.

I spent several nights working and searching for best practices to implement this SDK. I am convinced that it will need more than my humble efforts to become stable and usable in production. Therefore, my friends, all contributions are welcome! Do not hesitate to reach out to me if you want to contribute in any way (docs, code, testing, etc.).

Motivation

I will start with my own use cases:

Mix and match: When working in a competitive field, you have to cope with new tools and platforms popping up out of nowhere. I like to have simple means to mix my most-used tools with the brand-new ones. Integrating directly against raw APIs is time-consuming, which in most cases leads to hacks and error-prone automation.
Keep evolving: Usually, the ecosystem does not integrate preview features. For example, Data Factory still does not include the Docker options (DCS) when creating a cluster in the Databricks activity. Sometimes, you need to force the update. An SDK could be very useful for building custom connectors.
Custom bricks: You want to build on top of Azure Databricks. Let's imagine you have a great idea that you must prototype as fast as you can, for a startup or a hackathon. If some of its blocks could be done by Azure Databricks, an SDK will help you do it in no time. You will be able to bundle everything in your app or package (et voilà!).

I think that's plenty to justify the need. Nonetheless, I am quite sure there are additional use cases. I'll be excited to hear yours in the comments (on Medium).

Specs

I want to create an SDK with the following features:

A clear, standard way to access the APIs (e.g. through a client).
Custom types for API requests and results.
Support for personal access token authentication.
Support for Azure AD authentication.
Support for the use of Azure AD service principals.
Free-style API calls with a force mode (bypasses type validation).
Error handling and proxy support.

In a nutshell, it should support all available authentication methods and manage operations on Azure Databricks through typed objects. It should have simple methods to access results (no .get() hell) but also keep the possibility of free-style API calls.
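To make the "force mode" idea concrete, here is a minimal sketch of how a wrapper could accept typed objects by default and raw dictionaries only on request. The `build_payload` helper, its `force` parameter, and the trimmed `ClusterAttributes` dataclass are illustrative assumptions, not the SDK's actual code.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Union


@dataclass
class ClusterAttributes:
    """Hypothetical, heavily trimmed request type (illustration only)."""
    cluster_name: str
    spark_version: str
    node_type_id: str


def build_payload(req: Union[ClusterAttributes, Dict[str, Any]],
                  force: bool = False) -> Dict[str, Any]:
    """Return the JSON-ready body for a request.

    Typed objects are always accepted. Raw dicts are accepted only
    when force=True, which skips type validation entirely.
    """
    if isinstance(req, ClusterAttributes):
        return asdict(req)
    if force and isinstance(req, dict):
        return dict(req)  # shallow copy, passed through unchecked
    raise TypeError("expected ClusterAttributes (or a dict with force=True)")
```

With this shape, the typed path stays the default, and the free-style path is an explicit opt-in that is easy to spot in a code review.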

By the way, I looked at other attempts at doing this. Some projects use the underlying packages from databricks-cli, which I think is a clever idea. I wanted to do it from scratch, partly for the fun of it, but also because it should give more flexibility in the future.

Demo / Implementation Progress

This part contains details about the current release, usage demo, and the release plan.

Current release

The current release is v0.0.2. As of this version, here is the implementation progress:

✔ Authentication
✔ Custom types (25%)
✔ API Wrappers (25%)
✔ Error handling (80%)
✗ Proxy support (0%)
✔ Documentation (20%)

The following API wrappers are now fully implemented and tested:

✔ Clusters
✔ Secrets
✔ Tokens

Usage demo

Here is a demo from the SDK Quickstart Guide:

Begin by importing the clients.Client class from the SDK module.

from azure_databricks_sdk_python import Client

You can now instantiate a client object. You need to pass the databricks instance (format: adb-..azuredatabricks.net) and your token:

client = Client(databricks_instance=<instance>, personal_access_token=<token>)

You can create a new cluster using the following:

cluster = client.clusters.create(attributes)

attributes is an instance of types.clusters.ClusterAttributes. So before creating a cluster, you need to define its attributes. Here is an example:

# Assumption: AutoScale lives next to ClusterAttributes in types.clusters.
from azure_databricks_sdk_python.types.clusters import AutoScale, ClusterAttributes

autoscale = AutoScale(min_workers=0, max_workers=1)
attributes = ClusterAttributes(cluster_name="my-cute-cluster",
                               spark_version="7.2.x-scala2.12",
                               node_type_id="Standard_F4s",
                               autoscale=autoscale)

create returns an instance of types.clusters.ClusterInfo. You can access its properties through dot chaining, for example:

cluster.cluster_id
>>>  '0918-220215-atria616'

Release plan

The release plan for the next versions (v0.0.3 and v0.0.4) will be as follows.
I will be focusing on v0.0.3, hence contributors can start working on v0.0.4 if they are interested.

v0.0.3: Jobs, Groups.
v0.0.4: DBFS, Libraries, Workspaces.

They should be released a few weeks from now. The goal is to reach a first stable version, v0.1.0, this year.

Internals & Design Choices

The choices here should save time when changing and extending the SDK. I think laziness is a good motive: although it sounds pejorative, it underlies some software engineering principles (e.g. reuse). Note that the SDK is extended through two operations: an API change or the addition of an authentication method.

Drafting

The fundamental idea of this SDK is to have a Client object as the main interface. It encapsulates the API wrappers, i.e. you can call the clusters API through client.clusters.create(...).

Figure 1: Three-tier architecture class diagram.

At first, I imagined a three-tier architecture:

White layer: The main interface that handles the configuration and abstracts the calls to the API wrappers (by aggregating them).
Green layer: The API wrappers. Also, the type packages that include models for requests and responses from different endpoints.
Yellow layer: Generic API helpers like functions that do HTTP get and post requests, and also HTTP error handlers.
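The three layers above can be sketched in a few lines. This is a simplified illustration, not the SDK's real code: the class bodies and the injected `transport` callable (used so the sketch runs without a network) are assumptions. The `/api/2.0` base path and the `Bearer` token header follow the Databricks REST API 2.0 conventions.

```python
from typing import Any, Callable, Dict

# Hypothetical transport signature: (method, url, headers, body) -> parsed JSON.
Transport = Callable[[str, str, Dict[str, str], Dict[str, Any]], Dict[str, Any]]


class API:
    """Yellow layer: generic HTTP helper shared by every wrapper."""

    def __init__(self, instance: str, token: str, transport: Transport):
        self._base_url = f"https://{instance}/api/2.0"
        self._headers = {"Authorization": f"Bearer {token}"}
        self._transport = transport

    def post(self, endpoint: str, body: Dict[str, Any]) -> Dict[str, Any]:
        return self._transport("POST", f"{self._base_url}/{endpoint}",
                               self._headers, body)


class Clusters:
    """Green layer: one wrapper per API group."""

    def __init__(self, api: API):
        self._api = api

    def create(self, attributes: Dict[str, Any]) -> Dict[str, Any]:
        return self._api.post("clusters/create", attributes)


class Client:
    """White layer: handles configuration AND aggregates the wrappers."""

    def __init__(self, databricks_instance: str, personal_access_token: str,
                 transport: Transport):
        api = API(databricks_instance, personal_access_token, transport)
        self.clusters = Clusters(api)
```

Notice that `Client` both builds the authenticated `API` object and wires up every wrapper, which is exactly the coupling the two questions below probe.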

To test this architecture, I asked two questions:

What do I need to change if I add a new authentication method? In this case, I will need to modify the Client class and the API class. The change can lead (Murphy's law) to regressions in the SDK. This risk can be reduced with a fair amount of tests. Still, it may lead (Murphy's law again) to breaking the open-closed principle (OCP) once multiple versions accumulate, because keeping versions consistent by stacking implementations in a single class is one of the worst practices in development.
What do I need to change if I add a new API wrapper? In this case, I need to modify the Client class too. The main issue here is that the Client class takes on many responsibilities: configuration, aggregation of the API wrappers, etc. You can see that it breaks the single-responsibility principle (SRP) this time (no Murphy needed).

Finally, it was time for solutions. I tried my best. Still, much refactoring and evaluation are needed to improve the SDK quality. If you have ideas to improve the current solution, or a completely different approach that can help in the long run, please get in touch.

Designing

I like LEGO®. I also think that LEGO® makes good metamodels for programming. LEGO® is a master of SRP through its tiny blocks. Inspired, I tried to form a new solution by separating responsibilities in my layers.

Figure 2: Unscientific multitier architecture class diagram.

I started by introducing these changes:

White layer: The main interface is now just a factory, i.e. it instantiates children of BaseClient, which now make up the purple layer.
Purple layer: It handles the configuration for each auth method. If I add a new auth method, the existing ones are not affected. The aggregation of API wrappers is delegated (through composition) to a single-responsibility class called Composer.
Green layer: The API wrappers are all aggregated in a Composer class. This frees the purple layer to handle only configuration logic.
Yellow layer: The yellow layer is now divided into a Factory API class and separate classes that handle generic and specific HTTP operations based on the auth method.
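Here is a rough sketch of how the revised layers could fit together. Only `BaseClient` and `Composer` are names taken from the description above; `make_client`, `PersonalAccessTokenClient`, `PersonalAccessTokenAPI`, and all method bodies are hypothetical, and the yellow layer's own factory is omitted for brevity.

```python
from typing import Dict, Optional


class BaseAPI:
    """Yellow layer: generic HTTP concerns shared by all auth methods."""

    def __init__(self, instance: str):
        self.base_url = f"https://{instance}/api/2.0"

    def headers(self) -> Dict[str, str]:
        raise NotImplementedError


class PersonalAccessTokenAPI(BaseAPI):
    """Yellow layer: the part that is specific to one auth method."""

    def __init__(self, instance: str, token: str):
        super().__init__(instance)
        self._token = token

    def headers(self) -> Dict[str, str]:
        return {"Authorization": f"Bearer {self._token}"}


class Clusters:
    """Green layer wrapper (endpoints omitted in this sketch)."""

    def __init__(self, api: BaseAPI):
        self.api = api


class Secrets:
    """Green layer wrapper (endpoints omitted in this sketch)."""

    def __init__(self, api: BaseAPI):
        self.api = api


class Composer:
    """Single responsibility: attach every API wrapper to one API object."""

    def compose(self, api: BaseAPI) -> "Composer":
        self.clusters = Clusters(api)
        self.secrets = Secrets(api)
        return self


class BaseClient:
    """Purple layer base: exposes whatever the Composer assembled."""

    def __init__(self, api: BaseAPI):
        composed = Composer().compose(api)
        self.clusters = composed.clusters
        self.secrets = composed.secrets


class PersonalAccessTokenClient(BaseClient):
    """Purple layer: configuration for one auth method only."""

    def __init__(self, databricks_instance: str, personal_access_token: str):
        super().__init__(
            PersonalAccessTokenAPI(databricks_instance, personal_access_token))


def make_client(databricks_instance: str,
                personal_access_token: Optional[str] = None) -> BaseClient:
    """White layer: a factory that picks the client for the given credentials."""
    if personal_access_token is not None:
        return PersonalAccessTokenClient(databricks_instance,
                                         personal_access_token)
    raise ValueError("no supported authentication method supplied")
```

In this arrangement, a new auth method means one new `BaseAPI` subclass, one new `BaseClient` subclass, and one branch in the factory. A new API wrapper touches only `Composer` and `BaseClient`.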

As for changes, minimal additions are now needed for the usual extension use-case. It seems to be a good start for now.

Implementing

Apart from abstract design choices, the one thing I hate the most in dealing with raw APIs is what I call ".get() hell". Typing is a powerful concept, and it helps a great deal when dealing with APIs. However, there is a challenge: Azure Databricks API data structures are usually trees of basic types, up to 3 or 4 layers deep. This makes parsing input and output, in order to accept and return custom type objects, a challenging task.
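To illustrate the problem, here is a small sketch contrasting chained .get() calls with typed access. It uses standard-library dataclasses and a hand-written parser purely as a stand-in. The payload is a made-up, heavily trimmed cluster response, and `parse_cluster` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Made-up, trimmed response; real payloads nest several levels deeper.
raw = {
    "cluster_id": "0918-220215-atria616",
    "autoscale": {"min_workers": 0, "max_workers": 1},
}

# ".get() hell": every level needs a default to avoid a crash on missing keys.
max_workers_raw = raw.get("autoscale", {}).get("max_workers")


@dataclass
class AutoScale:
    min_workers: int
    max_workers: int


@dataclass
class ClusterInfo:
    cluster_id: str
    autoscale: Optional[AutoScale] = None


def parse_cluster(data: Dict[str, Any]) -> ClusterInfo:
    """Hand-written parsing: one such function per type, per nesting level."""
    autoscale = data.get("autoscale")
    return ClusterInfo(
        cluster_id=data["cluster_id"],
        autoscale=AutoScale(**autoscale) if autoscale is not None else None,
    )


cluster = parse_cluster(raw)
# Typed access: plain attribute chaining instead of nested lookups.
max_workers_typed = cluster.autoscale.max_workers
```

Writing a `parse_cluster` by hand for every type is the tedious part, and it grows with every extra level of nesting.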

Here are the two libs that made it very easy to solve this challenge:

attrs: a package that will bring back the joy of writing classes by relieving you from the drudgery of implementing object protocols (aka dunder methods). This was useful mainly for type annotations.
cattrs: an open source Python library for structuring and unstructuring data. cattrs works best with attrs classes and the usual Python collections, but other kinds of classes are supported by manually registering converters.

Let's look at an example that uses attrs and cattrs:


>>> from enum import unique, Enum
>>> from typing import List, Optional, Sequence, Union
>>> from cattr import structure, unstructure
>>> import attr
>>>
>>> @unique
... class CatBreed(Enum):
...     SIAMESE = "siamese"
...     MAINE_COON = "maine_coon"
...     SACRED_BIRMAN = "birman"
...
>>> @attr.s
... class Cat:
...     breed: CatBreed = attr.ib()
...     names: Sequence[str] = attr.ib()
...
>>> @attr.s
... class DogMicrochip:
...     chip_id = attr.ib()
...     time_chipped: float = attr.ib()
...
>>> @attr.s
... class Dog:
...     cuteness: int = attr.ib()
...     chip: Optional[DogMicrochip] = attr.ib()
...

Note that __init__ methods (and more) for the @attr.s classes are automatically generated. Now we can convert to and from these models using the unstructure and structure functions.


>>> p = unstructure([Dog(cuteness=1, chip=DogMicrochip(chip_id=1, time_chipped=10.0)),
...                  Cat(breed=CatBreed.MAINE_COON, names=('Fluffly', 'Fluffer'))])
...
>>> print(p)
[{'cuteness': 1, 'chip': {'chip_id': 1, 'time_chipped': 10.0}}, {'breed': 'maine_coon', 'names': ('Fluffly', 'Fluffer')}]
>>> print(structure(p, List[Union[Dog, Cat]]))
[Dog(cuteness=1, chip=DogMicrochip(chip_id=1, time_chipped=10.0)), Cat(breed=<CatBreed.MAINE_COON: 'maine_coon'>, names=['Fluffly', 'Fluffer'])]

As you can see, this helps to go from a low-level representation to structured data and vice versa. I used cattrs to handle requests and responses. I also used attrs to model all the types from the API, and the implementation seems very clean compared to official APIs that struggle to implement this in a clean and readable way (check this out and compare it to this).

My Process / Tools

The CI/CD process for linting, testing, and publishing the package uses GitHub Actions.

Development

Let's start with a preview of the building workflow:


name: Unit Tests
on:
  push:
    branches: [ master ]
    paths:
      - "azure_databricks_sdk_python/**.py"
      - "tests/**.py"
      - ".github/workflows/**.yml"
      - ".coveragerc"
      - "requirements.txt"
      - "requirements-tests.txt"

jobs:
  coverage:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.8.2

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install coveralls coverage
          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
          if [ -f requirements-test.txt ]; then pip install -r requirements-test.txt; fi
      - name: Run Test Suite
        env:
          DATABRICKS_INSTANCE: ${{ secrets.DATABRICKS_INSTANCE }}
          PERSONAL_ACCESS_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          DATABRICKS_INSTANCE_PREMIUM: ${{ secrets.DATABRICKS_INSTANCE_PREMIUM }}
          PERSONAL_ACCESS_TOKEN_PREMIUM: ${{ secrets.PERSONAL_ACCESS_TOKEN_PREMIUM }}
        run: |
          pytest --cov azure_databricks_sdk_python --junitxml=junit/test-results.xml tests/
      - name: Send Results to Coveralls
        env:
          DATABRICKS_INSTANCE: ${{ secrets.DATABRICKS_INSTANCE }}
          PERSONAL_ACCESS_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          DATABRICKS_INSTANCE_PREMIUM: ${{ secrets.DATABRICKS_INSTANCE_PREMIUM }}
          PERSONAL_ACCESS_TOKEN_PREMIUM: ${{ secrets.PERSONAL_ACCESS_TOKEN_PREMIUM }}
          COVERALLS_REPO_TOKEN: ${{ secrets.COVERALLS_REPO_TOKEN }}
        run: |
          coveralls

I used pytest and coveralls for the two steps. Environment variables for my Azure Databricks test workspaces are provided, and their values are read from GitHub repo secrets.
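On the test side, those variables have to be read back. A minimal sketch of how that could look, using the variable names from the workflow above, follows. The `load_workspace_config` helper and the `WorkspaceConfig` tuple are hypothetical and not taken from the project's test suite.

```python
import os
from typing import Mapping, NamedTuple, Optional


class WorkspaceConfig(NamedTuple):
    instance: str
    token: str


def load_workspace_config(env: Mapping[str, str] = os.environ,
                          suffix: str = "") -> Optional[WorkspaceConfig]:
    """Read one workspace's settings from environment variables.

    Pass suffix="_PREMIUM" to read the premium workspace variables.
    Returns None when a value is missing, so a test can skip itself
    instead of failing on a machine without the secrets.
    """
    instance = env.get(f"DATABRICKS_INSTANCE{suffix}")
    token = env.get(f"PERSONAL_ACCESS_TOKEN{suffix}")
    if not instance or not token:
        return None
    return WorkspaceConfig(instance, token)
```

A test module could call this once at import time and skip the whole module when it returns None, which keeps local runs without credentials green.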

I also included a workflow to automatically publish the package when a release is made:

name: Publish to PyPI

on:
  release:
    types: [created]

jobs:
  deploy:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.x'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install setuptools wheel twine pbr
    - name: Build and publish
      env:
        TWINE_USERNAME: ${{ secrets.TWINE_USERNAME }}
        TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
        TWINE_NON_INTERACTIVE: true 
      run: |
        python setup.py sdist bdist_wheel
        twine upload dist/*

It uses twine to publish the package.

Documentation

I used Sphinx with some extensions and readthedocs.org.

This screencast by Mahdi Yusuf will help you get started if you want to contribute to the docs: https://www.youtube-nocookie.com/embed/oJsUvBQyHBs

Resources

Please let me know what you think. You can keep up with the project through:

Originally posted on kaabachi.io
