Thanks to tools like Azure Databricks, we can build simple data pipelines in the cloud and use Spark to get some comprehensive insights into our data with relative ease. Combining this with the Apache Spark connector for Cosmos DB, we can leverage the power of Azure Cosmos DB to gain and store some incredible insights into our data.
It’s been a while since I’ve written a post on Databricks and since I’ve been working with Cosmos DB quite a bit over the past few months, I’d thought I’d write a simple tutorial on how you can use Azure Blob Storage, Azure Databricks and Cosmos DB to build a straightforward data pipeline that does some simple transformations on our source data.
I’m also going to throw a bit of Azure Key Vault into the mix to show you how simple it can be to protect vital secrets in Databricks such as Storage account keys and Cosmos DB endpoints!
This blog post is mainly aimed at beginners. Ideally you would have some idea of what each component is and you’d have some understanding of Python.
Alright, enough chat! Let’s dive in.
About our file
Our source file contains data about various football players taken from the FIFA database. I downloaded it from a Datacamp course I took ages ago (If someone knows which course, let me know and I’ll credit it properly). Our data file has the following columns:
- Height (In cm)
- Foot (which one is most dominant)
- Rare (This is a 0 or 1. Looking at the dataset and from what I know about Football, I’m guessing that 1 is rare and 0 is common. Of the top of my head I’m not sure why this column exists, so we’ll ignore it for now)
- Various skill levels (Pace, shooting, passing, dribbling, defending, heading, diving, handling, kicking, reflexes, speed and positioning).
In this tutorial, we just want to find out which players have a rating over over 85 and persist those players to a collection in Cosmos DB. Really simple stuff just to show how everything fits together.
So we’ve introduced the source file, let’s put it in the cloud!
Creating an Azure Storage Account
If you’re not sure what Azure Storage is, it’s a cloud store for things like blobs, files, messages and NoSQL stores. Using Azure Storage we can make sure our data is secure and easily accessible.
I’m going to create a Storage account and create a container inside our storage account to hold Blobs. Doing this is pretty easy.
In the Azure Portal, click ‘Create a resource’ and choose Storage account. If you can see it, use the search bar to find it.
Next we’ll have to configure it. First off, we’ll need to place it in a resource group. This is essentially just a collections of Azure resources. These come in handy when we’re provisioning resources through CI/CD and it’s also useful when managing resources (like getting rid of resources that we don’t need after doing a tutorial).
Once that’s done, we can give our storage account a name , a location (I’ve selected Australia East since that’s the closest data center to me), what kind of performance we want, what type of account we want, what r eplication level we need and what access tier we want. If you want more explanation as to what each of these settings are, check out the Azure docs.
We can also initialize some advanced settings when creating Storage accounts, like allowing access from specified virtual networks. Again, basic tutorial, just ignore for now. Provided that your configuration is valid, click create and wait for your Storage account to be ready.
Now that our account is ready, we can create a container to store our blobs in.
Click on Blobs under Services, and then click on the Container tab to create our first container. Think of a container in this sense as just a folder within a file directory (not a Docker container). Give a name and a public access level.
If you make a container public, everything inside that container will be accessible to the public. Just keep it private for now.
And just like that, our container has been created!
Alright, click on it and then click upload! I’m going to upload our source file into the container. If the file already exists within your container, you’ll need to overwrite it if it already exists.
Click on Upload and if all is good, you’ve just uploaded a raw data-set into the cloud with a couple of clicks!
Obviously this wouldn’t be the way we’d upload blobs into our Storage account for real apps, but it’s good enough for now.
So we’ve created a Storage account and uploaded our file, let’s move on to provisioning a Databricks workspace.
Creating a Databricks Workspace
Azure Databricks is a unified analytics platform that allows Data Scientists, Data Engineers and Business users to come together to gain advanced insights into their data using the power of a managed Apache Spark service on Azure.
To create a workspace, head back to the Azure Portal and click ‘Create new resource’. In the search, type in Databricks and then select the option for Azure Databricks (see below for an example, provided the portal hasn’t changed)
Now we’ll have to give it some settings. Give your workspace a name , assign it the existing resource group that we created for our storage account, deploy it in a datacenter location near you and assign a pricing tier (I went for premium). You do have the option of deploying your Databricks workspace inside a Virtual Network, but that’s beyond our requirements.
Once our workspace has been deployed, we can navigate to it by clicking on Launch Workspace. This will open up the Databricks UI.
In order to run notebooks and jobs in Databricks, we’ll need to create a cluster. In the Databricks UI, click on the Clusters button in the sidebar and then click Create Cluster.
When we create Clusters in Databricks, we can give them the following settings:
- Name: The name of our cluster
- Cluster Mode: We can create two types of clusters in Databricks. For clusters that have multiple users using them at the same time, we would create a high concurrency cluster (These don’t support Scala). For this tutorial, we’ll create a Standard cluster. These do support the Scala language and they’re designed for single user use.
- Pool: This is a new feature! We can assign a cluster to a pool to reduce startup time. Haven’t played around with this much though.
- Databricks Runtime Version: Sets the image that Databricks will use to create the cluster. This will create a cluster that has a particular version of Scala and Spark
- Python Version: We can choose between Python 2.7 or 3.5 as the version of Python that the cluster uses.
- Enable autoscaling: When we create a cluster, we set the minimum and maximum number of worker nodes that are available to the cluster. If we enable autoscaling, Databricks will automatically scale the amount of worker nodes that it uses when executing Spark jobs.
- Terminate after period: This setting determines the amount of time that we want to pass before the cluster automatically shuts down.
- Worker and Driver Type: Clusters in Databricks essentially run on Virtual Machines. We can set what kind of VM we want our nodes to be on both a worker and driver basis. We can also set them to have the same type of VM if we wish.
- Advanced Options: In here, we can configure our Spark configuration, configure the location where logs are written, load any scripts that we want to run on initialization etc. We won’t use this at all so don’t worry about this for now.
Once you’ve finished giving your cluster its settings, click create and after a few minutes your cluster should be created. It’ll appear underneath the ‘Interactive Clusters’ section in the Cluster UI.
Let’s leave Databricks for now. We have a workspace and cluster all set up (If it’s running, you can go ahead and shut it off). We’ll come back to install our libraries later.
Creating a Cosmos DB account
For this tutorial, I’m creating a Cosmos DB account that uses the SQL API (It’s also called the Core API, but I’ll refer to it as the SQL API for now). This will mean that we’ll be using the Spark connector that supports the SQL API. There’s different connectors for the MongoDB API and for the Cassandra API, so if you’re using those API’s for your Cosmos account or are more familiar with those data stores, make sure you use those connectors instead!
To create our Cosmos DB account, head back to the Azure Portal and click Create a resource. Search for Azure Cosmos DB and click Create :
In the UI, put your new Cosmos DB account in the same resource group that we’ve used for the Storage account and Databricks workspace. Give it a name and choose the Core (SQL) API as the API that our Cosmos DB account will use. Put it in a location near to you and disable Geo-Redundancy and Multi-region writes for now. Click Create to deploy it!
This can take around 4 minutes to create, so you can grab a cup of tea or coffee if you like. By the time you get back, you should have a Cosmos DB account ready and raring to go!
We now need to create a container to store our data. Containers in Cosmos are called different things in different API’s, but you can just think of it as a table to store documents in.
Because we are creating a container for the first time, we will also need to create a new Database to hold that collection in. You should provide the following settings.
- Database Id: This is the name of our database.
- Throughput: This is the amount of Throughput that we provision towards our Cosmos artifacts. I’ve written an article that goes into depth on what Throughput is in Cosmos.
- Container Id: This is the name for our collection.
- Partition Key: This will be the property that we partition our collections on. I’ve also written an article that goes into more depth on Partitioning.
Click OK to provision our new collection. After a short while, navigate to the Data Explorer in the Cosmos Portal and we should see our new Database and Collection.
So now we’ve provisioned our Storage account (Data Source location), a Databricks workspace (place to process our data) and a Cosmos DB collection (which will act as our sink).
We have one final step to complete before we get started. In order to access these resources, we have endpoints and access keys that we need to connect the pieces together. Ideally, we shouldn’t exposes these keys out to the public or to anyone else who might use them maliciously. So the final Azure resource we need to create is a Key Vault to protect our secrets.
Creating a Azure Key Vault and adding our secrets.
Using Azure Key Vault is a simple way to manage our secrets and our keys that we want to protect. Let’s set one up to store our secrets for our Storage Account and our Cosmos DB endpoint and access keys.
You know the drill by now! Head to the Azure Portal, click create a resource and search for Key Vault.
Give your key vault a name , add it to our existing resource group , stick it in a region near you and select a pricing tier. I’ve chosen the Standard tier for this tutorial.
Now that we have our Key Vault, we can add our secrets to it. Go into your Cosmos DB account and select keys from the sidebar. You should see the below screen (Without the terribly censored key values of course!). Copy the endpoint to your clipboard for now.
Once you’ve done that head back to your Key Vault and select Secrets from the Sidebar. Click on Genereate/Import to generate a new secret for our Cosmos endpoint
When you create a secret, you need to specify how you want to upload the secret, give it a name and add the value , here’s an example:
Click create to create the secret and you should see the secret in your secrets list.
Repeat this for your Cosmos DB Primary Key and your Azure Storage Primary Key.
You will also need to configure your Databricks workspace to use Azure Key Vault as a secrets store. The full guide on how to do this can be read here.
Sweet! We’ve set up all the required components we need. We now just need to do one last thing before we get started coding in our notebook.
Installing and Importing our libraries
Databricks comes with a variety of different libraries pre-installed on our clusters that we can use to work with our data. We can also install libraries onto our Clusters that we need.
In order to use the Apache Spark Connector for Cosmos DB in Python, we need to install pyDocmentDB , which is the client library for Cosmos DB for Python (At time of writing of course, the Cosmos DB team recently released v3 of their .NET client, which I believe is the start of a long processes of changing all the SDKs from DocumentDB (Old product name) to Cosmos DB (Shiny new product name)).
Installing libraries onto our cluster is pretty simple. We need to make sure that our cluster is running in order to do so. If it’s terminated, fire it up and once it’s running, click on the Libraries hyperlink in the right hand corner of the cluster div:
From here, you’ll be able to see which libraries have been installed on the cluster. As you can see, I’ve gone ahead and installed the pyDocumentDB library and the Apache Spark Connector for Cosmos DB uber jar on here already, but I’ll go through as to how you can install pyDocumentDB onto your cluster:
Click on Install New button and you should see the below dialog come up.
As you can see, there are a variety of different ways that we can install libraries onto our cluster. For our purposes, we’ll import it straight from PyPI. Type in pyDocumentDB into the Package textbox and click Install. Wait a while and it’ll be installed on your cluster!
Phew! We’ve done a lot for what should be a basic tutorial. If you’ve made it this far, you should be proud of yourself! You’ve set up all the components required for a basic data pipeline in the cloud!
Now that we’ve set everything up, let the coding begin!
Mount the Storage Account to Databricks
In order to start coding in Databricks, we’ll need to create a notebook. Select Workspace on the Databricks UI Sidebar and click the dropdown arrow next to the Workspace text. Click Create > Notebook to see the following dialog:
We have to give our notebook a name, a language and assign a cluster to it. We’ll be using Python for this tutorial, so use that. It will use the version that you assigned to the cluster when you created it, so in my case, we’ll be using 3.5.
Once that’s done, we can start adding code! (finally!). Let’s start by adding the necessary pyDocumentDB Library:
Next, we’ll mount our storage account to our notebook. This allows us to mount our container into Databricks and then use it as if it was a local file within our Databricks workspace:
In order to mount a storage account to our Databricks notebook, we need to define
- Our source system (In this case, it will be wasbs://firstname.lastname@example.org)
- Our mount point (we can name this whatever we like)
- Provide extra configs , which is essentially our key scope (Azure Key Vault) and the name of our key that we want to call (Look at your secrets list in Key Vault for this).
We can take a look inside our new mount point by using the DBFS API and the Display() function to see the file:
By executing this command, we can see the csv file in our mount point.
We’ve mounted the file, now let’s do some simple transformations.
Transforming our file into a dataset in Databricks
We’ve got a source file in our notebook, so we now have to transform it into a DataFrame in order to work with it.
For our purposes, we want to get a players rating, their height, what position they play, what their dominant foot is, their name and we want to filter on players who have a rating of 85 or higher.
Let’s turn our source file into a Spark DataFrame. Using the Spark API to do this is pretty simple. We just need to specify what form the file is in, whether or not we want to read a header, whether or not we should infer a schema on read and finally set the path from which we want to read the file from (This will be our mount point/filename.csv ). Let’s do this and apply the result to a DataFrame called footballPlayersDF.
Now that our file has become a DataFrame, we can start to apply some transformations on it. Let’s start by creating a new DataFrame called elitePlayersDF that only selects players from footballPlayersDF whose rating is equal or greater than 85/100.
We create a new DataFrame for each transformation because it makes parallel processing easier and it allows us to manipulate our distributed data in a safe manner. Let’s create a final DataFrame that only selects the columns we need for our final collection in Cosmos DB.
We can call the Display() function on our DataFrames to have a quick look at what our DataFrames looks like. Databricks will automatically limit the results to 1000 rows (if there are more records than this). Call Display() on our dataset and we should see something like this:
We’ve finished our transformations! Now we can use the Apache Spark connector for Cosmos DB to write our DataFrame to our collection.
Write out the result to Cosmos DB
In order to write our DataFrame to Cosmos DB, we need to provide the writeConfig object with our C osmos DB endpoint, Master Key , the name of which database we wish to write to, the name of the collection that we want to write our data to and we have to set the Upsert value to true.
Once the job has finished executing, navigate back to your Cosmos DB collection and click on Items inside your collection to view the persisted items in Cosmos DB! Pretty cool right?!
Deleting Resources and Wrap up!
There you have it! It’s not the most complex example, all we’ve done here is take a simple csv file, uploaded it to blob storage, read it in Azure Databricks, do some really basic filtering on it using the Spark API and then persisted the result to Cosmos DB using the Apache Spark Connector for Cosmos DB.
The point I’m hope I made is that using barely any lines of code and having a basic understanding of how to create Azure resources, we’ve built the foundations of building some amazing analytical data pipelines in Azure.
If you’ve been following along, make sure you clean up your resources once you’re done (just to save money!). To do this, go to your Azure Portal and click on Resource Groups in the Sidebar.
From here, click on the resource group where you’ve deployed all your resources to (Hopefully they’re all in the same one!).
Click on the ellipsis next to the resource group name and then select ‘ Delete resource group ’. You should see the following dialog pop up.
Just type in your resource group name to delete the resources and the group itself. Wait a bit and they should be gone.
That’s the end of this tutorial! Obviously if you’re building data pipelines in Azure for production scenarios, it’s more complex and sophisticated than this (No mention of a CI/CD pipeline here). But what I’m hoping you got out of this is a simple explanation of a potential workflow that you could use for your needs.
If you have any questions or comments, let me know! I’d love to hear them.