Okay, so this is probably a bit of a niche post but still.
Something I see a lot is people asking how to do things like run arbitrary SQL statements against SQL databases from Databricks, or how to execute stored procedures. Often this leads to the same follow-ups on installing PyODBC along with the Linux ODBC drivers, adding Microsoft's package repo, accepting EULAs, etc...
It works, don't get me wrong, but it's a faff, especially if you don't have permission to do those things. Plus Microsoft's package repo has had... issues in the past, and suddenly production jobs fail because they can't connect. I've posted about this before, so there are ways around it, but still, it's a faff.
So, what if you want to connect to SQL and installing PyODBC isn't an option?
Yep, you can do this using only what's already available, and pretty easily, all thanks to Py4J. Py4J matters because Spark runs on the JVM, so how do all of those PySpark calls get executed? They call the Java/Scala methods under the hood through Py4J. And this works for more than just Spark's own classes.
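To see what I mean, you can poke at any JVM class from a notebook. A quick sketch (note that `spark._sc._gateway` is an internal PySpark attribute, so this relies on implementation details that could change):

```python
# Grab the Py4J gateway that PySpark already maintains (internal API).
jvm = spark._sc._gateway.jvm

# Call a plain Java method - no Spark involved at all.
print(jvm.java.lang.System.getProperty("java.version"))

# Java objects work too; Py4J proxies them back to Python.
arr = jvm.java.util.ArrayList()
arr.add("hello from the JVM")
print(arr.get(0))
```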
Databricks ships with the Microsoft SQL JDBC library (along with others, such as the PostgreSQL driver) to make things like the Azure SQL connector work, and we can access these libraries from Python.
To keep things as secure as possible I'm going to show how to do this using a service principal, but it works just as well with SQL-based authentication.
So first I'm going to assume some things.
- You have a SQL database you can connect to
- A service principal (or SQL account) exists which has permissions
- You have access to Databricks
- The credentials are in a secret scope (if not then why not!)
The first thing we need is a reference to the `SQLServerDataSource` class from the JVM:

```python
SQLServerDataSource = spark._sc._gateway.jvm.com.microsoft.sqlserver.jdbc.SQLServerDataSource
```
And that's the magic line which lets us access something in the JVM from Python. So we now have a Python variable which references this class. Now we can use it.
```python
client_id = dbutils.secrets.get(secret_scope, '<secret name>')
client_secret = dbutils.secrets.get(secret_scope, '<secret name>')

datasource = SQLServerDataSource()
```
Here we are getting the service principal's application ID and client secret from our secret scope. Then we're creating an instance of `SQLServerDataSource` and configuring it to connect to our database using AAD-based authentication (for other options, see the connection string settings documentation).
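The configuration step might look something like this. The server and database names here are placeholders, and I'm assuming service principal authentication; check the driver's documentation for the exact setting names for your scenario:

```python
# Point the data source at the server and database (placeholder names).
datasource.setServerName("myserver.database.windows.net")
datasource.setDatabaseName("MyDatabase")

# Authenticate as the service principal: the application id goes in as
# the user name and the client secret as the password.
datasource.setAuthentication("ActiveDirectoryServicePrincipal")
datasource.setUser(client_id)
datasource.setPassword(client_secret)
```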
Now, we're ready to execute something. So let's do a simple query to get the list of users.
```python
connection = datasource.getConnection()
statement = connection.createStatement()

results = statement.executeQuery('SELECT name FROM sysusers')
```
So, we get a connection from the data source, get a statement object, execute a query, and then iterate on the results.
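The iteration itself is standard JDBC: `next()` advances the cursor and returns `False` when there are no more rows, and the column getters pull values from the current row. Something like:

```python
# A ResultSet starts *before* the first row; next() advances it and
# returns False once the rows run out.
while results.next():
    # Columns can be read by name or by 1-based index.
    print(results.getString("name"))
```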
And that's pretty much it!
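One thing worth doing is closing things when you're finished, since these are real JDBC resources living in the driver's JVM rather than Python objects that get tidied up promptly. A sketch of the same query with cleanup:

```python
connection = datasource.getConnection()
try:
    statement = connection.createStatement()
    try:
        results = statement.executeQuery("SELECT name FROM sysusers")
        while results.next():
            print(results.getString("name"))
    finally:
        statement.close()  # closing a Statement also closes its ResultSet
finally:
    connection.close()
```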
There are other methods which allow you to prepare statements that take parameters in a secure way, and you can use the `execute` method instead if you're not expecting results (such as when calling a stored procedure). Or maybe you want to make sure a database schema exists before writing your DataFrame to SQL.
```python
statement.execute("IF NOT EXISTS (SELECT 1 FROM sys.schemas WHERE [name] = 'MyCoolSchema') BEGIN EXEC('CREATE SCHEMA [MyCoolSchema]') END")
```
There are samples of how to do this in the data source sample documentation.
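For completeness, here's a sketch of the parameterised route using `prepareStatement`, which is standard JDBC. The `?` placeholders are filled in by the driver, so values are never concatenated into the SQL text (the query and value here are just illustrative):

```python
# Prepare a statement with a placeholder; the driver handles escaping.
prepared = connection.prepareStatement(
    "SELECT name FROM sysusers WHERE name = ?"
)
prepared.setString(1, "dbo")  # JDBC parameters are 1-based

results = prepared.executeQuery()
while results.next():
    print(results.getString("name"))
prepared.close()
```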
But we haven't had to install any new drivers or packages to make this happen.
A word of warning, though: don't go crazy with this. Spark is a big data platform built for processing huge amounts of data; it's not intended for making lots of small queries, and doing so can reduce connection availability for Spark's own JDBC operations. But as with anything, if you use the tools the way they're intended to be used then you shouldn't have any issues.