Peter Davis

Switching from SQL Server to Azure CosmosDB

For the past 20 years or so I’ve been developing solutions using relational databases, primarily SQL Server, though I’ve also used Oracle and MySQL.

As I began to plan out development for Panache Sports my initial thoughts were to do what I’ve always done and use a relational database to store all my data. I’d normally look at using MySQL to keep costs down, but with the resources provided by Microsoft Startups the option to use a managed Azure SQL database was also available.

This would have been the easiest route to go for, but as I looked at the data I’d be storing for Panache Sports, and how I was building it using serverless Azure functions, I wondered if there was another way to do things.

Enter Azure Cosmos DB.

With Panache Sports being designed as a serverless SaaS (Software as a Service) solution, being able to take advantage of a highly available, scalable database like Cosmos DB was very appealing. But there was a big downside: throwing away 20 years of relational database modelling to move to a NoSQL, unstructured database was going to mean a significant learning curve.

So, if you’ve never looked at Cosmos DB, or have wondered if it might be applicable to your use case, here’s my introduction, including some of the decisions around why I chose to go down this route.

Data Decisions

Let’s first look at how I would have structured my data using a relational database. This will be a very high-level, simplified example, but it should cover some of the key concepts.

Panache Sports is designed to store data about sports organisations and the staff and players of those organisations. So let’s think of a simple example structure for storing that information.

Example Panache Sports database structure

We’ve approached things in the normal relational DB way. We’ve normalised our data so that we have tables storing specific chunks of information. We’re using header and detail tables to split the data we store for users and organisations, and tables like Address are shared between users, organisations and players so we keep that data in a single, consistent location.

For organisations, we’re storing the type of sport (football, baseball, Formula One, etc.), but we link to a separate table via an Id so that we can add new sport types and change names and descriptions without changing anything on our organisation records, knowing we’ll always get the latest data.
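
To make the relational example concrete, here’s a rough sketch of how those normalised entities might look as C# classes. The class and property names are my own assumptions for illustration, not the actual Panache Sports schema.

```csharp
// Hypothetical, simplified entities mirroring the normalised relational design:
// each table references related data by Id rather than embedding it.
public class OrganisationHeader
{
    public int OrganisationId { get; set; }
    public string Name { get; set; }
    public int SportTypeId { get; set; }   // FK to SportType
    public int AddressId { get; set; }     // FK to the shared Address table
}

public class OrganisationDetail
{
    public int OrganisationId { get; set; } // FK to OrganisationHeader
    public string Website { get; set; }
    public string PhoneNumber { get; set; }
}

public class SportType
{
    public int SportTypeId { get; set; }
    public string Name { get; set; }        // e.g. "Formula One"
    public string Description { get; set; }
}

public class Address
{
    public int AddressId { get; set; }
    public string Line1 { get; set; }
    public string City { get; set; }
    public string Country { get; set; }
}
```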

It all looks fine and normal. You’d probably make changes, but this is just a simple example.

So how would we move this to Cosmos DB?

Moving our relational structure to Cosmos DB

Cosmos DB, using its NoSQL API (which is what I’ll be discussing here), stores unstructured data in JSON format. So rather than creating rigid table structures, you can store any JSON document you want in a Cosmos DB container, even JSON documents with completely different structures in the same container.

At a very high level Cosmos DB consists of Databases, Containers and Items.

Simplified Cosmos DB structure

Databases contain a group of Containers and then each container stores items.

Containers can have triggers, stored procedures and user-defined functions which allow you to maintain the items they store; these run inside Cosmos DB and are written in JavaScript. You can also react to changes in a container from fully functional .NET Azure Functions via the change feed.

Items within a container are JSON documents, and individual documents within the same container can have completely different structures. However, you must define a partition key for the container, and every item must include that partition key property, even if the rest of its data is different.

The partition key is simply a property on your JSON items that defines how data is logically grouped within a container; Cosmos DB then distributes that data across physical partitions automatically. The partition key is used when writing and updating items, and it can also be used in “where” clauses to provide efficient data retrieval.

A logical partition can store up to 20 GB of data, so making decisions on how you partition your data is important.
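
As a minimal sketch of how this hierarchy looks with the .NET SDK (Microsoft.Azure.Cosmos), assuming a hypothetical “panache” database and an “organisations” container partitioned on /organisationId:

```csharp
using Microsoft.Azure.Cosmos;

CosmosClient client = new CosmosClient(
    Environment.GetEnvironmentVariable("COSMOS_CONNECTION_STRING"));

// Database -> Container -> Items hierarchy.
Database database = await client.CreateDatabaseIfNotExistsAsync("panache");
Container container = await database.CreateContainerIfNotExistsAsync(
    id: "organisations",
    partitionKeyPath: "/organisationId");

// Two items with completely different shapes can live in the same container,
// as long as each carries the partition key property (organisationId here).
await container.CreateItemAsync(new
{
    id = "org-001",
    organisationId = "org-001",
    name = "Example Racing Team"
}, new PartitionKey("org-001"));

await container.CreateItemAsync(new
{
    id = "note-001",
    organisationId = "org-001",
    text = "A differently shaped document in the same container"
}, new PartitionKey("org-001"));
```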

I’ll probably create a separate post that covers containers and partitions but for more background take a look at https://learn.microsoft.com/en-us/azure/cosmos-db/resource-model

Cosmos DB allows for geo-replication of data, so you can make sure that data is always accessible in a physical location close to your users; however, the cost increases significantly with this.

In Cosmos DB, RUs are what matter. Rather than basing pricing tiers on storage space or compute, Cosmos DB pricing is based on RUs (Request Units) per second.

One RU is the cost of a point read of a 1 KB item in Cosmos DB. So if we need to read 20 KB of data, that’s going to be around 20 RUs.
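
To see what a read actually costs, the SDK reports the charge on every response. A minimal sketch, assuming the hypothetical container from the earlier snippet and an item whose id and partition key are both "org-001":

```csharp
// Point read: fetch a single item directly by id and partition key.
ItemResponse<dynamic> response = await container.ReadItemAsync<dynamic>(
    "org-001", new PartitionKey("org-001"));

// RequestCharge reports how many RUs the operation consumed
// (roughly 1 RU for a 1 KB item).
Console.WriteLine($"Read cost: {response.RequestCharge} RUs");
```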

You can get started with the free tier of Cosmos DB that provides 1000 RUs per second, and If you want a good introduction to RUs check out this Cosmos DB essentials episode https://www.youtube.com/watch?v=3naCwuXhLlk

Although we could build our Cosmos DB data structure in exactly the same way as we built our relational data structure, we’d quickly start hitting issues with our RU charge increasing.

Microsoft provides a really good example of how you might model data here https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/model-partition-example and I suggest having a read through this.

But let’s take a look at our original data. If we want to get a complete organisation record we need to query four separate tables, and mirroring that structure in Cosmos DB begins to significantly increase our RUs.

Tables to be modelled in Cosmos DB

If I split the same data up in Cosmos DB I’d have four items each holding part of the data I need. When looking up an organisation I’d need to look up and return four objects and then reconstruct the data, which greatly increases the RUs required to read a single organisation record.

For Cosmos DB, I’m not so worried about the size of the database, as storage costs keep getting smaller and smaller; instead I want to limit the number of lookups I’m performing.

So how about this for storing our data:

Example Organisation record
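
As a rough illustration of the idea (the field names and values below are my own made-up examples, not the exact Panache Sports schema), a single denormalised organisation document might be created like this, using the same hypothetical container as before:

```csharp
// One self-contained organisation document: the sport, address and staff
// are embedded rather than referenced by Id in other tables.
var organisation = new
{
    id = "org-001",
    organisationId = "org-001",
    name = "Example Racing Team",
    sport = new { id = "sport-f1", name = "Formula One" },
    address = new { line1 = "1 Paddock Way", city = "Silverstone", country = "UK" },
    staff = new[]
    {
        new { id = "user-001", name = "Jane Smith", role = "Team Principal" },
        new { id = "user-002", name = "John Jones", role = "Race Engineer" }
    }
};

await container.UpsertItemAsync(organisation, new PartitionKey(organisation.organisationId));
```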

Now if we want to get an organisation record we only need one lookup, and because the JSON document is small its size isn’t really a factor.

If we have an API frontend to our system then the API call to get an organisation returns the data exactly as it is stored in Cosmos DB; I don’t need to use separate classes/objects for my database and my APIs.
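
For example, a minimal HTTP-triggered function (sketched here with the in-process Azure Functions model and hypothetical names) can hand the stored document straight back to the caller:

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.Cosmos;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class GetOrganisation
{
    // Shared client, created once per host instance.
    private static readonly CosmosClient cosmosClient = new CosmosClient(
        System.Environment.GetEnvironmentVariable("COSMOS_CONNECTION_STRING"));

    [FunctionName("GetOrganisation")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "get", Route = "organisations/{id}")] HttpRequest req,
        string id)
    {
        Container container = cosmosClient.GetContainer("panache", "organisations");

        // Point read, assuming the organisation id is also the partition key value.
        ItemResponse<dynamic> response =
            await container.ReadItemAsync<dynamic>(id, new PartitionKey(id));

        // Return the document exactly as stored - no separate DTO mapping layer.
        return new OkObjectResult(response.Resource);
    }
}
```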

What’s not to like right?

….Oh dear, I can hear a collective sigh being let out by all the DB Admins I’ve ever worked with.

For one, what about the sport type? If “Formula One” were to change its name to “Formula Super One” I’d have to update every record that was using it, instead of a single record.

If I want to find all the addresses I have in the system I’ll need to look through every organisation, user and player record, rather than querying a single table.

Thinking differently

Because I’m trying to avoid additional lookups wherever possible I’m going to end up duplicating data across my records, and this is where, in Cosmos DB, we need to throw out everything we’ve previously learnt about normalising our data and instead make decisions based on how frequently the data will be read and updated.

In my organisation record above, yes, if I have 20 teams in the system and “Formula One” changes its name to “Formula Super One” I’m going to have to change 20 records rather than just one. But how often will that happen? It’s very unlikely that data like that will change, so I’ll make the decision to take the hit on a large multi-record update at some point in the future, knowing that I’m saving on reads and speed of access on a day-to-day basis.
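
If that rename ever does happen, the occasional bulk fix-up might look something like this sketch (same hypothetical container and document shape as before); working on the raw JSON means fields we haven’t modelled are preserved:

```csharp
using Newtonsoft.Json.Linq;

// Find every organisation document that embeds the old sport entry.
QueryDefinition query = new QueryDefinition(
        "SELECT * FROM c WHERE c.sport.id = @sportId")
    .WithParameter("@sportId", "sport-f1");

FeedIterator<JObject> iterator = container.GetItemQueryIterator<JObject>(query);
while (iterator.HasMoreResults)
{
    foreach (JObject doc in await iterator.ReadNextAsync())
    {
        doc["sport"]["name"] = "Formula Super One";

        await container.ReplaceItemAsync(
            doc,
            doc["id"].ToString(),
            new PartitionKey(doc["organisationId"].ToString()));
    }
}
```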

In Cosmos DB you want to de-normalise your data, storing duplicates of information in records to speed up reads if you know the likelihood of that data changing is small. We’ll probably be okay with storing duplicate addresses across multiple records because we know that the address of a team’s stadium or head office isn’t likely to change all that often.

What if I wanted to get a list of my organisations? One way of doing that would be to add a document type and then store a separate list of organisations, something like this:

Organisation record with 'Type' property added

Here I’ve added a type of “organisation” to my record and I can store another record with a completely different structure in the same Cosmos DB container with a type of “organisationlist”:

Organisation list record
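
To illustrate the idea (again, the field names are my own assumptions), the summary document might be written alongside the full organisation records like this:

```csharp
// A single "organisationlist" document living in the same container as the full
// organisation documents (which carry type = "organisation"), holding just the
// summary fields needed for a list view.
var organisationList = new
{
    id = "organisationlist",
    organisationId = "organisationlist",   // its own partition key value
    type = "organisationlist",
    organisations = new[]
    {
        new { id = "org-001", name = "Example Racing Team", sport = "Formula One" },
        new { id = "org-002", name = "Example Football Club", sport = "Football" }
    }
};

await container.UpsertItemAsync(organisationList, new PartitionKey(organisationList.organisationId));
```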

Now, if I want to display a list of organisations I look for a document with the type of “organisationlist” and return a single record with summary data, rather than querying for multiple organisations. If I then want to get the full organisation I’ve got the Id in this list so I can perform a direct lookup to get it.

Instead of doing a lookup that returns 20 organisation records we will return 1 list record, and if we need more information we can perform a second lookup as required.
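
A sketch of that two-step read, with the same hypothetical names: fetch the single list document first, then point read the full record only when it’s needed.

```csharp
// 1. Fetch the single summary document for the list view.
QueryDefinition listQuery = new QueryDefinition(
        "SELECT * FROM c WHERE c.type = @type")
    .WithParameter("@type", "organisationlist");

FeedIterator<dynamic> listIterator = container.GetItemQueryIterator<dynamic>(listQuery);
if (listIterator.HasMoreResults)
{
    FeedResponse<dynamic> listPage = await listIterator.ReadNextAsync();
    Console.WriteLine($"List read cost: {listPage.RequestCharge} RUs");
}

// 2. When the user opens an organisation, point read the full record by its id.
ItemResponse<dynamic> org = await container.ReadItemAsync<dynamic>(
    "org-001", new PartitionKey("org-001"));
```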

Again, we’re duplicating data between the master record and the list record, and we have to keep these in sync, but we’re doing that with data we know isn’t going to change all that often so we’re prioritising the read operations for speed and cost.

This de-normalisation goes against everything we try to achieve in a relational database, and how I’ve learnt to structure things over the past 20 years, so it really takes some getting used to.

At a basic level, we’re structuring our data to prioritise how it’s going to be accessed. If we know it won’t change often but will be read frequently, we’ll prioritise reading that data and duplicate information across records, whereas if we know that the data will be updated frequently we may decide to normalise it as we would have done in a relational database and prioritise the writes.

In Cosmos DB the storage cost isn’t really the concern, the RUs are, and so how you structure your data will depend on how you think it will be accessed.

I’ve worked a lot with large ERP systems that store vast amounts of transactional data, like invoices, purchase orders and wages, with hundreds if not thousands of records moving through the system daily. For that data a relational database may make sense, where you avoid duplication of data and update small chunks of information rather than large records. But for Panache Sports we’re dealing with organisation and player/staff data, which is unlikely to change on a day-to-day or even week-to-week basis. As such, we can use Cosmos DB to efficiently store and retrieve our information frequently, and only need to worry about updates every now and then.

I’ll look to cover some further topics around Cosmos DB, like how you structure containers and their partition keys, which are a crucial part of organising your data appropriately. For now, I hope this has provided a good introduction to how you need to think differently about structuring your data if you’re considering a move from a relational database to Cosmos DB.

Like everything, Cosmos DB isn’t appropriate for every situation, or all types of data, but it may work for you in specific areas.

Pete

Top comments (2)

HlexNC

This was an interesting read. However, I am having trouble imagining that duplicating data would reduce search time. Doesn't this also increase the size of your DB by 20x, if you have 20 duplicates? Plus, don't DBs such as SQL Server and CosmosDB sort the data when the values are added in?

Peter Davis

It depends on how you are searching for your data. CosmosDB is designed for large data volumes and if you are referencing an item by its ID this performs a point read and will always be fast, regardless of the number of items in the database.

This is why it's crucial to look at how you are going to be accessing your data and attempt to store it in a way that takes advantage of this.

...Of course there are exceptions here: CosmosDB is not really aimed at storing large documents, so a point read of a large document will potentially be more costly than a query bringing together smaller documents.

It's a complex juggling act to reduce the reads by duplicating the data (if your application will be read heavy), but then also keep the size of the individual documents down. It's definitely not suited to all use cases!