
Peter Rombouts for Sogeti

Originally published at peterrombouts.nl

Real-World Partitioning in CosmosDb

Azure Cosmos DB is a globally distributed, multi-model database service designed to help you achieve fast, predictable performance. It scales seamlessly along with your application as it grows.

The documentation explains how it works and what to do in common scenarios. But in my work as a Cloud Architect, I sometimes need to dig deeper and figure out whether my solution scales for a particular case.

For example, a partition has a maximum size of 10 GB, so it is very important to understand how partitioning works, and how your partition size grows over time.

This blog will explain, using a recent real-world scenario, how you can determine your partition key and plan for scale.

Please note that your scenario may differ significantly, but there are some takeaways in this blog that could help you when designing your own solution.

The case

In my scenario, I’m pulling in data from telemetry sensors. These sensors report usage readings every 15 minutes. One day’s worth of data is roughly 10 KB in size, all in JSON format.

A sample piece of the JSON, showing 4 of the 96 data points in total:

https://gist.github.com/prombouts/eaa8bcf5887cdf1ab9d6d35164c73d79.js
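The gist above holds the real document; as a rough sketch of its shape (the field names here are illustrative assumptions, not taken from the gist):

```python
# Hypothetical raw daily document: 96 quarter-hour readings, 4 shown.
raw_day = {
    "id": "sensor-42_2016-01-01",
    "sensorId": "sensor-42",
    "date": "2016-01-01",
    "readings": [
        {"time": "00:00", "value": 1.2},
        {"time": "00:15", "value": 0.9},
        {"time": "00:30", "value": 1.1},
        {"time": "00:45", "value": 1.4},
        # ... 92 more quarter-hour entries
    ],
}
```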

The consumers of my data (they call an API, which in turn queries CosmosDb) have a couple of requirements:

  1. Get data per month for a set of sensors (max 20): aggregated totals for all days
  2. Get data per day for a set of sensors (max 20): aggregated totals per day, for a complete month

Determine partition key and size

In this scenario we should investigate what the proper partition key should be. Because all calls are month-based, a reasonable key is the month. CosmosDb is very fast if you query only a single partition, and it will return your documents with low latency.
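As a minimal sketch of what such a single-partition query could look like with the Python SDK (the account details, database and container names, and field names are assumptions for illustration, not from the original setup):

```python
from azure.cosmos import CosmosClient

# Placeholder connection details.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("telemetry").get_container_client("readings")

# Restricting the query to one partition key value ('201601') keeps this a
# fast single-partition query; ARRAY_CONTAINS filters to the requested sensors.
items = container.query_items(
    query="SELECT * FROM c WHERE ARRAY_CONTAINS(@ids, c.sensorId)",
    parameters=[{"name": "@ids", "value": ["sensor-1", "sensor-2"]}],
    partition_key="201601",
)
for doc in items:
    print(doc["sensorId"])
```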

In our case, a partition would then be ‘201601’ for January ’16, ‘201602’ for February, and so forth.

What happens if we simply put all the data, ingested as quarter-hourly JSON at 10 KB a day, into this scheme? For the sake of argument we take 31 days per month (the maximum number of days in a month). The total per sensor per month will then be 310 KB.

We know a partition is 10 GB max, so if we divide that by 310 KB, we can store up to 33,825 unique sensors in a partition. In some cases that is fine, but in my scenario I have up to 250,000 sensors…
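A quick back-of-the-envelope check of that arithmetic:

```python
# Partition limit and per-sensor monthly size from the scenario above.
partition_limit_kb = 10 * 1024 * 1024  # 10 GB expressed in KB
per_sensor_month_kb = 31 * 10          # 31 days x ~10 KB/day = 310 KB

print(partition_limit_kb // per_sensor_month_kb)  # 33825 sensors per partition
```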

Optimizing your collection

Optimization can be done by any of the following means:

  1. Choosing another partition key
  2. Reducing the size of the JSON document
  3. Saving aggregated data instead of all raw values

In my case, the most reasonable solution was to save aggregated data. The raw data is still saved as well, albeit to another collection or blob storage, for future reference.

By aggregating the data, the document now looks like this:

A sample piece of the JSON, showing 4 of the 31 daily data points in total:

https://gist.github.com/prombouts/8d6126556ef2ee61adf006870b0861cd.js
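Again, the gist holds the real document; a rough sketch of the aggregated shape (field names are illustrative assumptions) could be:

```python
# Hypothetical aggregated monthly document: one total per day, 4 of 31 shown.
aggregated_month = {
    "id": "sensor-42_201601",
    "sensorId": "sensor-42",
    "month": "201601",  # also the partition key
    "dailyTotals": [
        {"day": "2016-01-01", "total": 96.5},
        {"day": "2016-01-02", "total": 88.1},
        {"day": "2016-01-03", "total": 91.7},
        {"day": "2016-01-04", "total": 87.4},
        # ... 27 more daily totals
    ],
    "monthlyTotal": 2764.9,
}
```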

This aggregation brings the total size of a document for one month down to approximately 5 KB. If we divide 10 GB by 5 KB, we see we can fit roughly 2 million unique sensors per partition. This easily covers the quarter million required sensors, and the solution remains fast and scalable as more sensors are added to the system. Secondly, 5 KB is obviously served faster over the internet than 310 KB, which is another big advantage when you want to retrieve a lot of monthly sensor data at once.
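The same arithmetic, re-run with the aggregated document size:

```python
partition_limit_kb = 10 * 1024 * 1024  # 10 GB in KB
aggregated_doc_kb = 5                  # one sensor, one month, aggregated

print(partition_limit_kb // aggregated_doc_kb)  # 2097152 -> ~2 million sensors
```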

Conclusion

CosmosDb is great for storing very large amounts of data. But this does not mean you can stuff anything into your collections without thinking about the implications, and without designing a proper solution architecture.

  1. Always make sure your data fits into your collection, and keep the maximum partition size in mind.
  2. Keep track of how your data is being queried.
    1. If for some reason querying based on year instead of month is the new default, it could be useful to refactor your solution or create an extra collection.
  3. It can be useful to put an API in front of your data, so you create a facade and your consumers do not need to know the inner workings of your setup (see the sketch after this list).
    1. If the collection or documents change, your API does not have to!
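As a minimal illustration of that facade idea (all names here are hypothetical): consumers depend only on a stable function or endpoint, while the storage details stay private.

```python
# Hypothetical facade: consumers call this stable function (or an HTTP
# endpoint wrapping it) and never see the container layout behind it.
def get_monthly_totals(container, sensor_ids: list[str], month: str) -> list[dict]:
    """Aggregated monthly totals for a set of sensors (max 20)."""
    # If the collection or document shape changes, only this body changes;
    # the consumer-facing signature stays the same.
    return list(container.query_items(
        query="SELECT c.sensorId, c.monthlyTotal FROM c "
              "WHERE ARRAY_CONTAINS(@ids, c.sensorId)",
        parameters=[{"name": "@ids", "value": sensor_ids}],
        partition_key=month,
    ))
```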
