Cosmos DB and heterogeneous data

#azure #cosmosdb #nosql #data

A selection of different watches. They all tell the time, but some are analogue, some are digital, some are branded, and some are not. (Same, but different.)

Cosmos DB, in common with other NoSQL databases, is schema-free. In other words, it doesn’t validate the shape of incoming data by default. This is a feature, not a bug. But it’s a dramatic change in thinking, akin to moving to a dynamically typed language from a statically typed one (and not, as it might first appear, moving from a strongly typed to a weakly typed one).

For those of us coming from a SQL or OO background, it’s tempting to use objects, possibly nested, to represent and validate the data, and hence encourage all the data within a collection to have the same structure (give or take some optional fields). This works, but it doesn’t provide all the benefits of moving away from a structured database. And it inherits from classic ORMs the migration problem when the objects and schema need to change. It can very easily lead to a fragile big-bang deployment.

For those of us who are used to dynamic languages, and are comfortable with Python’s duck typing or the optional-by-default sparse mapping needed to consume continuously-versioned JSON-based RESTful services, there’s an obvious alternative: be generous in what you accept.

If I have a smart home, packed with sensors, I could create a subset of core data with time, sensor identifier and a warning flag. So long as the website knows whether that identifier is a smoke alarm or a thermostat, it can alert the user appropriately. But on top of that, the smoke alarm can store particle count, battery level, mains power status and a flag for test mode enabled, and the thermostat can have a temperature value, current programme state, boiler status and so on, with both tied into the same stream.
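As a minimal sketch of that shape: every field name, identifier and value below is made up for illustration, as is the `SENSOR_KINDS` lookup. The point is only that the website reads the shared core fields and ignores whatever else each device chose to send.

```python
# Hypothetical documents from one stream: same core fields, different extras.
readings = [
    {
        "time": "2019-04-27T18:00:00Z",
        "sensorId": "smoke-hall",
        "warning": True,
        "particleCount": 412,
        "batteryLevel": 0.83,
        "mainsPower": True,
        "testMode": False,
    },
    {
        "time": "2019-04-27T18:00:05Z",
        "sensorId": "thermostat-lounge",
        "warning": False,
        "temperature": 20.5,
        "programme": "evening",
        "boilerOn": True,
    },
]

# Hypothetical lookup the website keeps so it knows what each identifier is.
SENSOR_KINDS = {"smoke-hall": "smoke alarm", "thermostat-lounge": "thermostat"}

CORE_FIELDS = ("time", "sensorId", "warning")


def core(reading):
    """Pick out only the shared fields; extra fields are ignored, not rejected."""
    return {name: reading[name] for name in CORE_FIELDS}


def alerts(docs):
    """Build one alert message per reading whose warning flag is set."""
    messages = []
    for doc in docs:
        c = core(doc)
        if c["warning"]:
            kind = SENSOR_KINDS.get(c["sensorId"], "unknown sensor")
            messages.append(f"{c['time']}: warning from {kind} ({c['sensorId']})")
    return messages
```

Adding a new kind of sensor later means adding new extras, not changing `core` or `alerts`.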

Why would I want to do this?

Versioning

Keep historic and current data from a device or user in one place, recorded exactly as it was delivered (so that you can tweak the algorithm to fix that time-drift bug), rather than reformatting all your historical data when you know only a small subset will ever be read again.
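One way this could look, as a sketch under assumptions: suppose an older firmware stored epoch seconds in a `ts` field from a clock that ran fast, and a newer firmware stores an ISO 8601 string in `time`. Both field names, the `firmware` marker and the 90-second drift are hypothetical. The stored documents are left untouched and the correction lives in the reader.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical documents, kept exactly as each firmware version delivered them.
stored = [
    {"sensorId": "thermostat-lounge", "firmware": 1,
     "ts": 1556388000, "temperature": 20.0},
    {"sensorId": "thermostat-lounge", "firmware": 2,
     "time": "2019-04-27T18:05:00+00:00", "temperature": 20.5},
]

# Assumed correction for the old clock; change it here, not in the stored data.
FIRMWARE_1_DRIFT = timedelta(seconds=90)


def reading_time(doc):
    """Return a timezone-aware timestamp whichever version wrote the document."""
    if "time" in doc:
        # Newer shape: ISO 8601 string with an explicit offset.
        return datetime.fromisoformat(doc["time"])
    # Older shape: epoch seconds from a clock that ran fast.
    return datetime.fromtimestamp(doc["ts"], tz=timezone.utc) - FIRMWARE_1_DRIFT
```

If the drift estimate turns out to be wrong, only `FIRMWARE_1_DRIFT` changes; nothing is rewritten in the collection.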

Data siblings

Take all the similar data together for unified analysis, such as multiple thermostat models with the same base properties but different configurations. This allows you to generate a temperature trend across devices, even as the sensors change, even if the sensors all come from different manufacturers, and across anything else with a temperature sensor.
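A rough illustration of the duck-typed approach, with invented documents and field names: anything that carries a numeric `temperature` counts, whatever else it is.

```python
from statistics import mean

# Hypothetical mixed documents; only some of them carry a temperature.
readings = [
    {"sensorId": "thermostat-a", "temperature": 20.0, "programme": "evening"},
    {"sensorId": "thermostat-b", "temperature": 22.0, "boilerOn": True},
    {"sensorId": "smoke-hall", "particleCount": 3},
    {"sensorId": "weather-station", "temperature": 12.0, "humidity": 0.6},
]


def temperatures(docs):
    """Collect temperature values from any document that has a numeric one."""
    return [
        doc["temperature"]
        for doc in docs
        if isinstance(doc.get("temperature"), (int, float))
    ]


average = mean(temperatures(readings))  # averages the three documents with a temperature
```

The same filter can be pushed into the database; in Cosmos DB's SQL API, `IS_DEFINED(c.temperature)` in a `WHERE` clause is one way to select only the documents that have the property.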

Co-location

If you’re making good use of Cosmos DB partitions, you may want to keep certain data within one partition to optimise queries: for example, a customer, all of their devices, and aggregated summaries of their activity. You can do this by partitioning on the customer id and collecting the different types of data into one collection.
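To make the layout concrete, here is an in-memory sketch rather than real SDK code. It assumes a hypothetical `customerId` property as the partition key and a hypothetical `type` property to tell the document kinds apart; `by_partition` only imitates how the database would group the documents.

```python
from collections import defaultdict

# Hypothetical documents sharing one collection, partitioned on "customerId".
documents = [
    {"id": "c-1", "customerId": "c-1", "type": "customer", "name": "Alex"},
    {"id": "d-1", "customerId": "c-1", "type": "device", "kind": "thermostat"},
    {"id": "d-2", "customerId": "c-1", "type": "device", "kind": "smoke alarm"},
    {"id": "s-2019-04", "customerId": "c-1", "type": "summary",
     "month": "2019-04", "alerts": 2},
    {"id": "c-2", "customerId": "c-2", "type": "customer", "name": "Sam"},
]


def by_partition(docs, key="customerId"):
    """Group documents by partition key value, as the database would."""
    partitions = defaultdict(list)
    for doc in docs:
        partitions[doc[key]].append(doc)
    return partitions


def customer_view(docs, customer_id):
    """Everything about one customer, read from a single partition."""
    view = defaultdict(list)
    for doc in by_partition(docs)[customer_id]:
        view[doc["type"]].append(doc)
    return dict(view)
```

Calling `customer_view(documents, "c-1")` returns the customer record, both devices and the monthly summary without looking at any other customer's partition.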

Conclusion

NoSQL is not 3NF, so throw out those textbooks and start thinking of data as more dynamic and freeform. You can still enforce structure if you want to, but think about whether you’re causing yourself pain further down the road.

Check out @craignicol’s Tweet: https://twitter.com/craignicol/status/1122224379658633217?s=09